May 17

Add to Calendar 2017-05-17 11:30:00 2017-05-17 13:00:00 America/New_York Update on Computational Approaches to Autism Research Isaac Kohane, MD, PhD is the inaugural Chair of the Department of Biomedical Informatics at Harvard Medical School. He develops and applies computational techniques to address disease at multiple scales: from whole healthcare systems as “living laboratories” to the functional genomics of neurodevelopment with a focus on autism. Kohane’s i2b2 project is currently deployed internationally to over 120 major academic health centers to drive discovery research in disease and pharmacovigilance (including providing evidence on drugs which ultimately contributed to “black box”ing by the FDA). Dr. Kohane has published several hundred papers in the medical literature and authored a widely used book on Microarrays for an Integrative Genomics. He is a member of the Institute of Medicine and the American Society for Clinical Investigation. Stata 32-G575

May 10

Add to Calendar 2017-05-10 11:30:00 2017-05-10 13:00:00 America/New_York Genomic Databases and Genomic Privacy: Can we have the best of both worlds? Widespread scientific curiosity and the possibility of medical benefits make it nearly impossible to justify keeping genomic data under lock and key. At the same time, there has been increased concern about the private information leaked by de-identified genomic databases. What is the right middle ground? In the first half of this talk, I will present some ongoing work investigating privacy risks in a publicly available genomic database, and possible approaches to help mitigate these concerns. The second half of the talk will focus on how ideas from Bayesian statistics and the statistical disclosure control literature might help turn these mitigation techniques into quantitative guarantees. Stata 32-G575

May 03

Add to Calendar 2017-05-03 11:30:00 2017-05-03 13:00:00 America/New_York Electronic medical record phenotyping using the anchor and learn framework Electronic medical records (EMRs) hold a tremendous amount of information about patients that is relevant to determining the optimal approach to patient care. As medicine becomes increasingly precise, a patient’s electronic medical record phenotype will play an important role in triggering clinical decision support systems that can deliver personalized recommendations in real time. In this talk, I introduce our recently developed "anchor and learn" framework for efficiently learning statistically driven phenotypes with minimal manual intervention. Using this approach, we developed a phenotype library that uses both structured and unstructured data from the EMR to represent patients for real-time clinical decision support. The resulting phenotypes are interpretable and fast to build. Evaluated in an emergency department setting, we find that our semi-supervised learning approach (which uses no manually labeled data) performs comparably to supervised learning. Based on joint work with Yoni Halpern and Steven Horng. 32-G575

April 26

Add to Calendar 2017-04-26 11:30:00 2017-04-26 13:00:00 America/New_York Network-based approaches to identify disease-relevant mechanisms In this talk I will give 2 examples of the kind of projects we work on as part of Computational and Systems Biology at Biogen. Both examples highlight the use of human data to drive target identification and in building therapeutic hypotheses. In the first example, I will describe a method to identify central “dysregulated” genes from transcriptional data. The method can point to potential non-transcriptional events upstream of the observed transcriptional changes. An application of this method to a human ALS dataset reveals mechanisms that might be potential candidates for therapeutic intervention. In the second example, I will describe our efforts to learn a transcriptional network underlying context-specific eQTLs observed in human monocytes. Analysis of these simulated networks helps identify transcriptional connections important for activation of monocytes, which can then be exploited to identify targets upstream of disease-specific signatures. Time permitting, I will briefly touch upon our ongoing efforts in integrating phenotypic, molecular and literature data to construct therapeutic hypotheses for diseases of interest. 32-G575

April 19

Add to Calendar 2017-04-19 11:30:00 2017-04-19 13:00:00 America/New_York FIDDLE: An integrative deep learning framework for functional genomic data inference Numerous advances in sequencing technologies have revolutionized genomics through generating many types of genomic functional data. Statistical tools have been developed to analyze individual data types, but strategies to integrate disparate datasets under a unified framework are lacking. Moreover, most analysis techniques rely heavily on feature selection and data preprocessing, which increase the difficulty of addressing biological questions through the integration of multiple datasets. Here, we introduce FIDDLE (Flexible Integration of Data with Deep LEarning), an open source, data-agnostic, flexible integrative framework that learns a unified representation from multiple data types to infer another data type. As a case study, we use multiple Saccharomyces cerevisiae genomic datasets to predict global transcription start sites (TSS) through the simulation of TSS-seq data. We demonstrate that a type of data can be inferred from other sources of data types without manually specifying the relevant features and preprocessing. We show that models built from multiple genome-wide datasets perform profoundly better than models built from individual datasets. Thus FIDDLE learns the complex synergistic relationship within individual datasets and, importantly, across datasets. 32-G575

April 12

Add to Calendar 2017-04-12 11:30:00 2017-04-12 13:00:00 America/New_York Anomaly Detection for Precision Medicine This talk will discuss the anomaly detection machine learning paradigm as a model for precision medicine. The idea is to characterize molecular data from an individual patient in a clinically-meaningful way by comparing it to a large population of control samples. Such one-sided learning allows for the identification of rare disorders or the characterization of common but molecularly heterogeneous ones. However, existing anomaly detection methods, particularly distance-based methods, fare poorly on problems with the dimensions and characteristics of genomic data. We therefore introduce our results from developing a feature prediction approach that works well on both traditional machine learning data sets and gene expression data. We will discuss method evaluation and describe further work extending the anomaly detection model to incorporate functional interpretation. We will show how this approach has led to the discovery of new information about developmental disorders. Finally, we will discuss the relevance of such methods for characterizing sequence variation, which requires revisiting the issue of scalability, and present some recent results suggesting the feasibility of this approach. 32-G575

April 10

Add to Calendar 2017-04-10 11:30:00 2017-04-10 13:00:00 America/New_York Decoding epigenetic programs in cellular differentiation and T cell dysfunction in tumors Dysregulated epigenetic developmental programs are a feature of many cancers, and the diverse differentiation states of immune cells as well as their dysfunctional states in tumors are in part epigenetically encoded. We developed an integrative computational strategy to exploit genome-wide data on chromatin accessibility (DNase-seq or ATAC-seq), histone modifications (ChIP-seq), and transcription (RNA-seq) in order to study enhancer landscape and gene expression dynamics in cellular differentiation, with a focus on the hematopoietic system. We examined how early establishment of enhancers and complex regulatory locus control together govern gene expression changes in cell state transitions. We also developed a quantitative model to predict gene expression changes from the DNA sequence content and lineage history of active enhancers. We are now using these methods to study the chromatin dynamics of tumor-specific T cells (TSTs) in tumors. We profiled the gene expression (RNA-seq) and chromatin accessibility (ATAC-seq) landscapes of functional T cells and dysfunctional TSTs using a mouse genetic system where the SV40 large T antigen serves as both the oncogenic driver and the neoantigen. We showed that TSTs differentiate to dysfunctionality through two discrete chromatin states: an initial plastic state that can be functionally rescued and a later fixed state that is resistant to therapeutic reprogramming. We computationally identified transcription factors (TFs) with global changes in binding site accessibility during progression to the fixed dysfunctional state and found that in vivo pharmacological modulation of identified TFs decreases or delays dysfunction. 
We also showed that patient-derived PD1 high tumor infiltrating lymphocytes (TILs) display an epigenetic signature of fixed dysfunction, and we defined novel cell surface markers that demarcated reprogrammable TSTs within the heterogeneous TIL population, a finding of potential clinical significance. 32-G575

April 05

Add to Calendar 2017-04-05 11:30:00 2017-04-05 13:00:00 America/New_York Algorithms for Inferring Evolution and Migration of Tumors Cancer is an evolutionary process driven by somatic mutations that accumulate in a population of cells that form a primary tumor. In later stages of cancer progression, cells migrate from a primary tumor and seed metastases at distant anatomical sites. I will describe algorithms to reconstruct this evolutionary process from DNA sequencing data of tumors. These algorithms address challenges that distinguish the tumor phylogeny problem from classical phylogenetic tree reconstruction, including challenges due to mixed samples and complex migration patterns. 32-G575

March 22

Add to Calendar 2017-03-22 11:30:00 2017-03-22 13:00:00 America/New_York Learning representations of protein from sequence, structure, and network Understanding of protein structure and protein-protein interaction is crucial for studying molecular pathways and gaining insights into various biochemical processes. Data-driven approaches for predicting protein structure and interaction have recently improved, partially due to advances in machine learning. The success of machine learning algorithms often depends on data representation, which encodes explanatory factors of variation behind the data. Although our prior knowledge in protein science can help design good representations of proteins, powerful techniques capable of identifying protein patterns and sharing insights across diverse datasets are needed. In this talk, I will discuss three recent works on learning protein representations from sequence, structure and network data. First, I will introduce DeepContact, a deep convolutional neural-network (CNN) based approach that identifies conserved structural motifs, automatically and effectively leveraging patterns of residue-residue contacts to enable accurate inference of contact probabilities. Second, I will discuss DeepFold, another CNN-based approach to extract structural motifs within protein structure to enable accurate and efficient alignment-free structure search. Lastly, I will present Mashup, a feature learning algorithm to integrate protein-protein interaction networks for functional inference. In addition to the state-of-the-art performance, we expect these representation learning algorithms to provide biologically meaningful and deep insights into the organizational structure of protein folds and interaction networks. 32-G575

March 20

Add to Calendar 2017-03-20 11:30:00 2017-03-20 13:00:00 America/New_York Perturb-Seq: Dissecting Cellular Circuits with Single Cell RNA-Seq Aviv Regev, a computational and systems biologist, is a professor of biology at MIT, a Howard Hughes Medical Institute Investigator, and the Chair of the Faculty and the director of the Klarman Cell Observatory and Cell Circuits Program at the Broad Institute. She studies the molecular circuitry that governs the function of mammalian cells in health and disease and has pioneered many leading experimental and computational methods for the reconstruction of circuits, including in single-cell genomics. Regev is a recipient of the NIH Director’s Pioneer Award, a Sloan fellowship from the Sloan Foundation, the Overton Prize from the International Society for Computational Biology, the Earl and Thressa Stadtman Scholar Award from the American Society of Biochemistry and Molecular Biology, and is a 2016 ISCB Fellow. Regev received her M.Sc. from Tel Aviv University, studying biology, computer science, and mathematics in the Interdisciplinary Program for the Fostering of Excellence. She received her Ph.D. in computational biology from Tel Aviv University. 32-G575

March 08

Add to Calendar 2017-03-08 11:30:00 2017-03-08 13:00:00 America/New_York Studying reuse patterns in the protein universe We study the global nature of protein space, and study reuse patterns within it. To do so, we represent all similarities among a set of representative structures as a graph where edges connect proteins that share significantly sized segments of similar sequence and structure. This graph offers a way to organize protein space, and examine how the definition of “evolutionary relatedness” influences it. At excessively strict thresholds the graph “falls apart”; for very lax thresholds, there are paths between virtually all nodes. Interestingly, at intermediate thresholds the graph has two regions: "discrete" versus “continuous.” The discrete region consists of isolated islands, each generally corresponding to a fold; the continuous region is dominated by domains with alternating alpha and beta elements. Considering such a graph for two representative sets of ECOD domains and PDB chains, we study reuse patterns in protein space. Reuse (described by the edges in the graph) has a clear evolutionary advantage over 'design from scratch', where most newly-formed segments are not even foldable. The best characterized form of sequence reuse is structural domains, where the shared parts are of 100 amino acids on average. To systematically explore reuse in proteins, we develop a DP algorithm that derives the most reused non-overlapping segments of a domain/protein from the set of its alignments to other domains/proteins. This allows us to automatically identify shared ‘themes’, segments of 35 residues or more that are similar in sequence and structure. We show that reuse prevails at all levels, and that it increases with the decrease in length of the themes. In this respect, domains are just one of many forms of reuse in proteins, i.e., a special case of themes. 
The observed behavior is consistent with evolution by divergence, duplication, and mutations, consolidating the suggestion that proteins have evolved from ancestral amino acid segments. Indeed, some of our themes could be the descendants of these ancestral segments. I'll be describing new work with Sergey Nepomnyachiy and Nir Ben-Tal (not yet published), and will also mention older projects: CyToStruct: augmenting the network visualization of Cytoscape with the power of molecular viewers, Nepomnyachiy, Ben-Tal, Kolodny, Structure (2015) and Global view of the protein universe, Nepomnyachiy, Ben-Tal, Kolodny, PNAS (2014). 32-G575

March 01

Add to Calendar 2017-03-01 11:30:00 2017-03-01 13:00:00 America/New_York Challenges in identifying cancer genes in the face of inter and intra tumor heterogeneity Massively parallel sequencing has permitted an unprecedented examination of cancer genomes, leading to predictions that all genes important to cancer may soon be identified by genetic analysis of tumours from sufficiently large patient cohorts. In this presentation I will explore evidence suggesting this promise may have been premature. I will present our evaluation of the ability of state-of-the-art sequence analysis methods to recover known cancer genes. While some cancer genes are identified by analysis of recurrence, spatial clustering, or predicted impact of somatic mutations, many remain undetected due to lack of power to discriminate driver mutations from background. Furthermore, cancer genes not detected by mutation recurrence tend to be missed by other types of analysis of patient cohorts. Nonetheless, undetected genes are implicated by other experiments such as functional genetic screens and expression profiling. I will examine ways by which such genes elude detection due to inter and intra patient heterogeneity: first, due to gene dependency effects in inherited germline and somatic mutations, and second, due to transcriptional heterogeneity within tumors. Finally, I will present preliminary findings from unbiased transcriptional profiling of single cells from multiple colon cancer tumors. Our examination of expression subtypes derived from bulk RNA expression analysis of colon cancer patient cohorts showed that different subpopulations of cells within a single tumor have high expression of genes related to different subtypes, and a single tumor may be composed of a heterogeneous mixture of subtypes.
These results suggest that the current monolithic classification of cancer into types may be unable to capture the transcriptional heterogeneity in cancer, presenting an additional challenge to comprehensive ‘omic charting of individual tumors. 32-G575

February 22

Add to Calendar 2017-02-22 11:30:00 2017-02-22 13:00:00 America/New_York Accelerating precision cancer medicine through clinical computational oncology The increasing ability to generate vast data from a patient’s tumor and germline DNA at the point-of-care has required the development of a scientific discipline to harness this information avalanche for clinical and translational purposes: clinical computational oncology. This field, defined by the development and application of algorithms that analyze and interpret multi-omic data directly from patients to tackle scientific questions grounded in clinical care, has catalyzed new discoveries that directly impact precision cancer medicine strategies in many ways. This includes 1) prospective identification of patient-specific genomic features linked to therapeutic actions to guide individualized care, 2) definition of genomic mediators of response to existing and emerging cancer therapies, and 3) discovery of clinical tumor evolution mechanisms in multiple therapeutic resistance scenarios. In this presentation, we will present new advances in each of these areas of ongoing research, spanning multiple cancer types and clinical contexts. Broadly, the initiation and development of clinical computational oncology has significantly accelerated translational and clinical oncology discovery, whereby bioinformatics methodologies driven by clinically grounded investigators are leveraged specifically for clinical use at the point of care. Future efforts that expand this science across cancers and populations will inform their utility for precision cancer medicine. G575

February 15

Add to Calendar 2017-02-15 11:30:00 2017-02-15 13:00:00 America/New_York Functional interpretation of genomes using biological networks The recent explosion in genome-wide association studies and exome sequencing projects has revealed many genetic variants likely to be involved in disease processes, but the composition and function of the tissue-specific molecular systems they affect remain largely obscure. This limits our progress towards biological understanding and therapeutic intervention. Computational analyses that systematically integrate biological networks (i.e., networks in which genes are connected if they are functionally associated in some experimental system) with genetic data have emerged as a powerful and scalable approach to functionally interpret very large genomic data sets by enabling the identification of de novo pathways perturbed in disease. This talk will highlight approaches and methods we have developed in this area, and exemplify how different network-based methods have been used to analyze common and rare genetic variants to deduce the molecular networks perturbed by genetics in a wide range of diseases. As a general model for how in silico networks can be expanded, consolidated and validated, I will show how protein-protein interaction networks involved in human arrhythmias were elucidated and validated by combining GWAS, quantitative interaction proteomics, electrophysiology and model organisms through rigorous statistical frameworks. I will also illustrate how we are now applying this approach to cancers as well as neuropsychiatric and cardiovascular diseases. 32-G575