November 01

Dissecting genome circuitry: new mechanistic insights on regulatory mechanisms underlying cellular behavior
Harmen Bussemaker, Columbia University
Thursday, November 1st, 2018 at 4pm-5pm
Star Conference Room, 4th floor, 32D-463
MIT Stata Center, 32 Vassar St, Cambridge, MA 02139
Host: Manolis Kellis, kellis-admin@mit.edu 617-253-3497

Transcription-factor-centric biophysical and statistical modeling of high-throughput functional genomics data can yield new mechanistic insight into regulatory mechanisms underlying cellular behavior. (1) No Read Left Behind (NRLB), a feature-based maximum likelihood algorithm for analyzing SELEX data, allows quantifying the binding specificity of transcription factor complexes almost perfectly over a >100-fold affinity range and an unlimited binding site footprint. NRLB can accurately predict the effect on embryonic enhancer activity when ultra-low-affinity Exd-Hox binding sites that are 300-fold weaker than the highest-affinity site in the genome are mutated. (2) Our extension of the SELEX-seq assay to barcoded mixtures of methylated and unmethylated DNA libraries reveals that CpG methylation can affect binding by human Pbx-Hox complexes positively or negatively, depending on where the CpG is located relative to the binding interface. We find that in vitro and in vivo binding by the p53 tetramer can be stabilized by cytosine methylation. (3) Our comprehensive integrative analysis of gene regulatory networks driving aging and longevity implicates an unknown zinc finger protein as a key antagonist of FoxO3a, and shows that siRNA knockdown of this transcription factor in human cells leads to significantly increased nuclear localization of FoxO3a.

Dr. Harmen Bussemaker is a Professor in the Department of Biological Sciences and the Department of Systems Biology at Columbia University. His lab uses biophysical and statistical approaches to decode gene expression regulation and gene-regulatory sequences by integrating different genomics data types. Dr. Bussemaker has received the ISMB Ian Lawson Van Toch Memorial Award, the RECOMB Insight Award, the John Simon Guggenheim Fellowship, and the Lenfest Distinguished Columbia Faculty Award. He has co-organized the CSHL Systems Biology: Global Regulation of Gene Expression conference, and taught the CSHL course on integrative data analysis for high-throughput biology. He is on the editorial board of BMC Systems Biology, is an associate editor of PLoS Computational Biology, and serves as an ad hoc reviewer for Science, Nature, PNAS, Genome Research, Nature Genetics, and others. He studied with Eric Siggia at Rockefeller University.

May 16

Community Detection in Biological Networks: Lessons from the DREAM 2016 Module Identification Challenge
2018-05-16, 11:30-13:00 America/New_York, 32-G575

The 2016 DREAM Disease Module Identification Challenge was developed to systematically assess the state of computational module identification methods on a diverse collection of molecular networks. Six different networks were presented with the gene names anonymized. The goal was to partition the genes into non-overlapping modules of 3 to 100 genes each, based solely on the patterns of network connectivity. Collections of modules were scored based on the number of modules that were statistically significantly enriched for a set of trait or disease-related phenotypes, according to a set of previously published GWAS datasets. For the first subchallenge, gene names were anonymized separately for each network, and modules were requested for each of the six networks considered separately; the second subchallenge used the same identifiers across networks and asked for one collection of modules integrating information from all six networks.
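As a rough illustration of the format constraint stated in this abstract (non-overlapping modules of 3 to 100 genes each), a submission check could be sketched as follows. The function name `check_modules` and the dictionary-of-sets input are assumptions made for this example; they are not the challenge's actual tooling or file format.

```python
from typing import Dict, List, Set

MIN_SIZE = 3    # smallest module size named in the abstract
MAX_SIZE = 100  # largest module size named in the abstract


def check_modules(modules: Dict[str, Set[str]]) -> List[str]:
    """Return a list of problems; an empty list means the collection is valid.

    `modules` maps a (hypothetical) module name to its set of gene identifiers.
    """
    problems: List[str] = []
    seen: Dict[str, str] = {}  # gene -> name of the first module containing it
    for name, genes in modules.items():
        # Size constraint: each module must hold between 3 and 100 genes.
        if not MIN_SIZE <= len(genes) <= MAX_SIZE:
            problems.append(
                f"{name}: size {len(genes)} outside {MIN_SIZE}-{MAX_SIZE}"
            )
        # Non-overlap constraint: a gene may belong to at most one module.
        for gene in sorted(genes):
            if gene in seen:
                problems.append(f"{gene}: in both {seen[gene]} and {name}")
            else:
                seen[gene] = name
    return problems
```

Genes that appear in no module are allowed under this sketch; it checks only module size and overlap, not the enrichment-based scoring.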

May 09

From genome-scale to ecosystem-level models of microbial metabolism
2018-05-09, 11:30-13:00 America/New_York, 32-G575

Abstract: Metabolism, in addition to being the “engine” of every living cell, plays a major role in the cell-cell and cell-environment relations that shape the dynamics and evolution of microbial communities, e.g. by mediating competition and cross-feeding interactions between different species. Despite the increasing availability of metagenomic sequencing data for numerous microbial ecosystems, fundamental aspects of these communities, such as the maintenance of diversity, the unculturability of many isolates, and the conditions necessary for taxonomic or functional stability, are still poorly understood. In our lab, we develop and test mechanistic computational models for the dynamics and evolution of interactions between different organisms based on the knowledge of their entire metabolic networks, with applications in the study of natural and synthetic microbial communities.

May 02

Impact of negative selection on the genetic architecture of diseases and complex traits
2018-05-02, 11:30-13:00 America/New_York, 32-G575

Abstract: It is widely known that negative selection against genetic variants that reduce fitness causes them to be enriched for lower-frequency variants, so that lower-frequency variants have larger causal effects on diseases and complex traits. Here, we explore two other ways in which negative selection impacts disease and trait architectures. First, we show that (conditional on minor allele frequency) variants with low levels of linkage disequilibrium have larger causal effects. We show that much of this signal can be explained by the fact that (conditional on minor allele frequency) more recent variants have larger causal effects, since negative selection has had less time to remove them; for example, the youngest 20% of common variants explain 4x more heritability than the oldest 20% of common variants. Second, we show that functional annotations strongly impacted by negative selection have larger enrichment for low-frequency variant heritability compared to their enrichment for common variant heritability, both empirically and in forward simulations. Cell-type-specific regulatory annotations that are enriched for common variant heritability tend to be similarly enriched for low-frequency variant heritability for most annotations and traits, but more enriched for brain-related annotations and traits. For example, H3K4me3 marks in brain DPFC explain 57±12% of low-frequency variant heritability vs. 12±2% of common variant heritability for neuroticism, implicating the action of negative selection on low-frequency variants affecting gene regulation in the brain.

April 25

Translational Genomics and Precision Medicine in Metastatic Breast Cancer
2018-04-25, 11:30-13:00 America/New_York, 32-G575

Abstract: Over the past decade, genomic characterization of tumors from cancer patients obtained through large-scale sequencing projects has shed enormous light on the molecular underpinnings of cancer. These discoveries have, in turn, led to the development of novel therapies and preventive measures that have already revolutionized cancer care. Despite this tremendous progress, there remains much more that we need to learn in order to develop better treatments for metastatic cancer. In this presentation, we will discuss translational genomics approaches and precision medicine strategies for metastatic breast cancer. We will focus on how genomic and molecular characterization of metastatic tumor samples from patients sheds light on the biology of metastatic breast cancer and enables the development of strategies to overcome or prevent drug resistance. We will discuss the landscape of mutations in metastatic breast cancer, the potential clinical impact of genomic profiling, and the incorporation of genomic information into clinical care and clinical trials. We will also discuss how partnering directly with patients, in projects such as The Metastatic Breast Cancer Project (mbcproject.org), enables rapid identification of large numbers of patients willing to share tumors, saliva, and medical records to accelerate this research.

April 18

From Genetics To Therapeutics: Uncovering And Manipulating The Circuitry Of Non-coding Disease Variants
2018-04-18, 11:30-13:00 America/New_York, 32-G575

Perhaps the greatest surprise of human genome-wide association studies (GWAS) is that 90% of disease-associated regions do not affect proteins directly, but instead lie in non-coding regions with putative gene-regulatory roles. This has increased the urgency of understanding the non-coding genome, as a key component of understanding human disease. To address this challenge, we generated maps of genomic control elements across 127 primary human tissues and cell types, and tissue-specific regulatory networks linking these elements to their target genes and their regulators. We have used these maps and circuits to understand how human genetic variation contributes to disease and cancer, providing an unbiased view of disease genetics and sometimes re-shaping our understanding of common disorders. For example, we find evidence that genetic variants contributing to Alzheimer’s disease act primarily through immune processes, rather than neuronal processes. We also find that the strongest genetic association with obesity acts via a master switch controlling energy storage vs. energy dissipation in our adipocytes, rather than through the control of appetite in the brain. We also combine genetic information with regulatory annotations and epigenomic variation across patients and healthy controls to discover new disease genes and regions with roles in Alzheimer’s disease, heart disease, and prostate cancer, and to understand their pleiotropic effects by integration with electronic health records. Lastly, we develop technologies for systematically manipulating these circuits by high-throughput reporter assays, genome editing, and gene targeting in human cells and in mice, demonstrating tissue-autonomous therapeutic avenues in Alzheimer’s disease, obesity, and cancer. These results provide a roadmap for translating genetic findings into mechanistic insights and ultimately therapeutic treatments for complex disease and cancer.

April 11

Sequence, structure and network methods to uncover cancer genes
2018-04-11, 11:30-13:00 America/New_York, 32-G575

A major aim of cancer genomics is to pinpoint which somatically mutated genes are involved in tumor initiation and progression. This is a difficult task, as numerous somatic mutations are typically observed in each cancer genome, only a subset of which are cancer-relevant, and very few genes are found to be somatically mutated across large numbers of individuals. In this talk, I will give an overview of three methods my group has introduced for identifying cancer genes. First, I will present a framework for uncovering cancer genes, differential mutation analysis, that compares the mutational profiles of genes across cancer genomes with their natural germline variation across healthy individuals. Next, I will show how to leverage per-individual mutational profiles within the context of protein-protein interaction networks in order to identify small connected subnetworks of genes that, while not individually frequently mutated, comprise pathways that are altered across (i.e., “cover”) a large fraction of individuals. Finally, I will demonstrate that cancer genes can be discovered by identifying genes whose interaction interfaces are enriched in somatic mutations. Overall, these methods recapitulate known cancer driver genes, and discover novel, and sometimes rarely mutated, genes with likely roles in cancer.

April 04

Deeper understanding of microbiomes as a benefit of forgetting microbial names
2018-04-04, 11:30-13:00 America/New_York, 32-G575

Microbes run the world... but they don't care for the names we give them. Molecular functional abilities of individual microbes in microbiomes living in different environmental conditions are clearly different. Thus, the question of “who is there?” is not as relevant as “what are they doing?” Focusing on microbial molecular functionality, instead of names or classes, allows for a better description of microbial and microbiome-ial abilities and similarities. The recent emergence of high-throughput genomic sequencing, coupled with growing analytical capacities, has unlocked new horizons in our understanding of the microbial world. However, making sense of this deluge of data requires efficient and accurate computational techniques. The identification of microbial clades resident in a particular niche is only an estimate of the microbiome’s functional potential. We developed a sequencing read-based approach that can be applied to individual microbes and microbiomes to facilitate assessment of functional diversity. By adopting this point of view in analyzing metagenomic data, we hope to map emergent functionalities of condition- (or niche-) specific microbiomes.

March 21

When less enables more: making models and methods for modern genomics
2018-03-21, 11:30-13:00 America/New_York, 32-G575

The plummeting cost of high-throughput sequencing and the astounding variety of available assays have created a scientific regime in which the bottleneck in many experiments has ceased to be our ability to acquire data, and has instead become the computational costs associated with analyzing this data. Simultaneously, we have been building sequencing data archives that hold immense potential, but which remain largely inert due to our inability to efficiently index and query "raw" experimental data.

In this talk, I will discuss some of the methods that we have been developing to address these challenges as they arise in different contexts. I will highlight our recent work in fast, accurate and bias-aware methods for transcript quantification, as implemented in our tool Salmon. I will discuss Mantis, our indexing approach to enable sequence search over large collections of raw, unassembled read data. Finally, I will describe Pufferfish, a new time- and space-efficient data structure for indexing and querying the colored, compacted de Bruijn graph.

March 14

Multi-view Bayesian matrix factorization for mining large-scale heterogeneous electronic health records
2018-03-14, 11:30-13:00 America/New_York, 32-G575

Electronic health records (EHR) are a rich, heterogeneous collection of patient health information. The broad adoption of EHR systems has provided clinicians and researchers unprecedented opportunities for conducting health informatics research, which promises to provide an unbiased way to characterize patients’ disease risks, thereby making actionable clinical recommendations for subsequent follow-ups of precision medicine. However, there are several challenges in modeling EHR data, including noisy irregular text in clinical notes, arbitrary bias in the billing codes, not missing at random (NMAR) lab tests, and heterogeneous data types (e.g., clinical notes, billing codes, lab tests, medications). To address these challenges, we developed a Bayesian integrative generative model in the vein of collaborative filtering and latent topic models. Specifically, we propose a multi-view probabilistic matrix factorization framework. In a nutshell, the proposed method factorizes multiple high-dimensional clinical-feature matrices into lower-rank (basis) matrices and a common (loading) matrix that spans the patient dimension, which we interpret as the probabilistic disease mixture memberships for each patient. To learn the model parameters, we describe an efficient variational inference algorithm and its online stochastic counterpart.

We demonstrate our method’s general utility using real-world EHR data from the MIMIC-III database. By 5-fold cross-validation and prospective imputation, we observe superior imputation accuracy using multiple EHR data categories (except for the target EHR data category) compared to models using individual data categories, suggesting the benefits of borrowing information across data types in otherwise extremely sparse EHR matrices. Qualitative assessment shows that heterogeneous clinical features that tend to co-occur under the same latent topics exhibit meaningful semantics of known diseases under similar epidemiology along with relevant medications and treatment procedures. We then leverage the lower-dimensional patient mixture projections to predict prospective mortality of patients in critical conditions using their early admission records 1-6 months in advance. The proposed approach gives state-of-the-art performance compared to existing methods and reveals several distinct and meaningful disease topics related to the prognostic outcomes.

March 07

Quantitative analysis of population-scale family trees with millions of relatives
2018-03-07, 11:30-13:00 America/New_York, 32-G575

Family trees have vast applications in multiple fields from genetics to anthropology and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. Here, we collected 86 million profiles from publicly available online data shared by genealogy enthusiasts. After extensive cleaning and validation, we obtained population-scale family trees, including a single pedigree of 13 million individuals. We leveraged the data to partition the genetic architecture of longevity by inspecting millions of relative pairs and to provide insights into the geographical dispersion of families. We also report a simple digital procedure to overlay other datasets with our resource in order to empower studies with population-scale genealogical data.

Dr. Yaniv Erlich is the Chief Science Officer of MyHeritage.com and an Associate Professor of Computer Science and Computational Biology at Columbia University (leave of absence). Prior to these positions, he was a Fellow at the Whitehead Institute, MIT. Dr. Erlich received his bachelor’s degree from Tel-Aviv University, Israel (2006) and a PhD from the Watson School of Biological Sciences at Cold Spring Harbor Laboratory (2010). His research focuses on computational human genetics. Dr. Erlich is the recipient of DARPA’s Young Faculty Award (2017), the Burroughs Wellcome Career Award (2013), the Harold M. Weintraub award (2010), and the IEEE/ACM-CS HPC award (2008), and he was selected for Genome Technology’s Tomorrow’s PIs team in 2010.

February 28

Understanding genetic networks with Compressive Biology
2018-02-28, 11:15-13:00 America/New_York, 32-G575

Recent technological developments have made it possible to study genetic networks at unprecedented scale. In methods like Perturb-Seq, pooled CRISPR screens are combined with rich single cell molecular profiles as phenotypes to measure the effect of perturbations in tens of thousands of single cells. However, the capability of these methods is still dramatically limited; performing a genome-wide Perturb-Seq screen, for instance, would be ~1,000x larger than what has been demonstrated to date. More fundamentally, measuring the effect of combinatorial perturbations, which may be necessary to study functional redundancy, will be limited by the number of cells that can be grown, even if technological capabilities rapidly improve. In this talk, I introduce a new framework, Compressive Biology, to address these challenges. Compressive Biology has three foundational principles: (1) high-dimensional cellular systems can be organized into a compact, modular representation; (2) when genes are co-regulated and organized into response modules, measurements (e.g. a gene expression profile) can be compressed at the time of data collection; (3) when genes co-regulate their targets and are organized into functional modules, experiments (e.g. genetic perturbations) can be compressed. I discuss the mathematical theory underlying these principles, and practical applications to efficiently study genetic networks with composite measurements and composite experiments.

Bio: Brian Cleary is a PhD student at MIT in the labs of Aviv Regev and Eric Lander. His work focuses on using mathematical principles, most notably compressed sensing, to describe and apply new modalities of experimentation in biology. Prior to coming to MIT, Brian founded a company developing Natural Language Processing technology, and worked in options and algorithmic trading. He is a graduate of the California Institute of Technology, with B.S. degrees in Biology and Finance.

February 21

Leveraging long range phasing to detect mosaicism in blood at ultra-low allelic fractions
2018-02-21, 11:30-13:00 America/New_York, 32-G575

Most genotyping methods lose information about maternal vs. paternal inheritance of alleles, producing only diploid total allele counts at each genomic position. However, the relative parental inheritance of heterozygous sites can be recovered at high accuracy using statistical techniques. This estimation problem -- termed "phasing" -- is a fundamental challenge in human genetics. In this talk, I will first describe recent advances in phasing methodology that enable efficient phase estimation with chromosome-scale accuracy in the 500,000-sample UK Biobank data set. I will further describe how phase information can be harnessed to detect subtle imbalances between maternal and paternal allelic fractions in blood DNA -- the hallmark of mosaic chromosomal alterations -- revealing new insights into the causes and consequences of clonal hematopoiesis.

February 14

Statistical design and analysis for reproducible quantitative mass spectrometry-based experiments
2018-02-14, 11:30-13:00 America/New_York, 32-G575

Statistical methodology is key for reproducible research. This is particularly true in quantitative mass spectrometry-based proteomic experiments, which must overcome many sources of bias and unwanted variation. This talk will illustrate challenges of reproducible research in quantitative mass spectrometry-based proteomics, and will discuss ways in which reproducibility is promoted by appropriate statistical methodology. We will present the methods behind MSstats, an open-source R package for statistical relative quantification of proteins and peptides, and will demonstrate that they reduce the dependency of biological conclusions on tools used for initial data processing. Finally, we will discuss the importance of statistical approaches to experimental design, and of methods for assay characterization and quality control that can assist in conducting reproducible large-scale research.