Manolis Kellis


  • Position: Professor
  • Office: 32-D524
  • Phone: +1 (617) 253-2419
  • Areas of Study: Disease genomics, epigenomics, comparative genomics, ENCODE, Roadmap, GTEx, genetic variation, human disease, personal genomics
  • Last Update: August 3, 2016


I am a professor of Computer Science at MIT in the area of Computational Biology. I am a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and of the Broad Institute of MIT and Harvard.

My research interests are in the areas of computational biology, genomics, epigenomics, gene regulation, and genome evolution. Specifically:

(1) in the area of genome interpretation, we seek to develop comparative genomics methods to identify genes and regulatory elements systematically in the human genome.

(2) in the area of gene regulation, we seek to understand the regulatory motifs involved in cell type specification during development, their combinatorial relationships, and how these establish expression domains in the developing embryo.

(3) in the area of epigenomics, we seek to understand the chromatin signatures associated with distinct activity states, the changing chromatin states across different cell types and during differentiation, and the sequence signals responsible for the establishment and maintenance of chromatin marks.

(4) in the area of evolutionary genomics, we seek to understand the dynamics of gene phylogenies across complete genomes, the emergence of new gene functions by duplication and mutation, and the algorithmic principles behind phylogenomics.


Research Statement

Perhaps the greatest surprise of human genome-wide association studies (GWAS) is that 90% of disease-associated regions do not affect proteins directly, but instead lie in non-coding regions with putative gene-regulatory roles. This has increased the urgency of understanding the non-coding genome, as a key component of understanding human disease. To address this challenge, we generated maps of genomic control elements across 127 primary human tissues and cell types, and tissue-specific regulatory networks linking these elements to their target genes and their regulators. We have used these maps and circuits to understand how human genetic variation contributes to disease and cancer, providing an unbiased view of disease genetics and sometimes re-shaping our understanding of common disorders. For example, we find evidence that genetic variants contributing to Alzheimer's disease act primarily through immune processes, rather than neuronal processes. We also find that the strongest genetic association with obesity acts via a master switch controlling energy storage vs. energy dissipation in our adipocytes, rather than through the control of appetite in the brain. We have shown that we can manipulate these circuits by genome editing or gene targeting in human cells and in mice, indicating tissue-autonomous therapeutic avenues can alter metabolism. In addition to dissecting known disease-associated regions, we have combined genetic information with regulatory annotations and with epigenetic variation to discover new disease regions in cardiovascular disease, Alzheimer's disease, and prostate cancer. These results span the spectrum of common, rare, and somatic variants, and illustrate the power and broad applicability of regulatory annotations and circuits for understanding human disease and cancer.


150. Joint Bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases (pdf)

    Li, Kellis

    Genome-wide association studies (GWAS) provide a powerful approach for uncovering disease-associated variants in human, but fine-mapping the causal variants remains a challenge. This is partly remedied by prioritization of disease-associated variants that overlap GWAS-enriched epigenomic annotations. Here, we introduce a new Bayesian model RiVIERA (Risk Variant Inference using Epigenomic Reference Annotations) for inference of driver variants from summary statistics across multiple traits using hundreds of epigenomic annotations. In simulation, RiVIERA showed promising power in detecting causal variants and causal annotations, and multi-trait joint inference further improved the detection power. We applied RiVIERA to model the existing GWAS summary statistics of 9 autoimmune diseases and Schizophrenia by jointly harnessing the potential causal enrichments among 848 tissue-specific epigenomics annotations from ENCODE/Roadmap consortium covering 127 cell/tissue types and 8 major epigenomic marks. RiVIERA identified meaningful tissue-specific enrichments for enhancer regions defined by H3K4me1 and H3K27ac for Blood T-Cell specifically in the nine autoimmune diseases and Brain-specific enhancer activities exclusively in Schizophrenia. Moreover, the variants from the 95% credible sets exhibited high conservation and enrichments for GTEx whole-blood eQTLs located within transcription-factor-binding-sites and DNA-hypersensitive-sites. Furthermore, jointly modeling the nine immune traits by simultaneously inferring and exploiting the underlying epigenomic correlation between traits further improved the functional enrichments compared to single-trait models.

    Nucleic Acids Research gkw627, Jul 12, 2016

148. Discovery and validation of sub-threshold genome-wide association study loci using epigenomic signatures (pdf)

    Wang, Tucker, Rizki, Mills, Krijger, de Wit, Subramanian, Bartell, Nguyen, Ye, Leyton-Mange, Dolmatova, van der Harst, de Laat, Ellinor, Newton-Cheh, Milan, Kellis, Boyer

    Genetic variants identified by genome-wide association studies explain only a modest proportion of heritability, suggesting that meaningful associations lie 'hidden' below current thresholds. Here, we integrate information from association studies with epigenomic maps to demonstrate that enhancers significantly overlap known loci associated with the cardiac QT interval and QRS duration. We apply functional criteria to identify loci associated with QT interval that do not meet genome-wide significance and are missed by existing studies. We demonstrate that these 'sub-threshold' signals represent novel loci, and that epigenomic maps are effective at discriminating true biological signals from noise. We experimentally validate the molecular, gene-regulatory, cellular and organismal phenotypes of these sub-threshold loci, demonstrating that most sub-threshold loci have regulatory consequences and that genetic perturbation of nearby genes causes cardiac phenotypes in mouse. Our work provides a general approach for improving the detection of novel loci associated with complex human traits.

    eLife 5:e10557. May 10 2016. pii: e10557. doi: 10.7554/eLife.10557

144. Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases (pdf)

    Marbach, Lamparter, Quon, Kellis, Kutalik, Bergmann

    Mapping perturbed molecular circuits that underlie complex diseases remains a great challenge. We developed a comprehensive resource of 394 cell type- and tissue-specific gene regulatory networks for human, each specifying the genome-wide connectivity among transcription factors, enhancers, promoters and genes. Integration with 37 genome-wide association studies (GWASs) showed that disease-associated genetic variants, including variants that do not reach genome-wide significance, often perturb regulatory modules that are highly specific to disease-relevant cell types or tissues. Our resource opens the door to systematic analysis of regulatory programs across hundreds of human cell types and tissues.

    Nature Methods. Mar 7, 2016

142. HaploReg v4: systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. (pdf) (scholar)

    Ward, Kellis

    More than 90% of common variants associated with complex traits do not affect proteins directly, but instead the circuits that control gene expression. This has increased the urgency of understanding the regulatory genome as a key component for translating genetic results into mechanistic insights and ultimately therapeutics. To address this challenge, we developed HaploReg to aid the functional dissection of genome-wide association study (GWAS) results, the prediction of putative causal variants in haplotype blocks, the prediction of likely cell types of action, and the prediction of candidate target genes by systematic mining of comparative, epigenomic and regulatory annotations. Since first launching the website in 2011, we have greatly expanded HaploReg, increasing the number of chromatin state maps to 127 reference epigenomes from ENCODE 2012 and Roadmap Epigenomics, incorporating regulator binding data, expanding regulatory motif disruption annotations, and integrating expression quantitative trait locus (eQTL) variants and their tissue-specific target genes from GTEx, Geuvadis, and other recent studies. We present these updates as HaploReg v4, and illustrate a use case of HaploReg for attention deficit hyperactivity disorder (ADHD)-associated SNPs with putative brain regulatory mechanisms.

    Nucleic Acids Res. 2015 Dec 10. pii: gkv1340.

139. FTO Obesity Variant Circuitry and Adipocyte Browning in Humans (pdf) (scholar)

    Claussnitzer, Dankel, Kim, Quon, Meuleman, Haugen, Glunk, Sousa, Beaudry, Puviindran, Abdennur, Liu, Svensson, Hsu, Drucker, Mellgren, Hui, Hauner, Kellis

    Genome-wide association studies can be used to identify disease-relevant genomic regions, but interpretation of the data is challenging. The FTO region harbors the strongest genetic association with obesity, yet the mechanistic basis of this association remains elusive. We examined epigenomic data, allelic activity, motif conservation, regulator expression, and gene coexpression patterns, with the aim of dissecting the regulatory circuitry and mechanistic basis of the association between the FTO region and obesity. We validated our predictions with the use of directed perturbations in samples from patients and from mice and with endogenous CRISPR-Cas9 genome editing in samples from patients. Our data indicate that the FTO allele associated with obesity represses mitochondrial thermogenesis in adipocyte precursor cells in a tissue-autonomous manner. The rs1421085 T-to-C single-nucleotide variant disrupts a conserved motif for the ARID5B repressor, which leads to derepression of a potent preadipocyte enhancer and a doubling of IRX3 and IRX5 expression during early adipocyte differentiation. This results in a cell-autonomous developmental shift from energy-dissipating beige (brite) adipocytes to energy-storing white adipocytes, with a reduction in mitochondrial thermogenesis by a factor of 5, as well as an increase in lipid storage. Inhibition of Irx3 in adipose tissue in mice reduced body weight and increased energy dissipation without a change in physical activity or appetite. Knockdown of IRX3 or IRX5 in primary adipocytes from participants with the risk allele restored thermogenesis, increasing it by a factor of 7, and overexpression of these genes had the opposite effect in adipocytes from nonrisk-allele carriers. Repair of the ARID5B motif by CRISPR-Cas9 editing of rs1421085 in primary adipocytes from a patient with the risk allele restored IRX3 and IRX5 repression, activated browning expression programs, and restored thermogenesis, increasing it by a factor of 7. Our results point to a pathway for adipocyte thermogenesis regulation involving ARID5B, rs1421085, IRX3, and IRX5, which, when manipulated, had pronounced pro-obesity and anti-obesity effects.

    New England Journal of Medicine 373(10):895-907. Sep 3, 2015

138. Systematic chromatin state comparison of epigenomes associated with diverse properties including sex and tissue type (pdf) (scholar)

    Yen, Kellis

    Epigenomic data sets provide critical information about the dynamic role of chromatin states in gene regulation, but a key question of how chromatin state segmentations vary under different conditions across the genome has remained unaddressed. Here we present ChromDiff, a group-wise chromatin state comparison method that generates an information-theoretic representation of epigenomes and corrects for external covariate factors to better isolate relevant chromatin state changes. By applying ChromDiff to the 127 epigenomes from the Roadmap Epigenomics and ENCODE projects, we provide novel group-wise comparative analyses across sex, tissue type, state and developmental age. Remarkably, we find that distinct sets of epigenomic features are maximally discriminative for different group-wise comparisons, in each case revealing distinct enriched pathways, many of which do not show gene expression differences. Our methodology should be broadly applicable for epigenomic comparisons and provides a powerful new tool for studying chromatin state differences at the genome scale.

    Nature Communications 6:7973. Aug 18, 2015

137. Deep learning for regulatory genomics (pdf) (scholar)

    Park, Kellis

    A fundamental unit of gene-regulatory control is the contact between a regulatory protein and its target DNA or RNA molecule. Biophysical models that directly predict these interactions are incomplete and confined to specific types of structures, but computational analysis of large-scale experimental datasets allows regulatory motifs to be identified by their over-representation in target sequences. In this issue, Alipanahi et al. describe the use of a deep learning strategy to calculate protein-nucleic acid interactions from diverse experimental data sets. They show that their algorithm, called DeepBind, is broadly applicable and results in increased predictive power compared to traditional single-domain methods, and they use its predictions to discover regulatory motifs, to predict RNA editing and alternative splicing, and to interpret genetic variants. Looking beyond regulatory motifs, the current results illustrate the power of deep learning for biological data analysis in general. The approach can increase predictive power for specific tasks, integrate diverse datasets across data types, and provide greater generalization given the focus on representation learning and not simply classification accuracy. Systematic visualization and exploration of internal representations at each layer can yield mechanistic insights and guide new experiments and research directions. More broadly, deep learning can serve as a guiding principle to organize both hypothesis-driven research and exploratory investigation. For this potential to be realized, statistical and biological tasks must be integrated at all levels, including study design, experiment planning, model building and refinement, and data interpretation.

    Nature Biotechnology 33(8):825-6. Aug 7, 2015

131. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans (pdf) (scholar)

    GTEx Consortium; Ardlie, Deluca, Segrè, Sullivan, Young, Gelfand, Trowbridge, Maller, Tukiainen, Lek, Ward, Kheradpour, Iriarte, Meng, Palmer, Esko, Winckler, Hirschhorn, Kellis, MacArthur, Getz, Shabalin, Li, Zhou, Nobel, Rusyn, Wright, Lappalainen, Ferreira, Ongen, Rivas, Battle, Mostafavi, Monlong, Sammeth, Melé, Reverter, Goldmann, Koller, Guigó, McCarthy, Dermitzakis, Gamazon, Im, Konkashbaev, Nicolae, Cox, Flutre, Wen, Stephens, Pritchard, Tu, Zhang, Huang, Long, Lin, Yang, Zhu, Liu, Brown, Mestichelli, Tidwell, Lo, Salvatore, Shad, Thomas, Lonsdale, Moser, Gillard, Karasik, Ramsey, Choi, Foster, Syron, Fleming, Magazine, Hasz, Walters, Bridge, Miklos, Sullivan, Barker, Traino, Mosavel, Siminoff, Valley, Rohrer, Jewell, Branton, Sobin, Barcus, Qi, McLean, Hariharan, Um, Wu, Tabor, Shive, Smith, Buia, Undale, Robinson, Roche, Valentino, Britton, Burges, Bradbury, Hambright, Seleski, Korzeniewski, Erickson, Marcus, Tejada, Taherian, Lu, Basile, Mash, Volpi, Struewing, Temple, Boyer, Colantuoni, Little, Koester, Carithers, Moore, Guan, Compton, Sawyer, Demchok, Vaught, Rabiner, Lockhart, Ardlie, Getz, Wright, Kellis, Volpi, Dermitzakis

    Understanding the functional consequences of genetic variation, and how it affects complex human disease and quantitative traits, remains a critical challenge for biomedicine. We present an analysis of RNA sequencing data from 1641 samples across 43 tissues from 175 individuals, generated as part of the pilot phase of the Genotype-Tissue Expression (GTEx) project. We describe the landscape of gene expression across tissues, catalog thousands of tissue-specific and shared regulatory expression quantitative trait loci (eQTL) variants, describe complex network relationships, and identify signals from genome-wide association studies explained by eQTLs. These findings provide a systematic understanding of the cellular and biological consequences of human genetic variation and of the heterogeneity of such effects among a diverse set of human tissues.

    Science 348(6235):648-60. May 8, 2015

127. Integrative analysis of 111 reference human epigenomes (pdf) (scholar)

    Roadmap Epigenomics Consortium, Kundaje, Meuleman, Ernst, Bilenky, Yen, Heravi-Moussavi, Kheradpour, Zhang, Wang, Ziller, Amin, Whitaker, Schultz, Ward, Sarkar, Quon, Sandstrom, Eaton, Wu, Pfenning, Wang, Claussnitzer, Liu, Coarfa, Harris, Shoresh, Epstein, Gjoneska, Leung, Xie, Hawkins, Lister, Hong, Gascard, Mungall, Moore, Chuah, Tam, Canfield, Hansen, Kaul, Sabo, Bansal, Carles, Dixon, Farh, Feizi, Karlic, Kim, Kulkarni, Li, Lowdon, Elliott, Mercer, Neph, Onuchic, Polak, Rajagopal, Ray, Sallari, Siebenthall, Sinnott-Armstrong, Stevens, Thurman, Wu, Zhang, Zhou, Beaudet, Boyer, De Jager, Farnham, Fisher, Haussler, Jones, Li, Marra, McManus, Sunyaev, Thomson, Tlsty, Tsai, Wang, Waterland, Zhang, Chadwick, Bernstein, Costello, Ecker, Hirst, Meissner, Milosavljevic, Ren, Stamatoyannopoulos, Wang, Kellis

    The reference human genome sequence set the stage for studies of genetic variation and its association with human disease, but epigenomic studies lack a similar reference. To address this need, the NIH Roadmap Epigenomics Consortium generated the largest collection so far of human epigenomes for primary cells and tissues. Here we describe the integrative analysis of 111 reference human epigenomes generated as part of the programme, profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression. We establish global maps of regulatory elements, define regulatory modules of coordinated activity, and their likely activators and repressors. We show that disease- and trait-associated genetic variants are enriched in tissue-specific epigenomic marks, revealing biologically relevant cell types for diverse human traits, and providing a resource for interpreting the molecular basis of human disease. Our results demonstrate the central role of epigenomic information for understanding gene regulation, cellular differentiation and human disease.

    Nature 518:317-30. Feb 19, 2015 doi:10.1038/nature14248. PMID 25693563

126. Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer's disease (pdf) (scholar)

    Gjoneska, Pfenning, Mathys, Quon, Kundaje, Tsai, Kellis

    Alzheimer's disease (AD) is a severe age-related neurodegenerative disorder characterized by accumulation of amyloid-beta plaques and neurofibrillary tangles, synaptic and neuronal loss, and cognitive decline. Several genes have been implicated in AD, but chromatin state alterations during neurodegeneration remain uncharacterized. Here we profile transcriptional and chromatin state dynamics across early and late pathology in the hippocampus of an inducible mouse model of AD-like neurodegeneration. We find a coordinated downregulation of synaptic plasticity genes and regulatory regions, and upregulation of immune response genes and regulatory regions, which are targeted by factors that belong to the ETS family of transcriptional regulators, including PU.1. Human regions orthologous to increasing-level enhancers show immune-cell-specific enhancer signatures as well as immune cell expression quantitative trait loci, while decreasing-level enhancer orthologues show fetal-brain-specific enhancer activity. Notably, AD-associated genetic variants are specifically enriched in increasing-level enhancer orthologues, implicating immune processes in AD predisposition. Indeed, increasing enhancers overlap known AD loci lacking protein-altering variants, and implicate additional loci that do not reach genome-wide significance. Our results reveal new insights into the mechanisms of neurodegeneration and establish the mouse as a useful model for functional studies of AD regulatory regions.

    Nature 518:365-9. Feb 19, 2015 doi: 10.1038/nature14252. PMID 25693568

124. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues (pdf) (scholar)

    Ernst, Kellis

    With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.

    Nature Biotechnology Feb 18, 2015 doi 10.1038/nbt.3157 PMID 25690853

108. Defining functional DNA elements in the human genome (pdf) (scholar)

    Kellis, Wold, Snyder, Bernstein, Kundaje, Marinov, Ward, Birney, Crawford, Dekker, Dunham, Elnitski, Farnham, Feingold, Gerstein, Giddings, Gilbert, Gingeras, Green, Guigo, Hubbard, Kent, Lieb, Myers, Pazin, Ren, Stamatoyannopoulos, Weng, White, Hardison

    With the completion of the human genome sequence, attention turned to identifying and annotating its functional DNA elements. As a complement to genetic and comparative genomics approaches, the Encyclopedia of DNA Elements Project was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types. The resulting genome-wide data reveal sites of biochemical activity with high positional resolution and cell type specificity that facilitate studies of gene regulation and interpretation of noncoding variants associated with human disease. However, the biochemically active regions cover a much larger fraction of the genome than do evolutionarily conserved regions, raising the question of whether nonconserved but biochemically active regions are truly functional. Here, we review the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies. We also analyze the relationship between signal intensity, genomic coverage, and evolutionary conservation. Our results reinforce the principle that each approach provides complementary information and that we need to use combinations of all three to elucidate genome function in human biology and disease.

    PNAS Apr 23, 2014

107. Evolutionary dynamics and tissue specificity of human long noncoding RNAs in six mammals (pdf) (scholar)

    Washietl, Kellis*, Garber*

    Long intergenic noncoding RNAs (lincRNAs) play diverse regulatory roles in human development and disease, but little is known about their evolutionary history and constraint. Here, we characterize human lincRNA expression patterns in nine tissues across six mammalian species and multiple individuals. Of the 1898 human lincRNAs expressed in these tissues, we find orthologous transcripts for 80% in chimpanzee, 63% in rhesus, 39% in cow, 38% in mouse, and 35% in rat. Mammalian-expressed lincRNAs show remarkably strong conservation of tissue specificity, suggesting that it is selectively maintained. In contrast, abundant splice-site turnover suggests that exact splice sites are not critical. Relative to evolutionarily young lincRNAs, mammalian-expressed lincRNAs show higher primary sequence conservation in their promoters and exons, increased proximity to protein-coding genes enriched for tissue-specific functions, fewer repeat elements, and more frequent single-exon transcripts. Remarkably, we find that ~20% of human lincRNAs are not expressed beyond chimpanzee and are undetectable even in rhesus. These hominid-specific lincRNAs are more tissue specific, enriched for testis, and faster evolving within the human lineage.

    Genome Research 24(4):616-28, Jan 15, 2014

100. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments (pdf) (scholar)

    Kheradpour, Kellis

    Recent advances in technology have led to a dramatic increase in the number of available transcription factor ChIP-seq and ChIP-chip data sets. Understanding the motif content of these data sets is an important step in understanding the underlying mechanisms of regulation. Here we provide a systematic motif analysis for 427 human ChIP-seq data sets using motifs curated from the literature and also discovered de novo using five established motif discovery tools. We use a systematic pipeline for calculating motif enrichment in each data set, providing a principled way for choosing between motif variants found in the literature and for flagging potentially problematic data sets. Our analysis confirms the known specificity of 41 of the 56 analyzed factor groups and reveals motifs of potential cofactors. We also use cell type-specific binding to find factors active in specific conditions. The resource we provide is accessible both for browsing a small number of factors and for performing large-scale systematic analyses. We provide motif matrices, instances and enrichments in each of the ENCODE data sets. The motifs discovered here have been used in parallel studies to validate the specificity of antibodies, understand cooperativity between data sets and measure the variation of motif binding across individuals and species.

    Nucleic Acids Res. 2013 Dec 13

99. Most parsimonious reconciliation in the presence of gene duplication, loss, and deep coalescence using labeled coalescent trees (pdf) (scholar)

    Wu, Rasmussen, Bansal, Kellis

    Accurate gene tree-species tree reconciliation is fundamental to inferring the evolutionary history of a gene family. However, although it has long been appreciated that population-related effects such as incomplete lineage sorting (ILS) can dramatically affect the gene tree, many of the most popular reconciliation methods consider discordance only due to gene duplication and loss (and sometimes horizontal gene transfer). Methods that do model ILS are either highly parameterized or consider a restricted set of histories, thus limiting their applicability and accuracy. To address these challenges, we present a novel algorithm DLCpar for inferring a most parsimonious (MP) history of a gene family in the presence of duplications, losses, and ILS. Our algorithm relies on a new reconciliation structure, the labeled coalescent tree (LCT), that simultaneously describes coalescent and duplication-loss history. We show that the LCT representation enables an exhaustive and efficient search over the space of reconciliations, and, for most gene families, the least common ancestor (LCA) mapping is an optimal solution for the species mapping between the gene tree and species tree in a MP LCT. Applying our algorithm to a variety of clades, including flies, fungi, and primates, as well as to simulated phylogenies, we achieve high accuracy, comparable to sophisticated probabilistic reconciliation methods, at reduced runtime and with far fewer parameters. These properties enable inference of complex evolution of gene families across a broad range of species and large data sets.

    Genome Research 24(3):475-86, Dec 5, 2013.

97. Extensive Variation in Chromatin States Across Humans (pdf) (scholar)

    Kasowski, Kyriazopoulou-Panagiotopoulou, Grubert, Zaugg, Kundaje, Liu, Boyle, Zhang, Zakharia, Spacek, Li, Xie, Olarerin-George, Steinmetz, Hogenesch, Kellis, Batzoglou, Snyder

    The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.

    Science. Oct 17, 2013

94. Network deconvolution as a general method to distinguish direct dependencies in networks (pdf) (scholar)

    Feizi, Marbach, Medard, Kellis

    Recognizing direct relationships between variables connected in a network is a pervasive problem in biological, social and information sciences as correlation-based networks contain numerous indirect relationships. Here we present a general method for inferring direct effects from an observed correlation matrix containing both direct and indirect effects. We formulate the problem as the inverse of network convolution, and introduce an algorithm that removes the combined effect of all indirect paths of arbitrary length in a closed-form solution by exploiting eigen-decomposition and infinite-series sums. We demonstrate the effectiveness of our approach in several network applications: distinguishing direct targets in gene expression regulatory networks; recognizing directly interacting amino-acid residues for protein structure prediction from sequence alignments; and distinguishing strong collaborations in co-authorship social networks using connectivity information alone. In addition to its theoretical impact as a foundational graph theoretic tool, our results suggest network deconvolution is widely applicable for computing direct dependencies in network science across diverse disciplines.

    Nature Biotechnology, Jul 14, 2013
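
    The closed-form step described in the network deconvolution abstract above can be illustrated with a minimal sketch. It assumes a symmetric observed matrix that equals the sum of all powers of the direct matrix (direct + direct^2 + direct^3 + ...), with the series converging. Under that assumption each observed eigenvalue l_obs maps to a direct eigenvalue l_obs / (1 + l_obs). The function name is hypothetical, and any rescaling or preprocessing used in the published method is not reproduced here.

```python
import numpy as np


def network_deconvolution(g_obs):
    """Illustrative sketch: recover direct effects from an observed matrix.

    Assumption (not taken from the published software): g_obs is symmetric
    and equals g_dir + g_dir^2 + g_dir^3 + ..., with the series converging.
    """
    g_obs = np.asarray(g_obs, dtype=float)
    if not np.allclose(g_obs, g_obs.T):
        raise ValueError("this sketch only handles symmetric matrices")

    # Eigen-decomposition of the symmetric observed matrix.
    eigvals, eigvecs = np.linalg.eigh(g_obs)
    if np.any(np.isclose(1.0 + eigvals, 0.0)):
        raise ValueError("an eigenvalue of -1 makes the inverse undefined")

    # Summing the geometric series gives l_obs = l_dir / (1 - l_dir),
    # so inverting it yields l_dir = l_obs / (1 + l_obs).
    dir_vals = eigvals / (1.0 + eigvals)

    # Rebuild the direct matrix from the rescaled eigenvalues.
    return (eigvecs * dir_vals) @ eigvecs.T
```

    For a three-node chain in which nodes 0 and 2 interact only through node 1, the observed matrix has a non-zero entry between nodes 0 and 2, and the sketch returns a direct matrix in which that entry is numerically zero.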

89. Systematic dissection of regulatory motifs in 2,000 predicted human enhancers using a massively parallel reporter assay (pdf) (scholar)

    Kheradpour, Ernst, Melnikov, Rogov, Wang, Zhang, Alston, Mikkelsen, Kellis

    Genome-wide chromatin maps have permitted the systematic mapping of putative regulatory elements across multiple human cell types, revealing tens of thousands of candidate distal enhancer regions. However, until recently, their experimental dissection by directed regulatory motif disruption has remained unfeasible at the genome scale, due to the technological lag in large-scale DNA synthesis. Here, we employ a massively parallel reporter assay (MPRA) to measure the transcriptional levels induced by 145bp DNA segments centered on evolutionarily-conserved regulatory motif instances and found in enhancer chromatin states. We select five predicted activators (HNF1, HNF4, FOXA, GATA, NFE2L2) and two predicted repressors (GFI1, ZFP161) and measure reporter expression in erythroleukemia (K562) and liver carcinoma (HepG2) cell lines. We test 2,104 wild-type sequences and an additional 3,314 engineered enhancer variants containing targeted motif disruptions, each using 10 barcode tags in two cell lines and 2 replicates. The resulting data strongly confirm the enhancer activity and cell type specificity of enhancer chromatin states, the ability of 145bp segments to recapitulate both, the necessary role of regulatory motifs in enhancer function, and the complementary roles of activator and repressor motifs. We find statistically robust evidence that (1) scrambling, removing, or disrupting the predicted activator motifs abolishes enhancer function, while silent or motif-improving changes maintain enhancer activity; (2) evolutionary conservation, nucleosome exclusion, binding of other factors, and strength of the motif match are all associated with wild-type enhancer activity; (3) scrambling repressor motifs leads to aberrant reporter expression in cell lines where the enhancers are usually not active. 
    Our results suggest a general strategy for deciphering cis-regulatory elements by systematic large-scale experimental manipulation, and provide quantitative enhancer activity measurements across thousands of constructs that can be mined to generate and test predictive models of gene expression.

    Genome Research doi:10.1101/gr.144899.112, March 19, 2013

81. Interpreting noncoding genetic variation in complex traits and human disease (pdf) (scholar)

    Ward, Kellis

    Association studies provide genome-wide information about the genetic basis of complex disease, but medical research has focused primarily on protein-coding variants, owing to the difficulty of interpreting noncoding mutations. This picture has changed with advances in the systematic annotation of functional noncoding elements. Evolutionary conservation, functional genomics, chromatin state, sequence motifs and molecular quantitative trait loci all provide complementary information about the function of noncoding sequences. These functional maps can help with prioritizing variants on risk haplotypes, filtering mutations encountered in the clinic and performing systems-level analyses to reveal processes underlying disease associations. Advances in predictive modeling can enable data-set integration to reveal pathways shared across loci and alleles, and richer regulatory models can guide the search for epistatic interactions. Lastly, new massively parallel reporter experiments can systematically validate regulatory predictions. Ultimately, advances in regulatory and systems genomics can help unleash the value of whole-genome sequencing for personalized genomic risk assessment, diagnosis and treatment.

    Nature Biotechnology 30:1095-1106, Nov 2012

77. Evidence of Abundant Purifying Selection in Humans for Recently Acquired Regulatory Functions (pdf) (scholar)

    Ward, Kellis

    Although only 5% of the human genome is conserved across mammals, a substantially larger portion is biochemically active, raising the question of whether the additional elements evolve neutrally or confer a lineage-specific fitness advantage. To address this question, we integrate human variation information from the 1000 Genomes Project and activity data from the ENCODE Project. A broad range of transcribed and regulatory nonconserved elements show decreased human diversity, suggesting lineage-specific purifying selection. Conversely, conserved elements lacking activity show increased human diversity, suggesting that some recently became nonfunctional. Regulatory elements under human constraint in nonconserved regions were found near color vision and nerve-growth genes, consistent with purifying selection for recently evolved functions. Our results suggest continued turnover in regulatory regions, with at least an additional 4% of the human genome subject to lineage-specific constraint.

    Science 337:1675-8, Sep 5, 2012

74. An integrated encyclopedia of DNA elements in the human genome (pdf) (scholar)

    ENCODE Project Consortium

    The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

    Nature 489:57-74, Sep 6, 2012

61. A high-resolution map of human evolutionary constraint using 29 mammals (pdf) (scholar)

    Lindblad-Toh, Garber, Zuk, Lin, Parker, Washietl, Kheradpour, Ernst, Jordan, Mauceli, Ward, Lowe, Holloway, Clamp, Gnerre, Alfoldi, Beal, Chang, Clawson, Palma, Fitzgerald, Flicek, Guttman, Hubisz, Jaffe, Jungreis, Kostka, Lara, Martins, Massingham, Moltke, Raney, Rasmussen, Stark, Vilella, Wen, Xie, Zody, Worley, Kovar, Muzny, Gibbs, Warren, Mardis, Weinstock, Wilson, Birney, Margulies, Herrero, Green, Haussler, Siepel, Goldman, Pollard, Pedersen, Lander, Kellis

    The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering 4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for 60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease.

    Nature 478:476-82, Oct 12, 2011

49. Mapping and analysis of chromatin state dynamics in nine human cell types (pdf) (scholar)

    Ernst, Kheradpour, Mikkelsen, Shoresh, Ward, Epstein, Zhang, Wang, Issner, Coyne, Ku, Durham, Kellis*, Bernstein*

    Chromatin profiling has emerged as a powerful means for annotating genomic elements and detecting regulatory activity. Here we generate and analyze a compendium of epigenomic maps for nine chromatin marks across nine cell types, in order to systematically characterize cis-regulatory elements, their cell type-specificities, and their functional interactions. We first identify recurrent combinations of histone modifications and use them to annotate diverse regulatory elements including promoters, enhancers, transcripts and insulators in each cell type. We next characterize the dynamics of these elements, revealing meaningful patterns of activity for promoter states and exquisite cell type-selectivity for enhancer states. We define multi-cell activity profiles that reflect the patterns of enhancer state activity across cell types, as well as analogous profiles for gene expression, regulatory motif enrichments, and expression of the corresponding regulators. We use correlations between these profiles to link candidate enhancers to putative target genes, to infer cell type-specific activators and repressors, and to predict and validate functional regulator binding motifs in specific chromatin states. These functional annotations and regulatory predictions enable us to revisit intergenic single-nucleotide polymorphisms (SNPs) associated with human disease in genome-wide association studies (GWAS). We find that for several diseases, top-scoring SNPs are precisely positioned within enhancer elements specifically active in relevant cell types. In several cases a disease variant affects a motif instance for one of the predicted causal regulators, thus providing a potential mechanistic explanation for the disease association. Our study presents a general framework for applying multi-cell chromatin state analysis to decipher cis-regulatory connections and their role in health and disease.

    Nature, doi:10.1038/nature09906, Epub ahead of print: March 23, 2011

48. A Cis-Regulatory Map of the Drosophila Genome (pdf) (scholar)

    Negre, Brown, Ma, Bristow, Miller, Kheradpour, Loriaux, Sealfon, Li, Ishii, Spokony, Chen, Hwang, Wagner, Auburn, Domanus, Shah, Morrison, Zieba, Suchy, Senderowicz, Victorsen, Bild, Grundstad, Hanley, Mannervik, Venken, Bellen, White, Russell, Grossman, Ren, Posakony, Kellis, White

    Following the sequencing of human and model organism genomes, genome-wide annotation of regulatory information has emerged as a major challenge. Here we describe an initial map of the Drosophila melanogaster regulatory genome based on the developmental dynamics of chromatin modifications and chromatin modifying enzymes, on polymerase occupancy of promoters, on the dynamic binding of enhancer-associated proteins such as the transcriptional co-factor CBP, and on the localization of forty-one site-specific transcription factors at different stages of development. The entire dataset provides protein modification and binding annotations across 94% of the genome along with prediction and validation of 4 classes of regulatory elements: insulators, promoters, silencers and enhancers. This regulatory map reveals several newly discovered properties of genome regulation, including the lack of epigenetic marks at promoters of transiently expressed genes, the association of specific Histone Deacetylases (HDACs) with Polycomb Response Elements, the early role of CBP as a marker of enhancers and the occurrence of high-occupancy transcription factor binding sites that correlate with gene expression. Using these data we also generated a combinatorial analysis of transcription factors and DNA sequence motifs that are associated with different sets of developmentally co-expressed genes, providing a database for discovering the sets of regulatory inputs that control regulatory element function. Together, these cis-regulatory annotations serve as a foundation for further detailed analyses of the genomic regulatory code in Drosophila.

    Nature 471:527-531, March 23, 2011

46. Identification of functional elements and regulatory circuits in Drosophila by large-scale data integration (pdf) (scholar)

    The modENCODE Consortium, Roy, Ernst, Kharchenko, Kheradpour, Negre, Eaton, Landolin, Bristow, Ma, Lin, Washietl, Arshinoff, Ay, Meyer, Robine, Washington, Di Stefano, Berezikov, Brown, Brown, Candeias, Carlson, Carr, Jungreis, Marbach, Sealfon, Tolstorukov, Alekseyenko, Artieri, Boley, Booth, Brooks, Dai, Davis, Duff, Feng, Gorchakov, Gu, Henikoff, Kapranov, Li, Li, MacAlpine, Malone, Minoda, Nordman, Okamura, Perry, Powell, Riddle, Sakai, Samsonova, Sandler, Schwartz, Sher, Spokony, Sturgill, van Baren, Will, Wan, Yang, Yu, Feingold, Good, Guyer, Lowdon, Ahmad, Andrews, Berger, Bickel, Brenner, Brent, Cherbas, Elgin, Gingeras, Grossman, Hoskins, Kaufman, Kent, Kuroda, Orr-Weaver, Perrimon, Pirrotta, Posakony, Ren, Russell, Cherbas, Graveley, Lewis, Micklem, Oliver, Park, Celniker, Henikoff, Karpen, Lai, MacAlpine, Stein, White, Kellis

    Several years after the initial sequencing of the genomes from human and other organisms, the vast majority of each genome remains unannotated, and it is still unclear how to translate genomic information into a functional map of cellular and developmental programs. To address this question, the Drosophila modENCODE project has undertaken a large-scale effort to comprehensively map transcription, regulator binding, chromatin state, replication, and nucleosome properties across a developmental time-course and in multiple cell lines. Here, we report our initial integrative analysis of the first phase of the project, encompassing more than 1000 datasets generated over four years across six production centers. Our integrated annotation enabled the discovery of new protein-coding, non-coding, RNA regulatory, replication, and chromatin elements that more than triple the annotated portion of the genome. We study correlated activity patterns of these elements to infer a functional regulatory network, which we use to predict putative functions for new genes, reveal stage-specific and tissue-specific regulators, and infer predictive models of gene expression. Our results provide a reference annotation that can inform directed experimental and computational studies in Drosophila and related species, and provide a model for systematic data integration towards the comprehensive genomic and functional annotation of any genome, including the human.

    Science, Dec 24, 2010

42. Discovery and characterization of chromatin states for systematic annotation of the human genome (pdf) (scholar)

    Ernst, Kellis

    A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal 'chromatin states' in human T cells, based on recurrent and spatially coherent combinations of chromatin marks. We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.

    Nature Biotechnology 2010 Aug;28(8):817-25. Epub 2010 Jul 25. PMCID: PMC2919626 PMID: 20657582

17. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes (pdf) (scholar)

    Lin, Carlson, Crosby, Matthews, Yu, Park, Wan, Schroeder, Gramates, St, Roark, Wiley, Kulathinal, Zhang, Myrick, Antone, Celniker, Gelbart, Kellis

    The availability of sequenced genomes from 12 Drosophila species has enabled the use of comparative genomics for the systematic discovery of functional elements conserved within this genus. We have developed quantitative metrics for the evolutionary signatures specific to protein-coding regions and applied them genome-wide, resulting in 1193 candidate new protein-coding exons in the D. melanogaster genome. We have reviewed these predictions by manual curation and validated a subset by directed cDNA screening and sequencing, revealing both new genes and new alternative splice forms of known genes. We also used these evolutionary signatures to evaluate existing gene annotations, resulting in the validation of 87% of genes lacking descriptive names and identifying 414 poorly conserved genes that are likely to be spurious predictions, noncoding, or species-specific genes. Furthermore, our methods suggest a variety of refinements to hundreds of existing gene models, such as modifications to translation start codons and exon splice boundaries. Finally, we performed directed genome-wide searches for unusual protein-coding structures, discovering 149 possible examples of stop codon readthrough, 125 new candidate ORFs of polycistronic mRNAs, and several candidate translational frameshifts. These results affect >10% of annotated fly genes and demonstrate the power of comparative genomics to enhance our understanding of genome organization, even in a model organism as intensively studied as Drosophila melanogaster.

    Genome Res. 2007 Dec;17(12):1823-36. Epub 2007 Nov 7. PMCID: PMC2099591 PMID: 17989253

6. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae (pdf) (scholar)

    Kellis, Birren, Lander

    Whole-genome duplication followed by massive gene loss and specialization has long been postulated as a powerful mechanism of evolutionary innovation. Recently, it has become possible to test this notion by searching complete genome sequence for signs of ancient duplication. Here, we show that the yeast Saccharomyces cerevisiae arose from ancient whole-genome duplication, by sequencing and analysing Kluyveromyces waltii, a related yeast species that diverged before the duplication. The two genomes are related by a 1:2 mapping, with each region of K. waltii corresponding to two regions of S. cerevisiae, as expected for whole-genome duplication. This resolves the long-standing controversy on the ancestry of the yeast genome, and makes it possible to study the fate of duplicated genes directly. Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions.

    Nature. 2004 Apr 8;428(6983):617-24. Epub 2004 Mar 7. PMID: 15004568

2. Sequencing and comparison of yeast species to identify genes and regulatory elements (pdf) (scholar)

    Kellis, Patterson, Endrizzi, Birren, Lander

    Identifying the functional elements encoded in a genome is one of the principal challenges in modern biology. Comparative genomics should offer a powerful, general approach. Here, we present a comparative analysis of the yeast Saccharomyces cerevisiae based on high-quality draft sequences of three related species (S. paradoxus, S. mikatae and S. bayanus). We first aligned the genomes and characterized their evolution, defining the regions and mechanisms of change. We then developed methods for direct identification of genes and regulatory motifs. The gene analysis yielded a major revision to the yeast gene catalogue, affecting approximately 15% of all genes and reducing the total count by about 500 genes. The motif analysis automatically identified 72 genome-wide elements, including most known regulatory motifs and numerous new motifs. We inferred a putative function for most of these motifs, and provided insights into their combinatorial interactions. The results have implications for genome analysis of diverse organisms, including the human.

    Nature. 2003 May 15;423(6937):241-54. PMID: 12748633

C01. Crust: A New Voronoi-Based Surface Reconstruction Algorithm (pdf) (scholar)

    Amenta, Bern, Kellis (Kamvysselis)

    We describe our experience with a new algorithm for the reconstruction of surfaces from unorganized sample points in 3D. The algorithm is the first for this problem with provable guarantees. Given a "good sample" from a smooth surface, the output is guaranteed to be topologically correct and convergent to the original surface as the sampling density increases. The definition of a good sample is itself interesting: the required sampling density varies locally, rigorously capturing the intuitive notion that featureless areas can be reconstructed from fewer samples. The output mesh interpolates, rather than approximates, the input points. Our algorithm is based on the three-dimensional Voronoi diagram. Given a good program for this fundamental subroutine, the algorithm is quite easy to implement.

    ACM SIGGRAPH, v. 32, p. 415-421, Jul 19, 1998


  • MIT EECS: Faculty Research Innovation Fellowship (2016)
  • Athens Technology Center for Research and Education (AIT): Niki Award (2011)
  • National Science Foundation: United States Presidential Early Career Award for Scientists and Engineers (PECASE) (2010)
  • MIT: Karl Van Tassel Career Development Professorship (2007)
  • National Science Foundation: CAREER award (2007)
  • Technology Review: TR35 - Top 35 Innovator Under the Age of 35 (2006)
  • Genome Technology: Top Young PIs (2006)
  • MIT: Distinguished Alumnus (1964) Career Development Chair (2005)
  • MIT EECS: Sprowls Doctoral Dissertation Award (2003)