
Seminar Series

September 29

ML Tea: Collapse-Proof Non-Contrastive Self-Supervised Learning / Data Attribution in High Dimensions and without Strong Convexity
Speakers: Emanuele Sansone and Ittai Rubinstein
4:00–5:00 PM ET

Bios: Emanuele Sansone is a Postdoctoral Fellow jointly affiliated with MIT (CSAIL) and KU Leuven (ESAT). His research interests lie at the intersection of unsupervised learning and mathematical logic. His research ambition is to empower machines with the capability to acquire and discover knowledge from data autonomously. He was recently awarded the Marie Curie Global Fellowship for the program titled “Discovering the World through Unsupervised Statistical Relational Learning”.

Ittai Rubinstein is a third-year PhD student in Computer Science at MIT, advised by Sam Hopkins and supported by the MathWorks EECS Fellowship. His research centers on algorithms, with a focus on data attribution and robust machine learning. Before MIT, he led a research team at Qedma working on quantum error suppression and mitigation. He holds a master’s degree in computer science from Tel Aviv University and a bachelor’s degree in mathematics, physics, and computer science from the Technion.

Abstracts: Self-supervised learning (SSL) has unlocked the ability to learn general-purpose representations from vast amounts of unlabeled data. Despite its successes, significant challenges remain, limiting the applicability and democratization of SSL. One key challenge lies in the failure modes that arise during SSL training. In this talk, we distill essential principles for reliably avoiding these known collapses. We introduce a principled yet simplified design of the projector and loss function for non-contrastive SSL, grounded in hyperdimensional computing. Theoretically, we show that this design induces an inductive bias that naturally encourages representations to become both decorrelated and clustered, without explicitly enforcing these properties. This bias provably improves generalization and is sufficient to prevent common training failures, including representation, dimensional, cluster, and intracluster collapses. We further validate our theoretical insights on image datasets, showing that our approach produces representations that retain richer information about the observed data while avoiding memorization. This opens the door to learning more structured representations.

Data attribution estimates the effect of removing a set of samples from a model's training set without retraining the model from scratch and is used for interpretability, credit assignment, privacy, and more. However, key approaches to data attribution significantly underestimate removal effects in the high-dimensional regime (#params >= Omega(#samples)), and existing theoretical analyses require strong convexity assumptions that rarely hold in practice, even for simple linear probes. In this talk, we will present a correction to the leading approaches to data attribution that improves accuracy in the high-dimensional regime, and we will present the first theoretical guarantees for the accuracy of data attribution without strong convexity.
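As standard background on the kind of quantity data attribution targets (classical linear-regression material, not the correction presented in the talk): for ordinary least squares, the exact effect of removing one training point has a closed form, and common influence-style approximations omit the leverage factor in the denominator, which is one classical reason removal effects can be understated when leverage is high.

```latex
% Leave-one-out effect of removing sample i in ordinary least squares,
% with design matrix X, responses y, and fit \hat{\beta} = (X^\top X)^{-1} X^\top y:
\hat{\beta}_{-i} \;=\; \hat{\beta} \;-\; \frac{(X^\top X)^{-1} x_i \,\bigl(y_i - x_i^\top \hat{\beta}\bigr)}{1 - h_{ii}},
\qquad h_{ii} = x_i^\top (X^\top X)^{-1} x_i .
% Influence-function-style estimates drop the 1/(1 - h_{ii}) factor, which shrinks the
% estimated effect when the leverage h_{ii} is large (as is common when #params is close to #samples).
```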

September 22

ML Tea: Bridging machine learning and optimization with computational metabolomics
Speaker: Runzhong Wang
4:00–5:00 PM ET

Abstract: (This is a practice job talk.) Solving hard optimization problems has been a longstanding challenge in computer science and beyond. Machine learning-based solvers have emerged as a promising direction, in which problem patterns from certain distributions are captured to facilitate faster and more accurate problem-solving. We studied the theory and methodology of machine learning solvers for permutation-based combinatorial optimization and demonstrated the superiority of machine learning over existing methods. Going beyond, we transferred these insights to a long-standing problem in science with a combinatorial nature: inferring molecular structures from liquid chromatography tandem mass spectrometry, a current bottleneck in computational metabolomics. We developed neural networks for in silico fragmentation that surpass existing approaches by a significant margin, achieving 40% accuracy at annotating the exact structure as the top prediction and 92% accuracy in the top 10. We demonstrated the utility of our approach in the life sciences, environmental science, chemistry, and biology through real-world case studies. We expect that continuing this research will not only enable new capabilities in science but also yield new insights for machine learning research.

September 15

ML Tea: Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis / Consensus-Driven Active Model Selection
Speakers: Aruna Sankaranarayanan and Justin Kay
4:00–5:00 PM ET

Bios: Aruna Sankaranarayanan is a PhD student supervised by Prof. Dylan Hadfield-Menell. Her research focuses on understanding and controlling human and model behavior, while improving model-human interactions. Her previous work includes studying how people distinguish deepfake videos from authentic ones, and investigating bias in opaque systems such as social-media advertising algorithms.

Justin Kay is a third-year PhD student at MIT, advised by Sara Beery and supported by fellowships from MIT EECS and NSF. His research focuses on making computer vision and machine learning systems more deployable and informative for science and decision-making, particularly for environmental and climate applications.

Abstracts: Where should we intervene on the internal activations of a large language model (LM) to control the naturalistic text it generates? Identifying effective steering locations in multi-token output settings is challenging because interventions can have complex, context-dependent effects, and evaluation often relies on costly human judgments or auxiliary models that provide only coarse feedback. To address this, we introduce contrastive causal mediation (CCM), a lightweight procedure for selecting steerable activation points by (1) constructing contrastive responses that succeed or fail in steering, (2) computing differences in generation probabilities, and (3) estimating the causal effect of hidden activations on these differences. We then situate CCM within a principled evaluation framework for representation engineering, which addresses four key desiderata: task-relevant contexts, consideration of model likelihoods, standardized comparisons across behaviors, and baseline methods. Across 3 models and 3 task settings (refusal, bias-aware feedback, and style transfer), we conduct over 5,400 experiments to show that CCM identifies effective intervention points under this recommended evaluation strategy. Together, these contributions demonstrate how combining causally grounded mechanistic interpretability with rigorous evaluation enables more effective and trustworthy control of large language models, even in naturalistic settings.

The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset, a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA significantly outperforms existing methods for active model selection, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state of the art. Our contribution is part of a larger research goal of how to best utilize human effort in the AI development and deployment lifecycle; while much prior research has focused on this question at training time, our work highlights the outsized benefits of emphasizing label efficiency at test time as well.
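As illustration only, and not the CODA algorithm itself, a minimal active-model-selection loop might maintain a Beta belief over each candidate's accuracy, request labels for points where the candidates disagree most, and update the beliefs as labels arrive. Everything below, including the toy `predictions` array and the disagreement heuristic, is a hypothetical simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 candidate classifiers' predictions on 200 unlabeled points,
# plus (hidden) true labels we only reveal when we "annotate" a point.
n_models, n_points = 3, 200
true_labels = rng.integers(0, 2, size=n_points)
accs = [0.9, 0.75, 0.6]  # hypothetical true accuracies, unknown to the selector
predictions = np.stack([
    np.where(rng.random(n_points) < a, true_labels, 1 - true_labels)
    for a in accs
])

# Beta(1, 1) belief over each candidate model's accuracy.
alpha = np.ones(n_models)
beta = np.ones(n_models)
labeled = np.zeros(n_points, dtype=bool)

for _ in range(30):  # annotation budget
    # Prefer unlabeled points where the candidates disagree most.
    disagreement = predictions.std(axis=0)
    disagreement[labeled] = -1.0
    i = int(np.argmax(disagreement))
    labeled[i] = True
    y = true_labels[i]  # "ask the annotator" for this point's label
    correct = predictions[:, i] == y
    alpha += correct
    beta += ~correct

posterior_mean = alpha / (alpha + beta)
print("estimated accuracies:", np.round(posterior_mean, 2))
print("selected model:", int(np.argmax(posterior_mean)))
```

In this toy loop the accuracy estimates are biased toward disagreement regions, but the ranking of candidates, which is all model selection needs, is typically recovered with far fewer labels than labeling points at random.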

May 05

ML Tea: Algorithm Design with Learned Predictions
Speaker: Justin Chen
4:00–5:00 PM ET

Abstract: The classic framing of algorithm design goes something like this: I give you a mathematical formulation of a problem by specifying valid inputs and outputs, and you give me an algorithm which, on any input, will produce a valid output using limited resources (e.g., time or memory). This “worst-case” analysis, which guarantees algorithmic correctness and efficiency over all possible inputs, is the foundation of computer science theory. This is for good reason: algorithms designed to perform well in the worst case are reliable, composable, and require no assumptions or prior knowledge of their applications.

In modern applications, algorithms are not run once in a vacuum on an unknown input. Procedures that process huge amounts of data are run hour after hour, day after day, on similar input datasets. These data may contain structure which is hard to formally incorporate into the algorithmic problem description. On the other hand, machine learning excels at extracting patterns which do not necessarily conform to simple-to-state mathematical rules. Learned models make probabilistic predictions of outputs given inputs without a formal problem description or algorithm designer; they tune themselves on training data.

In this talk, I will give an overview of the developing area of "algorithms with predictions" as well as several opportunities and challenges in approaching ML for algorithms.

Bio: Justin Y. Chen is a fifth-year PhD student studying theoretical computer science in the Electrical Engineering and Computer Science department at MIT, where he is advised by Piotr Indyk. He works on problems at the intersection of algorithm design, data analysis, and machine learning.

April 28

ML Tea: Evaluating Multiple Models Using Labeled and Unlabeled Data
Speaker: Shuvom Sadhuka
4:00–5:00 PM ET

Abstract: It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than competing methods, reducing error by 5.1× relative to using labeled data alone and 2.4× relative to the next best competing method. SSME also improves accuracy when evaluating performance across subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating the performance of language models.

Bio: Shuvom Sadhuka is a third-year PhD student in EECS, advised by Bonnie Berger. His research interests center on evaluation and uncertainty quantification, often with applications to biomedical data. In particular, he is interested in how to conduct evaluations of machine learning systems (both the data and models) along critical axes such as privacy and calibration in constrained settings (e.g., sparse or noisy labels). His PhD is supported by a Hertz Fellowship and an NSF GRFP. Prior to MIT, he received an AB in Computer Science and Statistics from Harvard.
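As a toy illustration of the semi-supervised mixture idea, and not the SSME implementation, one can fit a two-component Gaussian mixture to a single classifier's scores using a handful of labeled points plus many unlabeled ones, then read an accuracy estimate off the fitted joint model. The score distributions, decision threshold, and sample sizes below are all hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical classifier scores: class 1 tends to score high, class 0 low.
n_lab, n_unlab = 20, 2000
y_lab = rng.integers(0, 2, n_lab)
s_lab = rng.normal(np.where(y_lab == 1, 1.5, -1.0), 1.0)
y_unlab = rng.integers(0, 2, n_unlab)          # hidden during fitting
s_unlab = rng.normal(np.where(y_unlab == 1, 1.5, -1.0), 1.0)

# Semi-supervised EM for a two-component 1D Gaussian mixture over scores.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-0.5, 0.5]), np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibilities; labeled points keep their known class.
    dens = np.stack([pi[k] * norm.pdf(s_unlab, mu[k], sigma[k]) for k in (0, 1)])
    r_unlab = dens / dens.sum(axis=0)
    r_lab = np.stack([(y_lab == 0).astype(float), (y_lab == 1).astype(float)])
    r = np.concatenate([r_lab, r_unlab], axis=1)
    s = np.concatenate([s_lab, s_unlab])
    # M-step: update mixture weights, means, and standard deviations.
    nk = r.sum(axis=1)
    pi = nk / nk.sum()
    mu = (r * s).sum(axis=1) / nk
    sigma = np.sqrt((r * (s - mu[:, None]) ** 2).sum(axis=1) / nk)

# Estimate accuracy of thresholding scores at 0 under the fitted joint model.
p1 = pi[1] * norm.pdf(s_unlab, mu[1], sigma[1])
p0 = pi[0] * norm.pdf(s_unlab, mu[0], sigma[0])
p_class1 = p1 / (p0 + p1)
pred = (s_unlab > 0).astype(float)
est_acc = np.mean(pred * p_class1 + (1 - pred) * (1 - p_class1))
print("estimated accuracy:", round(float(est_acc), 3))
print("true accuracy:     ", round(float(np.mean(pred == y_unlab)), 3))
```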

April 23

ML Tea: Do Large Language Model Benchmarks Test Reliability?
Speakers: Josh Vendrow and Eddie Vendrow
4:00–5:00 PM ET

Abstract: When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior.

Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle.

Bios: Josh is a third-year PhD student working with Aleksander Madry. Josh's research focuses on building machine learning models that are safe and robust when deployed in the real world. Eddie is a second-year PhD student advised by Sara Beery and supported by the MIT Presidential Fellowship and an NSF GRFP. Eddie is interested in bringing automation to scientific discovery, including by building systems and agents that can autonomously carry out scientific data collection, data science, and analysis.

April 14

ML Tea: Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
Speakers: Tian Jin and Ellie Cheng
4:00–5:00 PM ET

Abstract: Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explores parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that allows LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on the fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction-following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21× to 1.93× with corresponding quality changes of +2.2% to -7.1%, measured as length-controlled win rates.

Bios: Tian Jin is a 5th-year PhD student at MIT, advised by Michael Carbin and Jonathan Ragan-Kelley. His research focuses on machine learning and programming systems. Previously, Tian was a Research Engineer at IBM Research, where he led efforts to enable deep neural network inference on IBM mainframe machines and contributed to compiler support for the IBM Summit supercomputer. He holds a dual degree in Computer Science and Mathematics from Haverford College.

Ellie is a 3rd-year PhD student at CSAIL, advised by Michael Carbin. Her research interests are in the intersection of programming languages and machine learning.

April 07

ML Tea: Activation-Informed Merging of LLMs
Speaker: Kaveh Alimohammadi
4:00–5:00 PM ET

Abstract: Model merging has emerged as an efficient strategy for combining multiple fine-tuned large language models (LLMs) while avoiding the computational overhead of retraining. However, existing methods often overlook the importance of activation-space information in guiding the merging process. In this talk, I will introduce Activation-Informed Merging (AIM), a novel technique that enhances the robustness and performance of merged models by incorporating activation-space insights. AIM is designed as a complementary framework that can be applied to any merging approach, preserving critical weights from the base model through principles drawn from continual learning and model compression. By utilizing a task-agnostic calibration set, AIM selectively prioritizes essential parameters, leading to significant performance improvements across multiple benchmarks, with up to a 40% increase in effectiveness.
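A rough sketch of the general flavor of activation-informed merging, offered as my own illustration rather than the AIM algorithm: score each neuron of a layer by its activation magnitude on a small calibration set, then let the merged weights deviate less from the base model where activations are large. The layer shapes, calibration data, and interpolation rule below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single linear layer from a "base" model and two fine-tuned variants.
d_in, d_out = 16, 8
W_base = rng.normal(size=(d_out, d_in))
W_task_a = W_base + 0.1 * rng.normal(size=(d_out, d_in))
W_task_b = W_base + 0.1 * rng.normal(size=(d_out, d_in))

# Task-agnostic calibration inputs (hypothetical).
X_calib = rng.normal(size=(256, d_in))

def neuron_importance(W, X):
    """Mean absolute activation of each output neuron on the calibration set."""
    acts = X @ W.T                      # (n_calib, d_out)
    return np.abs(acts).mean(axis=0)    # (d_out,)

# Plain merge: average the fine-tuned weights.
W_avg = 0.5 * (W_task_a + W_task_b)

# Activation-informed adjustment: pull rows with high base-model activation
# back toward the base weights, protecting the most active neurons.
imp = neuron_importance(W_base, X_calib)
protect = imp / imp.max()               # in [0, 1]; higher means more protected
W_merged = protect[:, None] * W_base + (1.0 - protect[:, None]) * W_avg

print("max deviation from base (plain merge):     ", float(np.abs(W_avg - W_base).max()))
print("max deviation from base (activation-aware):", float(np.abs(W_merged - W_base).max()))
```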

March 17

ML Tea: Aggregating fMRI datasets for training brain-optimized models of human vision
Speaker: Benjamin Lahner
4:00–4:45 PM ET

Abstract: Large-scale fMRI datasets are revolutionizing our understanding of the neural processes underlying human perception, driving new breakthroughs in neuroscience and computational modeling. Yet individual fMRI data collection efforts remain constrained by practical limitations in scan time, creating an inherent tradeoff between subjects, stimuli, and stimulus repetitions. This tradeoff often compromises stimulus diversity, data quality, and generalizability of findings, such that even the largest fMRI datasets cannot fully leverage the power of high-parameter artificial neural network models and high-dimensional feature spaces. To overcome these challenges, we introduce MOSAIC (Meta-Organized Stimuli And fMRI Imaging data for Computational modeling): a scalable framework for aggregating fMRI responses across multiple subjects and datasets. We preprocessed and registered eight event-related fMRI vision datasets (Natural Scenes Dataset, Natural Object Dataset, BOLD Moments Dataset, BOLD5000, Human Actions Dataset, Deeprecon, Generic Object Decoding, and THINGS) to the fsLR32k cortical surface space with fMRIPrep to obtain 430,007 fMRI-stimulus pairs over 93 subjects and 162,839 unique stimuli. We estimated single-trial beta values with GLMsingle (Prince et al., 2022), obtaining parameter estimates of similar or higher quality than the originally published datasets. Critically, we curated the dataset by eliminating stimuli with perceptual similarity above a defined threshold to prevent test-train leakage. This rigorous pipeline resulted in a well-defined stimulus-response dataset with 144,360 training stimuli, 18,145 test stimuli, and 334 synthetic stimuli well-suited for building and evaluating robust models of human vision. We show preliminary results using MOSAIC to investigate how the internal representations of brain-optimized neural networks differ from those of task-optimized neural networks, and we perform a large-scale decoding analysis that highlights the importance of stimulus set diversity. This framework empowers the vision science community to collaboratively generate a scalable, generalizable foundation for studying human vision.

Bio: Ben Lahner is a PhD candidate in computational neuroscience working with Dr. Aude Oliva. His research combines fMRI data with machine learning and deep learning techniques to better understand facets of the human visual system. His previous work has investigated visual memory, action understanding, and video decoding from brain activity patterns.
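One concrete step mentioned above is removing near-duplicate stimuli across train and test splits. A minimal sketch of that kind of filter, assuming precomputed perceptual embeddings and a hypothetical similarity threshold (not the MOSAIC pipeline itself), might look like this.

```python
import numpy as np

def drop_leaky_train_stimuli(train_emb, test_emb, threshold=0.95):
    """Flag training stimuli whose cosine similarity to any test stimulus
    exceeds `threshold` (embeddings and threshold are hypothetical)."""
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = train @ test.T                      # (n_train, n_test) cosine similarities
    keep = sims.max(axis=1) < threshold
    return keep

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 128))
test_emb = np.vstack([rng.normal(size=(199, 128)), train_emb[:1]])  # one planted near-duplicate
keep = drop_leaky_train_stimuli(train_emb, test_emb)
print("kept", int(keep.sum()), "of", len(keep), "training stimuli")
```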

March 10

ML Tea: Unsupervised Discovery of Interpretable Structure in Complex Systems
Speaker: Mark Hamilton
4:00–5:00 PM ET

Abstract: How does the human mind make sense of raw information without being taught how to see or hear? In this talk we will explore how to build algorithms that can uncover interpretable structure from large collections of unsupervised data like images and video. First, I will describe how to classify every pixel of a collection of images without any human annotations (unsupervised semantic segmentation) by distilling self-supervised vision models. Second, we’ll see how this basic idea leads us to a new unifying theory of representation learning, and I will show how 20 different common machine learning methods such as dimensionality reduction, clustering, contrastive learning, and spectral methods emerge from a single unified equation. Finally, we’ll use this unified theory to create algorithms that can decode natural language just by watching unlabeled videos of people talking, without any knowledge of text. This work is the first step in our broader effort to translate animals using large-scale, unsupervised, and interpretable learners, and the talk will conclude with some of our most recent efforts to analyze the complex vocalizations of Atlantic spotted dolphins.

Bio: Mark Hamilton is a PhD student in William T. Freeman's lab at the MIT Computer Science & Artificial Intelligence Laboratory. He is also a Senior Engineering Manager at Microsoft, where he leads a team building large-scale distributed ML products for Microsoft’s largest databases. Mark is interested in how we can use unsupervised machine learning to discover scientific "structure" in complex systems. Mark values working on projects for social, cultural, and environmental good and aims to use his algorithms to help humans solve challenges they cannot solve alone.

March 03

ML Tea: Learning Generative Models from Corrupted Data
Speaker: Giannis Daras
4:00–5:00 PM ET

Abstract: In scientific applications, generative models are used to regularize solutions to inverse problems. The quality of the models depends on the quality of the data on which they are trained. While natural images are abundant, in scientific applications access to high-quality data is scarce, expensive, or even impossible. For example, in MRI the quality of the scan is proportional to the time spent in the scanner, and in black-hole imaging we can only access lossy measurements. Contrary to high-quality data, noisy samples are generally more accessible. If we had a method to transform noisy points into clean ones, e.g., by sampling from the posterior, we could address these challenges. A standard approach would be to use a pre-trained generative model as a prior. But how can we train these priors in the first place without having access to clean data? We show that one can escape this chicken-and-egg problem using diffusion-based algorithms that account for the corruption at training time. We present the first algorithm that provably recovers the distribution given only noisy samples of a fixed variance. We extend our algorithm to account for heterogeneous data where each training sample has a different noise level. The underlying mathematical tools can be generalized to linear measurements, with the potential of accelerating MRI. Our method has deep connections to the literature on learning supervised models from corrupted data, such as SURE and Noise2X. Our framework opens exciting possibilities for generative modeling in data-constrained scientific applications. We are actively working on applying this to denoise proteins, and we present some first results in this direction.

Bio: Giannis Daras is a postdoctoral researcher at MIT working closely with Prof. Costis Daskalakis and Prof. Antonio Torralba. Prior to MIT, Giannis completed his Ph.D. at UT Austin under the supervision of Prof. Alexandros G. Dimakis. Giannis is interested in generative modeling and the applications of generative models to inverse problems. A key aspect of his work involves developing algorithms for learning generative models from noisy data. His research has broad implications across various fields, including scientific applications, privacy and copyright concerns, and advancing data-efficient learning techniques.
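For background on why posterior sampling and score estimation are so closely linked here, recall Tweedie's formula, a classical identity rather than a result of this talk: under additive Gaussian noise, the posterior mean of the clean signal is a simple function of the score of the noisy distribution.

```latex
% Tweedie's formula: if y = x + \sigma z with z \sim \mathcal{N}(0, I), then
\mathbb{E}[\,x \mid y\,] \;=\; y + \sigma^{2}\, \nabla_{y} \log p_{\sigma}(y),
% where p_{\sigma} is the density of the noisy observations y. Estimating the score of the
% noisy data therefore already yields an optimal (MMSE) denoiser for that noise level.
```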

February 24

ML Tea: Score-of-Mixture Training: One-Step Generative Model Training via Score Estimation of Mixture Distributions
4:00–5:00 PM ET

Abstract: We propose Score-of-Mixture Training (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the α-skew Jensen–Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch (SMT) and distillation using a pretrained diffusion model, which we call Score-of-Mixture Distillation (SMD). It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64×64 show that SMT/SMD are competitive with and can even outperform existing methods.

Bio: Tejas is a final-year PhD student in the Signals, Information and Algorithms Lab, advised by Professor Gregory Wornell. His research interests are centered around statistical inference, information theory, and generative modeling, with a recent focus on fundamental and applied aspects of score estimation and diffusion-based generative models. During his PhD, Tejas has interned at Meta AI, Google Research, Adobe Research, and Mitsubishi Electric Research Labs. He is currently a recipient of the MIT Claude E. Shannon Fellowship.
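One common way to write the ingredients named above, in my notation and not necessarily the exact convention used in the talk: the mixture of the real and fake distributions, the skewed Jensen–Shannon divergence built from it, and the mixture score that the method estimates.

```latex
% Mixture of real (P) and fake (Q) distributions, with skew parameter \alpha \in (0, 1):
M_{\alpha} = \alpha P + (1 - \alpha) Q,
% a common convention for the \alpha-skew Jensen--Shannon divergence:
\mathrm{JS}_{\alpha}(P \,\|\, Q) = \alpha\, \mathrm{KL}(P \,\|\, M_{\alpha}) + (1 - \alpha)\, \mathrm{KL}(Q \,\|\, M_{\alpha}),
% and the corresponding mixture score that score-of-mixture-style training estimates:
\nabla_{x} \log m_{\alpha}(x), \qquad m_{\alpha} = \alpha\, p_{\mathrm{data}} + (1 - \alpha)\, p_{\mathrm{model}}.
```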

February 19

ML Tea: Theoretical Perspectives on Data Quality and Selection
Speaker: Abhishek Shetty
4:00–5:00 PM ET

Abstract: Though the fact that data quality directly affects the quality of our predictions has always been understood, the large-scale data requirements of modern machine learning tasks have brought to the fore the need for a richer vocabulary for understanding the quality of collected data for the prediction tasks of interest, and the need for algorithms that most effectively use collected data. Though this has been studied in various contexts, such as distribution shift, multitask learning, and sequential decision making, there remains a need to develop techniques that address problems faced in practice. Toward the aim of starting a dialogue between the practical and theoretical perspectives on these important problems, I will survey some recent techniques developed in TCS and statistics addressing data quality and selection.

Bio: Abhishek Shetty is an incoming Catherine M. and James E. Allchin Early-Career Assistant Professor in the School of Computer Science at Georgia Tech and is currently a FODSI Postdoctoral Fellow at MIT, hosted by Sasha Rakhlin, Ankur Moitra, and Costis Daskalakis. He graduated from the department of EECS at UC Berkeley, advised by Nika Haghtalab. His interests lie at the intersection of machine learning, theoretical computer science, and statistics, and are aimed at developing statistically and computationally efficient algorithms for inference. His research has been recognized with the Apple AI/ML fellowship and the American Statistical Association SCGS best student paper award.

December 02

Truthfulness of Calibration Measures

Mingda Qiao
MIT CSAIL

4:00–4:30 PM ET

Abstract: We initiate the study of the truthfulness of calibration measures in sequential prediction. A calibration measure is said to be truthful if the forecaster (approximately) minimizes the expected penalty by predicting the conditional expectation of the next outcome, given the prior distribution of outcomes. Truthfulness is an important property of calibration measures, ensuring that the forecaster is not incentivized to exploit the system with deliberately poor forecasts. This makes it an essential desideratum for calibration measures, alongside typical requirements such as soundness and completeness.

We conduct a taxonomy of existing calibration measures and their truthfulness. Perhaps surprisingly, we find that all of them are far from being truthful. That is, under existing calibration measures, there are simple distributions on which a polylogarithmic (or even zero) penalty is achievable, while truthful prediction leads to a polynomial penalty. Our main contribution is the introduction of a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE), under which truthful prediction is optimal up to a constant multiplicative factor.

Bio: Mingda Qiao is a FODSI postdoc hosted by Ronitt Rubinfeld at the MIT Theory of Computation (TOC) Group and an incoming assistant professor at UMass Amherst (starting Fall 2025). His research focuses on the theory of prediction, learning, and decision-making in sequential settings, as well as collaborative federated learning. Prior to MIT, Mingda was a FODSI postdoc at UC Berkeley, received his PhD in Computer Science from Stanford University, and received his BEng in Computer Science from Tsinghua University.
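One way to write the truthfulness notion from the abstract in symbols, using my own notation with binary outcomes for concreteness (the talk's formal definition may differ in details):

```latex
% Sequential prediction with outcomes y_t \in \{0,1\} and forecasts p_t \in [0,1].
% Let \mathrm{Pen}(p_{1:T}, y_{1:T}) denote the penalty assigned by a calibration measure.
% The "truthful" forecast at step t is the conditional expectation of the next outcome,
p_t^{\star} = \mathbb{E}\bigl[\, y_t \mid y_{1:t-1} \,\bigr],
% and a measure is (approximately) truthful if, for every outcome distribution,
\mathbb{E}\bigl[\mathrm{Pen}(p_{1:T}^{\star}, y_{1:T})\bigr]
\;\le\; C \cdot \min_{\text{forecasting strategies}} \mathbb{E}\bigl[\mathrm{Pen}(p_{1:T}, y_{1:T})\bigr]
% for a modest factor C, so truthful forecasting never pays much more than gaming the measure.
```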

November 25

Power of inclusion: Enhancing polygenic prediction with admixed individuals
Speaker: Yosuke Tanigawa
4:00–5:00 PM ET
Zoom link: https://mit.zoom.us/j/94204370795?pwd=eFZwYXVuWmVsQzE1UTRZN2VtY0lkUT09 (passcode 387975)

Abstract: Predicting heritable traits and genetic liability of disease from individuals’ genomes has important implications for tailoring medical prevention and intervention strategies in precision medicine. The polygenic score (PGS), a statistical approach, has recently attracted substantial attention due to its potential relevance in clinical practice. Admixed individuals offer unique opportunities for addressing the limited transferability of PGSs. However, they are rarely considered in PGS training, given the challenges in representing ancestry-matched linkage-disequilibrium reference panels for admixed individuals. Here we present inclusive PGS (iPGS), which captures ancestry-shared genetic effects by finding the exact solution for penalized regression on individual-level data and is thus naturally applicable to admixed individuals. We validate our approach in a simulation study across 33 configurations with varying heritability, polygenicity, and ancestry composition in the training set. When iPGS is applied to n = 237,055 ancestry-diverse individuals in the UK Biobank, it shows the greatest improvements for Africans, by 48.9% on average across 60 quantitative traits and up to 50-fold for some traits (neutrophil count, R2 = 0.058), over the baseline model trained on the same number of European individuals. When we allowed iPGS to use n = 284,661 individuals, we observed an average improvement of 60.8% for African, 11.6% for South Asian, 7.3% for non-British White, 4.8% for White British, and 17.8% for the other individuals. We further developed iPGS+refit to jointly model the ancestry-shared and ancestry-dependent genetic effects when heterogeneous genetic associations were present. For neutrophil count, for example, iPGS+refit showed the highest predictive performance in the African group (R2 = 0.115), which exceeds the best predictive performance for the White British group (R2 = 0.090 in the iPGS model), even though only 1.49% of individuals used in the iPGS training are of African ancestry. Our results indicate the power of including diverse individuals in developing more equitable PGS models.

Bio: Yosuke Tanigawa, PhD, is a research scientist at MIT's Computer Science and Artificial Intelligence Lab. To incorporate interindividual differences in disease prevention and treatment, he develops computational and statistical methods, focusing on predictive modeling with high-dimensional human genetics data, multi-omic dissection of disease heterogeneity, and therapeutic target discovery. His recent work focuses on inclusive training strategies for genetic prediction algorithms and dissecting the molecular, cellular, and genetic basis of phenotypic heterogeneity in Alzheimer's disease. He has received many awards, including the Charles J. Epstein Trainee Award for Excellence in Human Genetics Research and MIT Technology Review's Innovators Under 35 Japan.
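As a generic illustration of the penalized-regression-on-individual-level-data idea that underlies PGS training of this kind (not the iPGS software, and with a small simulated genotype matrix rather than biobank-scale data), one might fit an elastic-net model to standardized genotype dosages and score held-out individuals.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)

# Toy simulated data: 2,000 individuals x 2,000 variants (real PGS training is far larger).
n, p = 2000, 2000
maf = rng.uniform(0.05, 0.5, size=p)                  # minor allele frequencies
G = rng.binomial(2, maf, size=(n, p)).astype(float)   # genotype dosages in {0, 1, 2}
G = (G - G.mean(axis=0)) / G.std(axis=0)              # standardize each variant

causal = rng.choice(p, size=50, replace=False)        # sparse causal architecture
beta = np.zeros(p)
beta[causal] = rng.normal(0, 0.1, size=50)
y = G @ beta + rng.normal(0, 1.0, size=n)             # heritable trait plus noise

# Penalized regression on individual-level data (elastic net as a stand-in penalty).
model = ElasticNetCV(l1_ratio=0.5, cv=3, n_alphas=20).fit(G[:1500], y[:1500])
score = model.predict(G[1500:])                        # the toy polygenic score
r2 = np.corrcoef(score, y[1500:])[0, 1] ** 2
print(f"held-out R^2 of the toy polygenic score: {r2:.3f}")
```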

November 18

Dependence Induced Representation Learning

Xiangxiang Xu
EECS/RLE, MIT

4:00–5:00 PM ET

Abstract: Despite the vast progress in deep learning practice, theoretical understandings of learned feature representations remain limited. In this talk, we discuss three fundamental questions from a unified statistical perspective:
(1) What representations carry useful information?
(2) How are representations learned from distinct algorithms related?
(3) Can we separate representation learning from solving specific tasks?
We formalize representations that extract statistical dependence from data, termed dependence-induced representations. We prove that representations are dependence-induced if and only if they can be learned from specific features defined by Hirschfeld–Gebelein–Rényi (HGR) maximal correlation. This separation theorem signifies the key role of HGR features in representation learning and enables a modular design of learning algorithms. Specifically, we demonstrate the optimality of HGR features in simultaneously achieving different design objectives, including minimal sufficiency (Tishby's information bottleneck), information maximization, enforcing uncorrelated features (VICReg), and encoding information at various granularities (Matryoshka representation learning). We further illustrate that by adapting HGR features, we can obtain representations learned by distinct practices, from cross-entropy or hinge loss minimization, non-negative feature learning, and neural density ratio estimators to their regularized variants. We also discuss the applications of our analyses in interpreting learning phenomena such as neural collapse, understanding existing self-supervised learning practices, and obtaining more flexible designs, e.g., inference-time hyperparameter tuning.

Bio: Xiangxiang Xu received the B.Eng. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2014 and 2020, respectively. He is a postdoctoral associate in the Department of EECS at MIT. His research focuses on information theory, statistical learning, representation learning, and their applications in understanding and developing learning algorithms. He is a recipient of the 2016 IEEE PES Student Prize Paper Award in Honor of T. Burke Hayes and the 2024 ITA (Information Theory and Applications) Workshop Sand Award.
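For reference, the HGR maximal correlation mentioned above has a classical variational definition (standard background, not one of the talk's new results):

```latex
% Hirschfeld--Gebelein--R\'enyi maximal correlation between random variables X and Y:
\rho_{\mathrm{HGR}}(X; Y) \;=\; \sup_{f,\, g}\; \mathbb{E}\bigl[f(X)\, g(Y)\bigr]
\quad \text{subject to} \quad
\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0, \qquad
\mathbb{E}[f(X)^{2}] = \mathbb{E}[g(Y)^{2}] = 1 .
% The optimizing functions f and g are the "HGR features" referred to in the abstract.
```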

November 13

October 28

Generative Models for Biomolecular Prediction, Dynamics, and Design
4:00–5:00 PM ET
Location: 32-G882 (Hewlett)

Abstract: We lay out the three avenues in which we think generative models are especially valuable for modeling biomolecules. (1) Hard prediction tasks can be better addressed with generative models that can suggest and rank multiple solutions (e.g., docking). (2) The dynamics and conformations of biomolecules can be captured with generative models (e.g., protein conformational ensembles and MD trajectories). (3) Designing new biomolecules can be accelerated, informed by samples or likelihoods from generative models (e.g., protein binder or regulatory DNA design).

October 21

Objective Approaches in a Subjective Medical World
4:00–5:00 PM ET
Location: 32-G882

Abstract: In today’s healthcare system, patients often feel disconnected from clinical professionals and their care journey. They receive a “one-size-fits-all” plan and are left out of the decision-making process, which can lead to a less satisfying experience. My research focuses on applying advanced AI technologies, including large language models, machine learning, and IoT, to address challenges in healthcare, particularly in patient-centered healthcare delivery. I aim to enhance the accuracy and efficiency of healthcare systems by using these "objective approaches" to navigate the subjective aspects of medical practice, such as clinician notes and patient preferences found in electronic health records. A key aspect of my work is improving the transparency of AI-based healthcare applications, making them more understandable and trustworthy for both clinicians and patients, by addressing critical issues such as building trust in AI systems and ensuring these technologies effectively meet the needs of patients and healthcare providers. Additionally, I emphasize the importance of personalizing healthcare by considering each patient's unique circumstances, including their preferences and socio-economic conditions. This research applies AI across various areas, from specific diseases like cancer to broader healthcare contexts, with the goal of improving both the delivery and experience of healthcare. My work contributes to the development of AI tools that not only enhance clinical decision-making but also foster better human-AI interaction, ultimately leading to improved healthcare outcomes.

October 16