November 17
Add to Calendar
2025-11-17 16:00:00
2025-11-17 17:00:00
America/New_York
ML Tea: Domain-Aware Scaling Laws Uncover Data Synergy / Ambient Diffusion Omni: Training Good Models with Bad Data
Speakers: Kimia Hamidieh and Adrián Rodríguez
Bios:
1 - Kimia is a PhD student at MIT, and her research focuses on data-centric approaches to responsible AI, with emphasis on understanding how data composition shapes model capabilities. Her work has appeared at venues including ICLR, NeurIPS, AIES, and FAccT.
2 - Adrián Rodríguez-Muñoz is a 4th-year grad student at MIT EECS under the supervision of Prof. Antonio Torralba. His research focuses on learning how to use all data effectively, such as low-quality and out-of-distribution data in generative models, and even procedurally generated data in vision models, with direct applications to data-constrained domains such as science.
Abstracts:
1 - Machine learning progress is often attributed to scaling model size and dataset volume, yet the composition of data can be just as consequential. Empirical findings repeatedly show that combining datasets from different domains yields nontrivial interactions: adding code improves mathematical reasoning, while certain mixtures introduce interference that suppresses performance. We refer to these effects collectively as data synergy: interaction effects whereby the joint contribution of multiple domains exceeds (positive synergy) or falls short of (interference) the sum of their isolated contributions. In this work, we formalize and quantify dataset interactions in large language models. Leveraging observational variation across open-weight LLMs with diverse pretraining mixtures, we estimate both direct domain-to-benchmark synergy (how one domain contributes to performance on another) and pretraining data synergy (capabilities that require co-occurrence of multiple domains). Our framework improves predictive accuracy over domain-agnostic scaling laws, recovers stable synergy patterns such as math–code complementarity, and yields interpretable maps of cross-domain transfer. These results demonstrate that understanding and exploiting data synergy is essential for designing data mixtures and curating corpora in the next generation of foundation models.
2 - The first part of the talk shows how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from all available images during training. Our framework exploits two properties of natural images: spectral power-law decay and locality. The second part of the talk explores how to iteratively evolve heterogeneous-quality datasets. We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a dataset-model co-evolution process; at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self-consuming loops, at each generation we treat the synthetically improved samples as low-quality, but at a slightly higher quality level than the previous iteration, and we use Ambient Diffusion techniques for learning under corruption.
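Illustrative example (not from the talk): one way to picture the domain-to-benchmark synergy estimation described above is a regression with pairwise interaction terms over observed pretraining mixtures and benchmark scores. The domains, data, and model below are synthetic placeholders, not the authors' estimator.

```python
# Hypothetical sketch: estimating pairwise "data synergy" with an interaction-term
# regression over (pretraining mixture, benchmark score) pairs. Domain names, data,
# and the model form are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
domains = ["web", "code", "math"]

# Fractions of each domain in the pretraining mix for 50 hypothetical models.
X = rng.dirichlet(np.ones(len(domains)), size=50)

# Synthetic benchmark score: additive effects plus a positive math-code synergy.
score = 0.2 * X[:, 0] + 0.5 * X[:, 1] + 0.6 * X[:, 2] + 1.5 * X[:, 1] * X[:, 2]
score += rng.normal(scale=0.02, size=50)

# Features: domain fractions plus all pairwise products (interaction terms).
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
feats = np.hstack([X] + [X[:, [i]] * X[:, [j]] for i, j in pairs])

reg = LinearRegression().fit(feats, score)
for (i, j), coef in zip(pairs, reg.coef_[len(domains):]):
    # coef > 0 suggests positive synergy between the two domains; coef < 0 suggests interference
    print(f"synergy({domains[i]}, {domains[j]}) = {coef:.2f}")
```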
TBD
November 12
Add to Calendar
2025-11-12 16:00:00
2025-11-12 17:00:00
America/New_York
ML Tea: State, Polynomials, and Parallelism in a Time of Neural Sequence Modeling
Speaker: Morris Yau
Title: State, Polynomials, and Parallelism in a Time of Neural Sequence Modeling
Abstract: Is there an algorithm that learns the best-fit parameters of a Transformer for any dataset? If I trained a neural sequence model and promised you it is equivalent to a program, how would you even be convinced? Modern RNNs are functions that admit parallelizable recurrence; what is the design space of parallelizable recurrences? Are there unexplored function families that lie between RNNs and Transformers? We explore these questions from first principles, starting with state, polynomials, and parallelism.
Speaker Bio: Morris is a final-year PhD student in the labs of Prof. Jacob Andreas and Prof. Stefanie Jegelka. He studies the algorithmic foundations of neural sequence modeling and finds joy in exploring the power of simple ideas. He can frequently be found sipping tea on the 7th floor of Schwarzman.
TBD
November 03
Add to Calendar
2025-11-03 16:00:00
2025-11-03 17:00:00
America/New_York
ML Tea: PDDL-Instruct: Enhancing Symbolic Planning Capabilities in LLMs through Logical Chain-of-Thought Instruction Tuning / Incentive-Aware Dynamic Pricing for Constrained Resource Allocation with Strategic Agents
Speakers: Pulkit Verma and Yan Dai
Bio 1 – Pulkit Verma is a Postdoctoral Associate at the Interactive Robotics Group at the Massachusetts Institute of Technology, where he works with Prof. Julie Shah. His research focuses on the safe and reliable behavior of taskable AI agents. He investigates the minimal set of requirements in an AI system that would enable a user to assess and understand the limits of its safe operability. He received his Ph.D. in Computer Science from Arizona State University, where he worked with Prof. Siddharth Srivastava. Before that, he completed his M.Tech. in Computer Science and Engineering at IIT Guwahati with Prof. Pradip K. Das. He was awarded the AAAI/ACM SIGAI Innovative AI Education Award at AAAI's EAAI Symposium in 2025, the Graduate College Completion Fellowship at ASU in 2023, and the Post Graduation Scholarship from the Government of India in 2013 and 2014, and he received the Best Demo Award at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) in 2022.
Bio 2 – Yan Dai is a 2nd-year PhD student in Operations Research, co-advised by Prof. Patrick Jaillet and Prof. Negin Golrezaei. His recent research focuses on tackling EconCS challenges via an online learning toolbox. He is also interested in bandits, reinforcement learning theory, and optimization for deep learning. He is active in the COLT, ICML, NeurIPS, and ICLR communities. He won the Best Paper Award at ACM SIGMETRICS 2025.
Abstract 1 – Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, yet their ability to perform structured symbolic planning remains limited, particularly in domains requiring formal representations like the Planning Domain Definition Language (PDDL). In this paper, we present a novel instruction tuning framework designed to enhance LLMs' symbolic planning capabilities through logical chain-of-thought reasoning. Our approach focuses on teaching models to rigorously reason about action applicability, state transitions, and plan validity using explicit logical inference steps. By developing instruction prompts that guide models through the precise logical reasoning required to determine when actions can be applied in a given state, we enable LLMs to self-correct their planning processes through structured reflection. The framework systematically builds verification skills by decomposing the planning process into explicit reasoning chains about precondition satisfaction, effect application, and invariant preservation. Experimental results on multiple planning domains show that our chain-of-thought instruction-tuned models are significantly better at planning, achieving planning accuracy of up to 94% on standard benchmarks, a 66% absolute improvement over baseline models. This work bridges the gap between the general reasoning capabilities of LLMs and the logical precision required for automated planning, offering a promising direction for developing better AI planning systems.
Abstract 2 – Motivated by applications such as cloud platforms allocating GPUs to users or governments deploying mobile health units across competing regions, we study the dynamic allocation of a reusable resource to strategic agents with private valuations. Our objective is to simultaneously (i) maximize social welfare, (ii) satisfy multi-dimensional long-term cost constraints, and (iii) incentivize truthful reporting. We begin by numerically evaluating primal-dual methods widely used in constrained online optimization and find them to be highly fragile in strategic settings: agents can easily manipulate their reports to distort future dual updates to their own advantage. To address this vulnerability, we develop an incentive-aware framework that makes primal-dual methods robust to strategic behavior. Our design combines epoch-based lazy updates, where dual variables remain fixed within each epoch, with randomized exploration rounds that extract approximately truthful signals for learning. Leveraging carefully designed online learning subroutines for the dual updates, which may be of independent interest, our mechanism achieves $\tilde O(\sqrt T)$ social welfare regret, satisfies all cost constraints, and ensures incentive alignment. This matches the performance of non-strategic allocation approaches while being robust to strategic agents.
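Illustrative example (not from the talk): a minimal sketch of epoch-based lazy dual updates with randomized exploration rounds, in the spirit of the design described above. The agent model, constants, and update rule are invented for illustration and are not the paper's mechanism.

```python
# Toy sketch: dual variables stay frozen within each epoch ("lazy" updates), and a few
# randomized exploration rounds per epoch provide the signals used at the next update.
import numpy as np

rng = np.random.default_rng(1)
T, epoch_len, eta, capacity = 1000, 50, 0.05, 0.6
dual = 0.0                                   # price on the long-term cost constraint
welfare, epoch_reports = 0.0, []             # reports collected in exploration rounds

for t in range(T):
    explore = rng.random() < 0.1             # randomized exploration round
    value = rng.uniform(0, 1)                # agent's (possibly strategic) report
    if explore:
        epoch_reports.append(value)          # only exploration reports feed the dual update
    if value >= dual:                        # primal step: allocate if reported value beats price
        welfare += value
    if t % epoch_len == epoch_len - 1:       # lazy dual update at epoch boundaries only
        demand = np.mean([v >= dual for v in epoch_reports]) if epoch_reports else 0.0
        dual = max(0.0, dual + eta * (demand - capacity))
        epoch_reports = []

print(f"welfare = {welfare:.1f}, final dual price = {dual:.3f}")
```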
TBD
October 27
Add to Calendar
2025-10-27 16:00:00
2025-10-27 17:00:00
America/New_York
ML Tea: RL's Razor: Why On-Policy Reinforcement Learning Forgets Less
Speaker: Idan Shenfeld
Title: RL's Razor: Why On-Policy Reinforcement Learning Forgets Less
Abstract: Comparing fine-tuning with reinforcement learning (RL) against supervised fine-tuning (SFT) reveals that, despite similar performance on a new task, RL consistently forgets less. We find that the degree of forgetting is determined by the distributional shift, namely the KL divergence between the fine-tuned and base policies evaluated on the new task distribution. We discover that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. Our findings are empirically validated with large language models and controlled toy settings. Further, we provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle RL's Razor: among all ways to solve a new task, RL prefers those closest in KL to the original model.
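Illustrative example (not from the talk): a minimal sketch of how one might estimate the quantity the abstract ties to forgetting, the KL divergence between the fine-tuned and base policies on the new task's prompt distribution. The model names are placeholders.

```python
# Hedged sketch: token-level KL(fine-tuned || base) averaged over new-task prompts.
# "base-model" and "fine-tuned-model" are placeholder names, not real checkpoints.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base-model")         # placeholder
tuned = AutoModelForCausalLM.from_pretrained("fine-tuned-model")  # placeholder
tok = AutoTokenizer.from_pretrained("base-model")

def mean_token_kl(prompts):
    kls = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            logp_tuned = F.log_softmax(tuned(ids).logits, dim=-1)
            logp_base = F.log_softmax(base(ids).logits, dim=-1)
        # KL(tuned || base) at each position, averaged over tokens
        kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(-1).mean()
        kls.append(kl.item())
    return sum(kls) / len(kls)
```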
TBD
October 20
Add to Calendar
2025-10-20 16:00:00
2025-10-20 17:00:00
America/New_York
ML Tea: Pandemic-Potential Viruses are a Blind Spot for Frontier Open-Source LLMs / Theoretical Guarantees for Learning with Unlabeled Data in Online Classification
Speakers: Laura Luebbert and Jonathan Schafer
Zoom password: 114091
Abstracts:
(1) We study large language models (LLMs) for front-line, pre-diagnostic infectious-disease triage, a critically understudied stage in clinical interventions, public health, and biothreat containment. We focus specifically on the operational decision of classifying symptomatic cases as viral vs. non-viral at first clinical contact, a critical decision point for resource allocation, quarantine strategy, and antibiotic use. We create a benchmark dataset of first-encounter cases in collaboration with multiple healthcare clinics in Nigeria, capturing high-risk viral presentations in low-resource settings with limited data. Our evaluations across frontier open-source LLMs reveal that (1) LLMs underperform standard tabular models and (2) case summaries and Retrieval-Augmented Generation yield only modest gains, suggesting that naïve information enrichment is insufficient in this setting. To address this, we demonstrate how models aligned with Group Relative Policy Optimization and a triage-oriented reward consistently improve baseline performance. Our results highlight persistent failure modes of general-purpose LLMs in pre-diagnostic triage and demonstrate how targeted reward-based alignment can help close this gap.
(2) Practitioners commonly train classifiers using unlabeled data in addition to labeled data, because labeled data is often harder to obtain. However, from a theoretical perspective, the question of whether and how unlabeled data can offer provable benefits in classification tasks is still not fully understood. In this talk, I discuss two recent works on the power of unlabeled data in online learning, a popular mathematical model of supervised learning. We show that (1) for some concept classes, access to unlabeled data can guarantee a quadratic reduction in the number of learner mistakes, and (2) in all cases the reduction can never be more than quadratic. This resolves a problem that remained open for 30 years.
TBD
October 15
Add to Calendar
2025-10-15 16:00:00
2025-10-15 17:00:00
America/New_York
ML Tea: Chain-of-Thought Degrades Abstention in LLMs, Unless Inverted / Context-aware sequence-to-function model of human gene regulation
Speakers: Abinitha Gourabathina and Ekin Deniz Aksu
Bios:
Abinitha is a second-year EECS PhD student in MIT LIDS. She is co-advised by Professors Marzyeh Ghassemi and Collin Stultz. Her research interests lie broadly in trustworthy machine learning, with a particular focus on sensitive domains like healthcare. She completed her B.S.E. in Operations Research and Financial Engineering at Princeton University.
Ekin is a PhD student at the Max Planck Institute for Molecular Genetics in Berlin, working on computational biology to uncover the regulatory code of the human genome. He is an MD and has previously worked on cancer biology and infectious diseases.
Abstracts:
(1) For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Chain-of-Thought (CoT) prompting has gained popularity for improving model performance by ensuring structured outputs that follow a logical sequence. In this paper, we first investigate how current abstention methods perform with CoT outputs, finding that direct use of reasoning traces can degrade the performance of existing abstention methods by more than 5%. As a result, we introduce a new framework for thinking about hallucinations in LLMs not as answering a question incorrectly but instead as answering the wrong question. Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. A low similarity score between the initial and reconstructed queries suggests that the model likely answered the question incorrectly and is flagged to abstain. We perform extensive experiments and find substantial performance gains with our Trace Inversion methods.
(2) Sequence-to-function models have been very successful in predicting gene expression, chromatin accessibility, and epigenetic marks from DNA sequences alone. However, current state-of-the-art models have a fundamental limitation: they cannot extrapolate beyond the cell types and conditions included in their training dataset. Here, we introduce a new approach that is designed to overcome this limitation: Corgi, a new context-aware sequence-to-function model that accurately predicts genome-wide gene expression and epigenetic signals, even in previously unseen cell types. We designed an architecture that strives to emulate the cell: Corgi integrates DNA sequence and trans-regulator expression to predict the coverage of multiple assays, including chromatin accessibility, histone modifications, and gene expression. We define trans-regulators as transcription factors, histone modifiers, transcriptional coactivators, and RNA-binding proteins, which directly modulate chromatin states, gene expression, and mRNA decay. Trained on a diverse set of bulk and single-cell human datasets, Corgi has robust predictive performance, approaching experimental-level accuracy in gene expression predictions in previously unseen cell types, while also setting a new state of the art for joint cross-sequence and cross-cell-type epigenetic track prediction. Corgi can be used in practice to impute many assays, including DNA accessibility and histone ChIP-seq, from RNA-seq data.
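Illustrative example (not from the talk): a minimal sketch of the three Trace Inversion steps described in abstract (1), using an off-the-shelf sentence embedder for the similarity check. The reconstruction prompt, embedding model, threshold, and the `llm` callable are arbitrary choices for illustration.

```python
# Hedged sketch: reconstruct the query implied by a reasoning trace, then abstain if
# it drifts too far from the original query. Not the authors' implementation.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def should_abstain(original_query, reasoning_trace, llm, threshold=0.7):
    # Step 1 (the reasoning trace) is assumed to have been generated already by `llm`.
    # Step 2: reconstruct the most likely query from the trace alone.
    reconstructed = llm(f"What question is this reasoning answering?\n\n{reasoning_trace}")
    # Step 3: compare original and reconstructed queries via embedding similarity.
    sim = util.cos_sim(embedder.encode(original_query), embedder.encode(reconstructed)).item()
    return sim < threshold  # low similarity -> the model likely answered the wrong question
```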
TBD
October 06
Add to Calendar
2025-10-06 16:00:00
2025-10-06 17:00:00
America/New_York
ML Tea: Learning Safe Strategies for Value Maximizing Buyers in Uniform Price Auctions
Speaker: Sourav Sahoo
Title: Learning Safe Strategies for Value Maximizing Buyers in Uniform Price Auctions
Abstract: We study the bidding problem in repeated uniform price multi-unit auctions from the perspective of a value-maximizing buyer. The buyer aims to maximize their cumulative value over T rounds while adhering to per-round return-on-investment (RoI) constraints in a strategic (or adversarial) environment. Using an m-uniform bidding format, the buyer submits m bid-quantity pairs (b_i, q_i) to demand q_i units at bid b_i, with m ≪ M in practice, where M denotes the maximum demand of the buyer. We introduce the notion of safe bidding strategies as those that satisfy the RoI constraints irrespective of competing bids. Despite the stringent requirement, we show that these strategies satisfy a mild no-overbidding condition, depend only on the valuation curve of the bidder, and that the bidder can focus on a finite subset of them without loss of generality. Though the subset size is O(M^m), we design a polynomial-time learning algorithm that achieves sublinear regret, in both the full-information and bandit settings, relative to the hindsight-optimal safe strategy. We then assess the robustness of safe strategies against the hindsight-optimal strategy from a richer class. We define the richness ratio α ∈ (0, 1] as the minimum ratio of the value of the optimal safe strategy to that of the optimal strategy from the richer class, and we construct hard instances showing the tightness of α. Our algorithm achieves α-approximate sublinear regret against these stronger benchmarks. Simulations on semi-synthetic auction data show that empirical richness ratios significantly outperform the theoretical worst-case bounds. The proposed safe strategies and learning algorithm extend naturally to more nuanced buyer and competitor models.
TBD
September 29
Add to Calendar
2025-09-29 16:00:00
2025-09-29 17:00:00
America/New_York
ML Tea: Collapse-Proof Non-Contrastive Self-Supervised Learning / Data Attribution in High Dimensions and without Strong Convexity
Speakers: Emanuele Sansone and Ittai Rubinstein
Bios:
Emanuele Sansone is a Postdoctoral Fellow jointly affiliated with MIT (CSAIL) and KU Leuven (ESAT). His research interests lie at the intersection between unsupervised learning and mathematical logic. His research ambition is to empower machines with the capability to acquire and discover knowledge from data in an autonomous manner. He was recently awarded the Marie Curie Global Fellowship for the program titled “Discovering the World through Unsupervised Statistical Relational Learning”.
Ittai Rubinstein is a third-year PhD student in Computer Science at MIT, advised by Sam Hopkins and supported by the MathWorks EECS Fellowship. His research centers on algorithms, with a focus on data attribution and robust machine learning. Before MIT, he led a research team at Qedma working on quantum error suppression and mitigation. He holds a master’s degree in computer science from Tel Aviv University, and a bachelor’s degree in mathematics, physics, and computer science from the Technion.
Abstracts:
Self-supervised learning (SSL) has unlocked the ability to learn general-purpose representations from vast amounts of unlabeled data. Despite its successes, significant challenges remain, limiting the applicability and democratization of SSL. One key challenge lies in the failure modes that arise during SSL training. In this talk, we distill essential principles for reliably avoiding these known collapses. We introduce a principled yet simplified design of the projector and loss function for non-contrastive SSL, grounded in hyperdimensional computing. Theoretically, we show that this design induces an inductive bias that naturally encourages representations to become both decorrelated and clustered, without explicitly enforcing these properties. This bias provably improves generalization and is sufficient to prevent common training failures, including representation, dimensional, cluster, and intracluster collapses. We further validate our theoretical insights on image datasets, showing that our approach produces representations that retain richer information about the observed data while avoiding memorization. This opens the door to learning more structured representations.
Data attribution estimates the effect of removing a set of samples from a model's training set without retraining the model from scratch, and it is used for interpretability, credit assignment, privacy, and more. However, key approaches to data attribution significantly underestimate removal effects in the high-dimensional regime (#params ≥ Ω(#samples)), and existing theoretical analyses require strong convexity assumptions that rarely hold in practice, even for simple linear probes. In this talk, we will present a correction to the leading approaches to data attribution that improves accuracy in the high-dimensional regime, and we present the first theoretical guarantees for the accuracy of data attribution without strong convexity.
TBD
September 22
Add to Calendar
2025-09-22 16:00:00
2025-09-22 17:00:00
America/New_York
ML Tea: Bridging machine learning and optimization with computational metabolomics
Speaker: Runzhong Wang
Title: Bridging machine learning and optimization with computational metabolomics
Abstract: (This is a practice job talk.) Solving hard optimization problems has been a longstanding challenge in computer science and beyond. Machine learning-based solvers have emerged as a promising direction, in which problem patterns from certain distributions are captured to enable faster and more accurate problem-solving. We studied the theory and methodology of machine learning solvers for permutation-based combinatorial optimization and demonstrated the superiority of machine learning over existing methods. Going beyond, we transferred these insights to a long-standing scientific problem of a combinatorial nature: inferring molecular structures from liquid chromatography tandem mass spectrometry, a current bottleneck in computational metabolomics. We developed neural networks for in silico fragmentation that surpass existing approaches by a significant margin, achieving 40% accuracy in annotating the exact structure as the top prediction and 92% accuracy in the top 10. We demonstrated the utility of our approach in life science, environmental science, chemistry, and biology through real-world case studies. We expect the continuation of this research will not only enable new capabilities in science but also establish new insights in machine learning research.
TBD
September 15
Add to Calendar
2025-09-15 16:00:00
2025-09-15 17:00:00
America/New_York
ML Tea: Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis / Consensus-Driven Active Model Selection
Speakers: Aruna Sankaranarayanan and Justin Kay
Bios:
Aruna Sankaranarayanan is a PhD student supervised by Prof. Dylan Hadfield-Menell. Her research focuses on understanding and controlling human and model behavior, while improving model-human interactions. Her previous work includes studying how people distinguish deepfake videos from authentic ones, and investigating bias in opaque systems such as social-media advertising algorithms.
Justin Kay is a third-year PhD student at MIT, advised by Sara Beery and supported by fellowships from MIT EECS and NSF. His research focuses on making computer vision and machine learning systems more deployable and informative for science and decision-making, particularly for environmental and climate applications.
Abstracts:
(1) Where should we intervene on the internal activations of a large language model (LM) to control the naturalistic text it generates? Identifying effective steering locations in multi-token output settings is challenging because interventions can have complex, context-dependent effects, and evaluation often relies on costly human judgments or auxiliary models that provide only coarse feedback. To address this, we introduce contrastive causal mediation (CCM), a lightweight procedure for selecting steerable activation points by (1) constructing contrastive responses that succeed or fail at steering, (2) computing differences in generation probabilities, and (3) estimating the causal effect of hidden activations on these differences. We then situate CCM within a principled evaluation framework for representation engineering, which addresses four key desiderata: task-relevant contexts, consideration of model likelihoods, standardized comparisons across behaviors, and baseline methods. Across three models and three task settings (refusal, bias-aware feedback, and style transfer), we conduct over 5,400 experiments to show that CCM identifies effective intervention points under this recommended evaluation strategy. Together, these contributions demonstrate how combining causally grounded mechanistic interpretability with rigorous evaluation enables more effective and trustworthy control of large language models, even in naturalistic settings.
(2) The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset, a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA significantly outperforms existing methods for active model selection, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state of the art. Our contribution is part of a larger research agenda on how best to utilize human effort in the AI development and deployment lifecycle; while much prior research has focused on this question at training time, our work highlights the outsized benefits of emphasizing label efficiency at test time as well.
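Illustrative example (not from the talk): a toy version of consensus-driven label acquisition in the spirit of abstract (2), using majority-vote disagreement to choose the next point to label and Beta posteriors over each model's accuracy. CODA's actual probabilistic framework is richer than this sketch.

```python
# Toy sketch of consensus-driven active model selection (a simplification, not CODA itself).
import numpy as np

def active_model_selection(preds, oracle_label, budget):
    # preds: (n_models, n_points) array of predicted class ids (non-negative ints)
    # oracle_label(i): returns the true label of point i (the expensive annotation step)
    n_models, n_points = preds.shape
    alpha, beta = np.ones(n_models), np.ones(n_models)   # Beta(1, 1) priors on model accuracy
    labeled = set()
    for _ in range(budget):
        # disagreement score: fraction of models deviating from the majority vote
        scores = []
        for i in range(n_points):
            votes = np.bincount(preds[:, i])
            scores.append(1.0 - votes.max() / n_models if i not in labeled else -1.0)
        i = int(np.argmax(scores))            # label the most contested point next
        y = oracle_label(i)
        labeled.add(i)
        correct = preds[:, i] == y            # Bayesian update of each model's accuracy belief
        alpha += correct
        beta += ~correct
    return int(np.argmax(alpha / (alpha + beta)))         # posterior-mean best model
```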
TBD
May 05
Add to Calendar
2025-05-05 16:00:00
2025-05-05 17:00:00
America/New_York
ML Tea: Algorithm Design with Learned Predictions
Speaker: Justin Chen
Abstract: The classic framing of algorithm design goes something like this: I give you a mathematical formulation of a problem by specifying valid inputs and outputs, and you give me an algorithm which, on any input, will produce a valid output using limited resources (e.g., time or memory). This “worst-case” analysis, which guarantees algorithmic correctness and efficiency over all possible inputs, is the foundation of computer science theory. This is for good reason: algorithms designed to perform well in the worst case are reliable, composable, and require no assumptions or prior knowledge of their applications.
In modern applications, algorithms are not run once in a vacuum on an unknown input. Procedures that process huge amounts of data are run hour after hour, day after day, on similar input datasets. These data may contain structure that is hard to formally incorporate into the algorithmic problem description. On the other hand, machine learning excels at extracting patterns that do not necessarily conform to simple-to-state mathematical rules. Learned models make probabilistic predictions of outputs given inputs without a formal problem description or algorithm designer; they tune themselves on training data.
In this talk, I will give an overview of the developing area of "algorithms with predictions" as well as several opportunities and challenges in approaching ML for algorithms.
Bio: Justin Y. Chen is a fifth-year PhD student studying theoretical computer science in the Electrical Engineering and Computer Science department at MIT, where he is advised by Piotr Indyk. He works on problems at the intersection of algorithm design, data analysis, and machine learning.
TBD
April 28
Add to Calendar
2025-04-28 16:00:00
2025-04-28 17:00:00
America/New_York
ML Tea: Evaluating Multiple Models Using Labeled and Unlabeled Data
Speaker: Shuvom Sadhuka
Abstract: It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground-truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground-truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than competing methods, reducing error by 5.1× relative to using labeled data alone and 2.4× relative to the next-best competing method. SSME also improves accuracy when evaluating performance across subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating the performance of language models.
Bio: Shuvom Sadhuka is a third-year PhD student in EECS, advised by Bonnie Berger. His research interests center on evaluation and uncertainty quantification, often with applications to biomedical data. In particular, he is interested in how to conduct evaluations of machine learning systems (both the data and the models) along critical axes such as privacy and calibration in constrained settings (e.g., sparse or noisy labels). His PhD is supported by a Hertz Fellowship and an NSF GRFP. Prior to MIT, he received an AB in Computer Science and Statistics from Harvard.
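Illustrative example (not from the talk): a minimal sketch of the key idea for a single binary classifier, fitting a two-component mixture over classifier scores with EM on labeled and unlabeled data and reading off an accuracy estimate. The Gaussian score model, EM details, and threshold are simplifying assumptions, not the SSME implementation.

```python
# Hedged sketch: estimate the joint distribution of (true label, classifier score) with a
# mixture fit on labeled + unlabeled scores, then compute accuracy from the fitted model.
import numpy as np
from scipy.stats import norm

def fit_mixture(scores_lab, y_lab, scores_unlab, iters=50):
    # One classifier, binary labels (y_lab in {0, 1}); one Gaussian per true class.
    mu = np.array([scores_lab[y_lab == 0].mean(), scores_lab[y_lab == 1].mean()])
    sd = np.array([scores_lab[y_lab == 0].std() + 1e-3, scores_lab[y_lab == 1].std() + 1e-3])
    pi = np.array([np.mean(y_lab == 0), np.mean(y_lab == 1)])
    for _ in range(iters):
        # E-step on unlabeled data: posterior over the (unknown) true label
        lik = np.stack([pi[k] * norm.pdf(scores_unlab, mu[k], sd[k]) for k in (0, 1)], axis=1)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step using hard labels for labeled data and soft labels for unlabeled data
        w = np.concatenate([np.eye(2)[y_lab], resp])
        s = np.concatenate([scores_lab, scores_unlab])
        pi = w.mean(axis=0)
        mu = (w * s[:, None]).sum(axis=0) / w.sum(axis=0)
        sd = np.sqrt((w * (s[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0)) + 1e-3
    return pi, mu, sd

def estimated_accuracy(pi, mu, sd, threshold=0.0):
    # P(score > threshold | class 1) * P(class 1) + P(score <= threshold | class 0) * P(class 0)
    return pi[1] * (1 - norm.cdf(threshold, mu[1], sd[1])) + pi[0] * norm.cdf(threshold, mu[0], sd[0])
```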
TBD
April 23
Add to Calendar
2025-04-23 16:00:00
2025-04-23 17:00:00
America/New_York
ML Tea: Do Large Language Model Benchmarks Test Reliability?
Speakers: Josh Vendrow and Eddie Vendrow
Abstract: When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior.
Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle.
Bios: Josh is a third-year PhD student working with Aleksander Madry. Josh's research focuses on building machine learning models that are safe and robust when deployed in the real world. Eddie is a second-year PhD student advised by Sara Beery and supported by the MIT Presidential Fellowship and an NSF GRFP. Eddie is interested in bringing automation to scientific discovery, including by building systems and agents that can autonomously carry out scientific data collection, data science, and analysis.
TBD
April 14
Add to Calendar
2025-04-14 16:00:00
2025-04-14 17:00:00
America/New_York
ML Tea: Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
Speakers: Tian Jin & Ellie Cheng
Abstract: Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are the PASTA-LANG annotation language and its interpreter: PASTA-LANG allows LLMs to express semantic independence in their own responses, and the interpreter acts on these annotations to orchestrate parallel decoding on the fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction-following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21× to 1.93× with corresponding quality changes of +2.2% to -7.1%, measured in length-controlled win rates.
Bios: Tian Jin is a 5th-year Ph.D. student at MIT, advised by Michael Carbin and Jonathan Ragan-Kelley. His research focuses on machine learning and programming systems. Previously, Tian was a Research Engineer at IBM Research, where he led efforts to enable deep neural network inference on IBM mainframe machines and contributed to compiler support for the IBM Summit Supercomputer. He holds a dual degree in Computer Science and Mathematics from Haverford College.
Ellie is a 3rd-year PhD student at CSAIL, advised by Michael Carbin. Her research interests are at the intersection of programming languages and machine learning.
TBD
April 07
Add to Calendar
2025-04-07 16:00:00
2025-04-07 17:00:00
America/New_York
ML Tea: Activation-Informed Merging of LLMs
Speaker: Kaveh Alimohammadi
Title: Activation-Informed Merging of LLMs
Abstract: Model merging has emerged as an efficient strategy for combining multiple fine-tuned large language models (LLMs) while avoiding the computational overhead of retraining. However, existing methods often overlook the importance of activation-space information in guiding the merging process. In this talk, I will introduce Activation-Informed Merging (AIM), a novel technique that enhances the robustness and performance of merged models by incorporating activation-space insights. AIM is designed as a complementary framework that can be applied to any merging approach, preserving critical weights from the base model through principles drawn from continual learning and model compression. By utilizing a task-agnostic calibration set, AIM selectively prioritizes essential parameters, leading to significant performance improvements across multiple benchmarks, with up to a 40% increase in effectiveness.
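Illustrative example (not from the talk): a toy sketch of the general idea of protecting base-model weights flagged as important by activation statistics while merging fine-tuned models. The importance scores are stubbed out, the merge rule is a plain average, and this is not the AIM procedure itself.

```python
# Hedged sketch: average fine-tuned weights, but anchor to the base model wherever a
# calibration pass marks parameters as important. How importance is computed is the
# method's core contribution and is only passed in as an input here.
import torch

def merge_with_activation_importance(base_sd, ft_sds, importance, keep=0.2):
    # base_sd / ft_sds: parameter state dicts of the base and fine-tuned models
    # importance: per-parameter scores from a task-agnostic calibration set (assumed given)
    # keep: fraction of parameters anchored to the base model
    merged = {}
    for name, w_base in base_sd.items():
        w_avg = torch.stack([sd[name] for sd in ft_sds]).mean(dim=0)   # plain weight average
        imp = importance[name]
        cutoff = torch.quantile(imp.float().flatten(), 1.0 - keep)
        anchor = imp >= cutoff                                         # most important entries
        merged[name] = torch.where(anchor, w_base, w_avg)              # keep base weights there
    return merged
```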
TBD
March 17
Add to Calendar
2025-03-17 16:00:00
2025-03-17 16:45:00
America/New_York
ML Tea: Aggregating fMRI datasets for training brain-optimized models of human vision
Speaker: Benjamin Lahner
Title: Aggregating fMRI datasets for training brain-optimized models of human vision
Abstract: Large-scale fMRI datasets are revolutionizing our understanding of the neural processes underlying human perception, driving new breakthroughs in neuroscience and computational modeling. Yet individual fMRI data collection efforts remain constrained by practical limitations on scan time, creating an inherent tradeoff between subjects, stimuli, and stimulus repetitions. This tradeoff often compromises stimulus diversity, data quality, and the generalizability of findings, such that even the largest fMRI datasets cannot fully leverage the power of high-parameter artificial neural network models and high-dimensional feature spaces. To overcome these challenges, we introduce MOSAIC (Meta-Organized Stimuli And fMRI Imaging data for Computational modeling): a scalable framework for aggregating fMRI responses across multiple subjects and datasets. We preprocessed and registered eight event-related fMRI vision datasets (Natural Scenes Dataset, Natural Object Dataset, BOLD Moments Dataset, BOLD5000, Human Actions Dataset, Deeprecon, Generic Object Decoding, and THINGS) to the fsLR32k cortical surface space with fMRIPrep to obtain 430,007 fMRI-stimulus pairs over 93 subjects and 162,839 unique stimuli. We estimated single-trial beta values with GLMsingle (Prince et al., 2022), obtaining parameter estimates of similar or higher quality than the originally published datasets. Critically, we curated the dataset by eliminating stimuli with perceptual similarity above a defined threshold to prevent test-train leakage. This rigorous pipeline resulted in a well-defined stimulus-response dataset with 144,360 training stimuli, 18,145 test stimuli, and 334 synthetic stimuli, well suited for building and evaluating robust models of human vision. We show preliminary results using MOSAIC to investigate how the internal representations of brain-optimized neural networks differ from those of task-optimized neural networks, and we perform a large-scale decoding analysis that highlights the importance of stimulus set diversity. This framework empowers the vision science community to collaboratively generate a scalable, generalizable foundation for studying human vision.
Bio: Ben Lahner is a PhD candidate in computational neuroscience working with Dr. Aude Oliva. His research combines fMRI data with machine learning and deep learning techniques to better understand facets of the human visual system. His previous work has investigated visual memory, action understanding, and video decoding from brain activity patterns.
TBD
March 10
Add to Calendar
2025-03-10 16:00:00
2025-03-10 17:00:00
America/New_York
ML Tea: Unsupervised Discovery of Interpretable Structure in Complex Systems
Speaker: Mark Hamilton
Abstract: How does the human mind make sense of raw information without being taught how to see or hear? In this talk we will explore how to build algorithms that can uncover interpretable structure from large collections of unsupervised data like images and video. First, I will describe how to classify every pixel of a collection of images without any human annotations (unsupervised semantic segmentation) by distilling self-supervised vision models. Second, we'll see how this basic idea leads us to a new unifying theory of representation learning, and I will show how 20 different common machine learning methods such as dimensionality reduction, clustering, contrastive learning, and spectral methods emerge from a single unified equation. Finally, we'll use this unified theory to create algorithms that can decode natural language just by watching unlabeled videos of people talking, without any knowledge of text. This work is the first step in our broader effort to translate animals using large-scale, unsupervised, and interpretable learners, and the talk will conclude with some of our most recent efforts to analyze the complex vocalizations of Atlantic spotted dolphins.
Bio: Mark Hamilton is a PhD student in William T. Freeman's lab at the MIT Computer Science & Artificial Intelligence Laboratory. He is also a Senior Engineering Manager at Microsoft, where he leads a team building large-scale distributed ML products for Microsoft's largest databases. Mark is interested in how we can use unsupervised machine learning to discover scientific "structure" in complex systems. Mark values working on projects for social, cultural, and environmental good and aims to use his algorithms to help humans solve challenges they cannot solve alone.
TBD
March 03
Add to Calendar
2025-03-03 16:00:00
2025-03-03 17:00:00
America/New_York
ML Tea: Learning Generative Models from Corrupted Data
Speaker: Giannis Daras
Abstract: In scientific applications, generative models are used to regularize solutions to inverse problems. The quality of the models depends on the quality of the data on which they are trained. While natural images are abundant, in scientific applications access to high-quality data is scarce, expensive, or even impossible. For example, in MRI the quality of the scan is proportional to the time spent in the scanner, and in black-hole imaging we can only access lossy measurements. Contrary to high-quality data, noisy samples are generally more accessible. If we had a method to transform noisy points into clean ones, e.g., by sampling from the posterior, we could address these challenges. A standard approach would be to use a pre-trained generative model as a prior. But how can we train these priors in the first place without having access to clean data? We show that one can escape this chicken-and-egg problem using diffusion-based algorithms that account for the corruption at training time. We present the first algorithm that provably recovers the distribution given only noisy samples of a fixed variance. We extend our algorithm to account for heterogeneous data where each training sample has a different noise level. The underlying mathematical tools can be generalized to linear measurements, with the potential of accelerating MRI. Our method has deep connections to the literature on learning supervised models from corrupted data, such as SURE and Noise2X. Our framework opens exciting possibilities for generative modeling in data-constrained scientific applications. We are actively working on applying this to denoise proteins, and we present some first results in this direction.
Bio: Giannis Daras is a postdoctoral researcher at MIT working closely with Prof. Costis Daskalakis and Prof. Antonio Torralba. Prior to MIT, Giannis completed his Ph.D. at UT Austin under the supervision of Prof. Alexandros G. Dimakis. Giannis is interested in generative modeling and the applications of generative models to inverse problems. A key aspect of his work involves developing algorithms for learning generative models from noisy data. His research has broad implications across various fields, including scientific applications, privacy and copyright concerns, and advancing data-efficient learning techniques.
TBD
February 24
Add to Calendar
2025-02-24 16:00:00
2025-02-24 17:00:00
America/New_York
ML Tea: Score-of-Mixture Training: One-Step Generative Model Training via Score Estimation of Mixture Distributions
Abstract: We propose Score-of-Mixture Training (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the α-skew Jensen–Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch (SMT) and distillation using a pretrained diffusion model, which we call Score-of-Mixture Distillation (SMD). It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64×64 show that SMT/SMD are competitive with and can even outperform existing methods.
Bio: Tejas is a final-year PhD student in the Signals, Information and Algorithms Lab, advised by Professor Gregory Wornell. His research interests are centered around statistical inference, information theory, and generative modeling, with a recent focus on fundamental and applied aspects of score estimation and diffusion-based generative models. During his PhD, Tejas has interned at Meta AI, Google Research, Adobe Research, and Mitsubishi Electric Research Labs. He is currently a recipient of the MIT Claude E. Shannon Fellowship.
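For reference, one common way of writing an α-skew Jensen–Shannon divergence between the data distribution p and the model distribution q is shown below; the paper's exact parameterization may differ, but the central object is the mixture distribution whose score SMT estimates.

```latex
% One common convention for the alpha-skew Jensen-Shannon divergence (an assumption;
% the paper's parameterization may differ).
\[
  \mathrm{JS}_{\alpha}(p \,\|\, q)
    = \alpha \, \mathrm{KL}\!\left(p \,\|\, m_{\alpha}\right)
    + (1-\alpha) \, \mathrm{KL}\!\left(q \,\|\, m_{\alpha}\right),
  \qquad
  m_{\alpha} = \alpha\, p + (1-\alpha)\, q .
\]
```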
TBD
February 19
Add to Calendar
2025-02-19 16:00:00
2025-02-19 17:00:00
America/New_York
ML Tea: Theoretical Perspectives on Data Quality and Selection
Abstract: Though it has always been understood that data quality directly affects the quality of our predictions, the large-scale data requirements of modern machine learning tasks have brought to the fore the need to develop a richer vocabulary for understanding the quality of collected data for the prediction tasks of interest, and the need to develop algorithms that use collected data most effectively. Though this has been studied in various contexts such as distribution shift, multitask learning, and sequential decision making, there remains a need to develop techniques that address the problems faced in practice. Towards the aim of starting a dialogue between the practical and theoretical perspectives on these important problems, I will survey some recent techniques developed in TCS and statistics addressing data quality and selection.
Bio: Abhishek Shetty is an incoming Catherine M. and James E. Allchin Early-Career Assistant Professor in the School of Computer Science at Georgia Tech and is currently a FODSI Postdoctoral Fellow at MIT, hosted by Sasha Rakhlin, Ankur Moitra, and Costis Daskalakis. He graduated from the department of EECS at UC Berkeley, advised by Nika Haghtalab. His interests lie at the intersection of machine learning, theoretical computer science, and statistics, and are aimed at developing statistically and computationally efficient algorithms for inference. His research has been recognized with the Apple AI/ML Fellowship and the American Statistical Association SCGS best student paper award.
TBD