ML Tea: Domain-Aware Scaling Laws Uncover Data Synergy / Ambient Diffusion Omni: Training Good Models with Bad Data

Speakers: Kimia Hamidieh and Adrián Rodríguez-Muñoz

Bios:

1 - Kimia Hamidieh is a PhD student at MIT. Her research focuses on data-centric approaches to responsible AI, with an emphasis on understanding how data composition shapes model capabilities. Her work has appeared at venues including ICLR, NeurIPS, AIES, and FAccT.

2 - Adrián Rodríguez-Muñoz is a fourth-year graduate student at MIT EECS under the supervision of Prof. Antonio Torralba. His research focuses on learning to use all data effectively, including low-quality and out-of-distribution data in generative models and even procedurally generated data in vision models, with direct applications to data-constrained domains such as science.

Abstracts:

1 - Machine learning progress is often attributed to scaling model size and dataset volume, yet the composition of data can be just as consequential. Empirical findings repeatedly show that combining datasets from different domains yields nontrivial interactions: adding code improves mathematical reasoning, while certain mixtures introduce interference that suppresses performance. We refer to these effects collectively as data synergy: interaction effects whereby the joint contribution of multiple domains exceeds (positive synergy) or falls short of (interference) the sum of their isolated contributions. In this work, we formalize and quantify dataset interactions in large language models. Leveraging observational variation across open-weight LLMs with diverse pretraining mixtures, we estimate both direct domain-to-benchmark synergy (how one domain contributes to performance on another) and pretraining data synergy (capabilities that require the co-occurrence of multiple domains). Our framework improves predictive accuracy over domain-agnostic scaling laws, recovers stable synergy patterns such as math–code complementarity, and yields interpretable maps of cross-domain transfer. These results demonstrate that understanding and exploiting data synergy is essential for designing data mixtures and curating corpora for the next generation of foundation models.
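To make the idea concrete, here is a toy sketch of the kind of regression involved: predict a benchmark score from per-domain (log) token counts plus pairwise interaction terms, where positive interaction coefficients play the role of synergy and negative ones of interference. This is a minimal illustration under assumed inputs; the functional form, variable names, and random placeholder data are illustrative and are not the authors' actual parameterization.

```python
import numpy as np

# Hypothetical setup: D pretraining domains, M open-weight models with known
# per-domain token counts, and one benchmark score per model (placeholder data).
rng = np.random.default_rng(0)
D, M = 4, 40
log_tokens = rng.uniform(18.0, 24.0, size=(M, D))  # log token counts per domain
scores = rng.uniform(0.2, 0.8, size=M)             # benchmark accuracy (random placeholder)

# Design matrix: intercept + direct per-domain terms + pairwise interaction terms.
pairs = [(i, j) for i in range(D) for j in range(i + 1, D)]
X = np.column_stack(
    [np.ones(M)]
    + [log_tokens[:, d] for d in range(D)]
    + [log_tokens[:, i] * log_tokens[:, j] for i, j in pairs]
)

# Least-squares fit; the coefficients on the interaction columns correspond to
# synergy (positive) or interference (negative) between domain pairs.
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
direct, interactions = coef[1:1 + D], coef[1 + D:]
for (i, j), g in zip(pairs, interactions):
    print(f"domains ({i}, {j}): interaction coefficient {g:+.4f}")
```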

2 - The first part of the talk shows how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Diffusion models are typically trained on curated datasets drawn from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework for training diffusion models that can extract signal from all available images during training. The framework exploits two properties of natural images: spectral power law decay and locality. The second part of the talk explores how to iteratively evolve heterogeneous-quality datasets. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. At each iteration of this dataset-model co-evolution process, the dataset becomes progressively higher quality and the model improves accordingly. To avoid destructive self-consuming loops, at each generation we treat the synthetically improved samples as low-quality, but at a slightly higher quality level than the previous iteration, and use Ambient Diffusion techniques for learning under corruption.
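As a rough illustration of the co-evolution loop described above, here is a minimal pseudocode-style sketch. The interfaces train_under_corruption and refine are hypothetical stand-ins for Ambient Diffusion training and model-based sample improvement; the real method operates on image datasets with diffusion-specific corruption and quality levels, which are not modeled here.

```python
def ambient_dataloop(dataset, n_rounds, quality_step, train_under_corruption, refine):
    """dataset: list of (sample, quality) pairs with quality in [0, 1]."""
    model = None
    for _ in range(n_rounds):
        # Train while treating each sample at its annotated quality level,
        # i.e. learning under corruption rather than assuming all data is clean.
        model = train_under_corruption(dataset)
        # Improve samples with the current model, then re-annotate them as only
        # slightly higher quality than before (never as fully clean) to avoid a
        # destructive self-consuming loop.
        dataset = [(refine(model, x, q), min(q + quality_step, 1.0))
                   for x, q in dataset]
    return model, dataset


# Toy usage with placeholder callables; no real training or refinement happens here.
model, data = ambient_dataloop(
    dataset=[("img_0", 0.3), ("img_1", 0.7)],
    n_rounds=3,
    quality_step=0.1,
    train_under_corruption=lambda ds: "toy_model",
    refine=lambda m, x, q: x,
)
print(data)  # quality annotations have increased by 0.3 total, capped at 1.0
```

The design choice the sketch tries to capture is that refined samples are never relabeled as fully clean; their quality annotation only moves up by a small step each round.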