ML Tea: Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis / Consensus-Driven Active Model Selection

Speakers: Aruna Sankaranarayanan and Justin Kay

Bios:

Aruna Sankaranarayanan is a PhD student supervised by Prof. Dylan Hadfield-Menell. Her research focuses on understanding and controlling human and model behavior, and on improving model-human interactions. Her previous work includes studying how people distinguish deepfake videos from authentic ones and investigating bias in opaque systems such as social-media advertising algorithms.

Justin Kay is a third-year PhD student at MIT, advised by Sara Beery and supported by fellowships from MIT EECS and NSF. His research focuses on making computer vision and machine learning systems more deployable and informative for science and decision-making, particularly for environmental and climate applications. 


Abstracts:

Where should we intervene on the internal activations of a large language model (LM) to control the naturalistic text it generates? Identifying effective steering locations in multi-token output settings is challenging because interventions can have complex, context-dependent effects, and evaluation often relies on costly human judgments or auxiliary models that provide only coarse feedback. To address this, we introduce contrastive causal mediation (CCM), a lightweight procedure for selecting steerable activation points by (1) constructing contrastive responses that succeed or fail in steering, (2) computing differences in generation probabilities, and (3) estimating the causal effect of hidden activations on these differences. We then situate CCM within a principled evaluation framework for representation engineering, which addresses four key desiderata: task-relevant contexts, consideration of model likelihoods, standardized comparisons across behaviors, and baseline methods. Across 3 models and 3 task settings (refusal, bias-aware feedback, and style transfer), we conduct over 5,400 experiments showing that CCM identifies effective intervention points under this recommended evaluation strategy. Together, these contributions demonstrate how combining causally grounded mechanistic interpretability with rigorous evaluation enables more effective and trustworthy control of large language models, even in naturalistic settings.
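For intuition, here is a minimal sketch of how the three-step CCM procedure could be wired up against a HuggingFace-style causal LM. The model choice (GPT-2), the helper names (response_logprob, contrastive_gap, layer_effect), the steering-vector intervention, and the scaling factor alpha are illustrative assumptions rather than the speakers' implementation, and tokenization boundaries are handled naively for brevity.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def response_logprob(prompt: str, response: str) -> torch.Tensor:
    # Sum of token log-probabilities of `response` given `prompt`.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum()  # response tokens only

def contrastive_gap(prompt, steered_resp, unsteered_resp):
    # Step 2: difference in generation probability between a response that
    # exhibits the target behavior and one that does not.
    return response_logprob(prompt, steered_resp) - response_logprob(prompt, unsteered_resp)

def layer_effect(prompt, steered_resp, unsteered_resp, layer_idx, direction, alpha=4.0):
    # Step 3: estimate the causal effect of one candidate location by
    # intervening on its hidden activations (here, adding a steering vector)
    # and measuring how the contrastive gap from step 2 changes.
    block = model.transformer.h[layer_idx]  # GPT-2 block; the module path is model-specific

    def hook(module, inputs, output):
        hidden = output[0] + alpha * direction  # broadcasts over batch and positions
        return (hidden,) + output[1:]

    baseline = contrastive_gap(prompt, steered_resp, unsteered_resp)
    handle = block.register_forward_hook(hook)
    try:
        intervened = contrastive_gap(prompt, steered_resp, unsteered_resp)
    finally:
        handle.remove()
    return (intervened - baseline).item()

Scoring layer_effect across candidate layers then yields a ranking of intervention locations; how the contrastive pairs are constructed and how the causal effects are estimated at scale is, of course, specific to the paper.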

The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset -- a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA significantly outperforms existing methods for active model selection, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state of the art. Our contribution is part of a broader research goal: how best to utilize human effort in the AI development and deployment lifecycle. While much prior research has focused on this question at training time, our work highlights the outsized benefits of emphasizing label efficiency at test time as well.
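As a rough illustration, the sketch below shows a deliberately simplified disagreement-driven selection loop with a Bayesian accuracy posterior per candidate. It collapses CODA's joint modeling of classifiers, categories, and data points into a single Beta-distributed accuracy per model, and the names (disagreement, active_model_selection, oracle_label) are hypothetical, not the authors' API.

import numpy as np

def disagreement(preds):
    # Fraction of candidate pairs that disagree on this point (higher = more informative).
    m = len(preds)
    pairs = m * (m - 1) / 2
    return sum(preds[i] != preds[j] for i in range(m) for j in range(i + 1, m)) / pairs

def active_model_selection(candidate_preds, oracle_label, budget):
    # candidate_preds: (num_models, num_points) array of predicted class ids.
    # oracle_label:    callable(point_idx) -> true class id (the human annotator).
    # budget:          number of test labels we can afford to collect.
    n_models, n_points = candidate_preds.shape
    correct = np.ones(n_models)  # Beta(1, 1) prior over each model's accuracy
    wrong = np.ones(n_models)
    unlabeled = set(range(n_points))

    for _ in range(min(budget, n_points)):
        # Prioritize the unlabeled point on which the candidates disagree most.
        idx = max(unlabeled, key=lambda i: disagreement(candidate_preds[:, i]))
        unlabeled.remove(idx)
        y = oracle_label(idx)
        hits = candidate_preds[:, idx] == y
        correct += hits   # Bayesian update of each model's accuracy posterior
        wrong += ~hits

    posterior_mean = correct / (correct + wrong)
    return int(np.argmax(posterior_mean))  # current belief about the best model

In a real deployment the oracle is a human annotator, and the point of the framework is that beliefs about the best candidate can concentrate long before every test point is labeled -- which is where the reported reductions in annotation effort come from.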