Towards Interpretable and Operationalized Fairness in Machine Learning
Thesis advisor: Lalana Kagal
Thesis committee: Peter Szolovits, Brian Hedden
Abstract
Machine learning systems are increasingly deployed in sensitive, real-world settings, yet persistent biases in model predictions continue to disadvantage marginalized groups. This thesis develops practical and interpretable methods for understanding and mitigating such biases in natural language generation and computer vision.
For large language models, we introduce a decoding-time approach that leverages small biased and anti-biased expert models to obtain a debiasing signal that is added to the LLM output. This approach combines computational efficiency (fine-tuning a small expert model rather than re-training a large model) with interpretability (the probability shift introduced by debiasing can be examined directly).
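As a rough sketch of how such a decoding-time correction can work (the exact formulation in the thesis may differ), the debiasing signal can be taken as the difference between the anti-biased and biased experts' logits, scaled by a strength parameter; the function name debiased_logits, the parameter alpha, and the use of PyTorch are illustrative assumptions.

```python
import torch

def debiased_logits(lm_logits, biased_logits, antibiased_logits, alpha=1.0):
    """Add a debiasing signal from two small expert models to the base LM logits.

    Assumed formulation: signal = anti-biased expert logits minus biased expert
    logits, scaled by a strength parameter alpha.
    """
    signal = antibiased_logits - biased_logits
    return lm_logits + alpha * signal

# Toy example over a 5-token vocabulary.
vocab_size = 5
lm = torch.randn(vocab_size)
biased = torch.randn(vocab_size)
antibiased = torch.randn(vocab_size)

adjusted = debiased_logits(lm, biased, antibiased, alpha=2.0)

# The per-token probability shift induced by debiasing can be inspected
# directly, which is the interpretability benefit mentioned above.
shift = torch.softmax(adjusted, dim=-1) - torch.softmax(lm, dim=-1)
print(shift)
```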
In computer vision, we leverage concept bottleneck models (CBMs), which map images to human-understandable concepts, to improve transparency and to mask proxy features that correlate with sensitive attributes. To counter information leakage in CBMs and improve the fairness-performance tradeoff, we introduce three mitigation strategies: (1) reducing leakage with a top-k concept filter, (2) removing concepts that correlate strongly with gender, and (3) applying adversarial debiasing to further suppress sensitive information.

Together, these contributions illustrate how interpretability and operationalization can make fairness interventions more trustworthy, scalable, and aligned with real deployment needs.
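A minimal sketch of how the first two mitigation strategies might be expressed, assuming concept scores arrive as a tensor of shape (batch, num_concepts) and a binary sensitive attribute is available per example; the function names, the correlation threshold, and the choice of k are illustrative assumptions, not the thesis implementation.

```python
import torch

def topk_concept_filter(concept_scores, k):
    """Keep only the k highest-scoring concepts per example, zeroing the rest
    to limit information leakage through low-activation concepts."""
    topk = torch.topk(concept_scores, k, dim=-1)
    mask = torch.zeros_like(concept_scores)
    mask.scatter_(-1, topk.indices, 1.0)
    return concept_scores * mask

def drop_correlated_concepts(concept_scores, sensitive, threshold=0.5):
    """Zero out concepts whose Pearson correlation with a sensitive attribute
    (e.g. gender) exceeds the given threshold in absolute value."""
    c = concept_scores - concept_scores.mean(dim=0)
    s = sensitive - sensitive.mean()
    corr = (c * s.unsqueeze(-1)).sum(dim=0) / (c.norm(dim=0) * s.norm() + 1e-8)
    keep = (corr.abs() < threshold).float()
    return concept_scores * keep

# Toy usage: 8 images, 6 concepts, binary sensitive attribute.
scores = torch.rand(8, 6)
gender = torch.randint(0, 2, (8,)).float()
filtered = drop_correlated_concepts(topk_concept_filter(scores, k=3), gender)
print(filtered)
```

In the adversarial variant (strategy 3), the concept-to-label predictor would additionally be trained against an adversary that tries to recover the sensitive attribute from the retained concepts.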