THESIS DEFENSE: Advancing Equity & Reliability in Machine Learning

Speaker

Divya Shanmugam

Abstract

The data we collect is often not the data we wish we had. Healthcare data reflects patterns of underdiagnosis, demographic data is shaped by evolving social norms, and benchmark data can be unrepresentative of deployment settings. In domains where flawed data is common, these systematic differences present a barrier to the widespread adoption of machine learning. In this talk, we aim to characterize and mitigate the impact of imperfect data on machine learning models. We address three ways in which data can be flawed: imperfect labels, coarse demographics, and limited evaluation datasets. First, we develop a method to correct for imperfect labels that arise from differences in underdiagnosis between demographic cohorts. We then show how coarse race data obscures disparities across more granular race groups, suggesting that existing algorithmic audits may significantly underestimate racial disparities in performance. Finally, we present a method to select among multiple machine learning models when labeled data is scarce. In sum, we discuss work that represents a step towards a machine learning methodology that is robust to systematic errors in data collection across domains.

Zoom link: https://mit.zoom.us/j/92314938542