Removing Biases from Molecular Representations via Information Maximization

Speaker

Chenyu Wang

EECS MIT

Host

Thien Le

CSAIL MIT

Abstract: High-throughput drug screening – using cell imaging or gene expression measurements as readouts of drug effect – is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweighs samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE’s superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show results for how InfoCORE offers a versatile framework and resolves general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes. The code is available at https://github.com/uhlerlab/InfoCORE.

Bio: I am a second-year PhD student at MIT EECS, advised by Tommi Jaakkola and Caroline Uhler. I am also affiliated with the Eric and Wendy Schmidt Center (EWSC) at the Broad Institute. My research interests lie broadly in machine learning, representation learning, and AI for science. Recently, my research has focused on multi-modal representation learning and perturbation modelling for drug discovery. Before my PhD, I obtained my Bachelor’s degree from Tsinghua University.

Date

April 11, 2024, 16:00 to 16:30 (America/New_York)

Location

Room 32-370

Organizer & Contact

Thien Le

thienle@csail.mit.edu

Part of

ML Tea
