Advancing protein sequence analysis with protein language models

Speaker

Mona Singh, Princeton University

Host

CSAIL, Mathematics

Abstract

Protein language models (PLMs) have emerged as transformative tools for understanding and interpreting protein sequences, enabling advances in structure prediction, functional annotation, and variant effect assessment from sequence alone. Yet realizing their full potential requires both algorithmic innovation and a deeper understanding of their capabilities and limitations. In this talk, I will present several recent developments that advance PLM-based protein sequence analysis along these dimensions. First, I will introduce Bag-of-Mer (BoM) pooling, a biologically inspired strategy for aggregating amino acid embeddings that can capture both local motifs and long-range interactions, improving performance on diverse tasks such as protein activity prediction, remote homology detection, and peptide–protein interaction prediction. Next, I will describe ARIES, a highly scalable multiple-sequence alignment algorithm that leverages PLM embeddings to achieve superior accuracy even in low-identity regions where traditional methods struggle. Finally, time permitting, I will discuss insights into PLM performance, including the roles of training data, sequence fit, and model architecture. Together, this work illustrates how PLMs can both power and reshape core computational biology tasks, while providing guidance for more effective and biologically grounded model development.

Speaker Bio

Mona Singh is the Wang Family Professor in Computer Science at Princeton University, where she is jointly appointed in the Computer Science department and the Lewis-Sigler Institute for Integrative Genomics. Mona obtained her AB and SM degrees at Harvard University and her PhD at MIT, all three in Computer Science. She did postdoctoral work at the Whitehead Institute for Biomedical Research. She received the Presidential Early Career Award for Scientists and Engineers (PECASE). She is a Fellow of the International Society for Computational Biology, a Fellow of the Association for Computing Machinery, and a Fellow of the American Institute for Medical and Biological Engineering. She is currently Editor-in-Chief of the Journal of Computational Biology. She has been program committee chair for several major computational biology conferences, including ISMB (2010), WABI (2010), ACM-BCB (2012), and RECOMB (2016). She has served as Chair of the NIH Modeling and Analysis of Biological Systems Study Section (2012-2014) and as a council member of the Computing Community Consortium (2021-2024), and she is currently on the steering committee for WABI.