ML Efficiency for Large Models: Faster Transformers, Sparsity, and Beyond

Speaker

Vahab Mirrokni
Google Research

Host

Noah Golowich
MIT
*Updated time -- 2-3pm (previous time was incorrect)*

Abstract: Scaling large models efficiently for faster training and inference is a fundamental challenge. In this talk, we present a number of algorithmic challenges and potential solutions, from theory to practice. First, we discuss data efficiency and model efficiency problems that can be formalized as subset selection problems. For model efficiency, we present sequential attention for feature selection and sparsification [ICLR'23, arXiv]. For data efficiency, we present a sensitivity sampling technique that improves both model quality and efficiency. Furthermore, we discuss the intrinsic quadratic complexity of attention models as well as of token generation. We first discuss HyperAttention, a technique for developing linear-time attention algorithms under mild assumptions [ICLR'24]. We then present PolySketchFormer, a technique that bypasses the hardness results for sub-quadratic attention by applying sketching of polynomial functions [arXiv]. Finally, we show how to address the complexity of token generation via clustering techniques [arXiv].
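As a rough illustration of the quadratic-attention discussion in the abstract, the sketch below contrasts standard softmax attention, which materializes an n x n score matrix, with a degree-2 polynomial-kernel attention whose explicit feature map allows reordering the matrix products so that no n x n matrix is ever formed. This is only a minimal NumPy sketch of the kernel-linearization idea that PolySketchFormer builds on; the actual method additionally sketches the polynomial feature map to keep the feature dimension small, a step omitted here. All function names are illustrative and not taken from the papers.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: builds the full n x n score matrix, O(n^2 * d) time."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def degree2_features(X):
    """Explicit feature map phi with phi(q) . phi(k) = (q . k)^2."""
    n, d = X.shape
    return np.einsum("ni,nj->nij", X, X).reshape(n, d * d)

def poly_attention(Q, K, V):
    """Degree-2 polynomial attention computed without the n x n score matrix.

    By associativity, phi(Q) @ (phi(K).T @ V) costs O(n * d^2 * d_v) instead of
    O(n^2 * d). (PolySketchFormer would replace phi by a low-dimensional sketch.)
    """
    phi_q, phi_k = degree2_features(Q), degree2_features(K)
    numer = phi_q @ (phi_k.T @ V)        # shape (n, d_v)
    denom = phi_q @ phi_k.sum(axis=0)    # shape (n,), row normalization
    return numer / denom[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 256, 8
    Q, K, V = np.abs(rng.standard_normal((3, n, d)))  # abs keeps scores positive
    print(poly_attention(Q, K, V).shape)      # (256, 8)
    print(softmax_attention(Q, K, V).shape)   # (256, 8), for comparison
```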

Bio: Vahab Mirrokni is a Google Fellow and VP of Research at Google New York, where he leads a number of algorithm and optimization research groups, including market algorithms, large-scale graph mining, and large-scale optimization. Previously, he was a Distinguished Scientist and Senior Research Director at Google. He received his PhD from MIT in 2005 and his BSc from Sharif University of Technology in 2001. He joined Google Research in 2008, after research positions at Microsoft Research, MIT, and Amazon.com. He is a co-winner of best paper awards at KDD, ACM EC, SODA, and INFORMS Revenue Management. His research areas include algorithms, ML optimization, and computational economics. Recently, he has been working on algorithmic problems in ML efficiency, online advertising, and LLMs. His publications by year can be found here.