[Scale ML] Guangxuan Xiao: StreamingLLM and DuoAttention: Efficient and Effective Long Sequence Modeling for Large Language Models
Speaker:
Guangxuan Xiao
Host:
Scale ML
Title:
StreamingLLM and DuoAttention: Efficient and Effective Long Sequence Modeling for Large Language Models
Time + Location:
3pm Wednesday (Nov 20th)
45-792 (the main conference room on the 7th floor)
Zoom: https://mit.zoom.us/j/91697262920 (password: mitmlscale)
Abstract:
Efficient deployment of Large Language Models (LLMs) for long-context applications, such as multi-turn dialogue and document processing, is critical yet challenging. Two primary hurdles are the high memory consumption of caching Key and Value (KV) states during decoding and the inability of LLMs to generalize to sequences longer than those seen during training. In this talk, I will introduce StreamingLLM, a framework that enables LLMs trained with finite-length attention windows to generalize to virtually infinite sequence lengths without any fine-tuning. By uncovering the phenomenon of attention sinks, StreamingLLM achieves significant memory and computational savings while supporting sequence lengths of up to 4 million tokens. StreamingLLM provides up to 22.2x speedup over sliding-window recomputation baselines, demonstrating stable and efficient language modeling at extreme sequence lengths.
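To make the attention-sink idea concrete, here is a minimal sketch (not the released StreamingLLM code) of the cache policy it implies: keep the first few "sink" tokens plus a rolling window of the most recent tokens, so the KV cache stays constant-size no matter how long the stream runs. The parameter names `n_sink` and `window` are illustrative assumptions.

```python
# Sketch of a StreamingLLM-style KV eviction policy (illustrative only).
def kept_kv_indices(cache_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Return indices of KV entries to keep for a cache currently holding `cache_len` tokens."""
    if cache_len <= n_sink + window:
        return list(range(cache_len))                         # nothing to evict yet
    sinks = list(range(n_sink))                               # always keep the initial sink tokens
    recent = list(range(cache_len - window, cache_len))       # keep the most recent tokens
    return sinks + recent

print(kept_kv_indices(8))           # short stream: [0, 1, 2, 3, 4, 5, 6, 7]
print(len(kept_kv_indices(50_000))) # long stream: constant 1024 cached entries
```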
However, while StreamingLLM excels at "non-stop chatting" tasks, it cannot retain evicted context, which limits its ability to preserve long-context information. To address this, I will present DuoAttention, a complementary framework that distinguishes between retrieval heads, which require full attention across all tokens, and streaming heads, which attend mainly to local tokens and attention sinks. By applying full KV caching to retrieval heads and lightweight, constant-length caching to streaming heads, DuoAttention preserves long-context abilities, enabling true long-context modeling for LLMs. DuoAttention achieves up to 2.55x memory reduction and 2.18x decoding speedup for Multi-Head Attention (MHA) models. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with a 3.3-million-token context length on a single A100 GPU.
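The following sketch (again illustrative, not the released DuoAttention code) shows the per-head caching split: retrieval heads keep the full KV cache, while streaming heads fall back to a constant-length sink-plus-window cache. The head labels and the `n_sink`/`window` values are assumptions for the example; in the paper, heads are identified by an optimization-based procedure.

```python
# Sketch of DuoAttention-style per-head KV caching (illustrative only).
def kv_indices_for_head(is_retrieval: bool, cache_len: int,
                        n_sink: int = 4, window: int = 256) -> list[int]:
    """KV entries a head keeps: everything for retrieval heads,
    a constant-length slice (sinks + recent window) for streaming heads."""
    if is_retrieval or cache_len <= n_sink + window:
        return list(range(cache_len))                         # full attention over all tokens
    return list(range(n_sink)) + list(range(cache_len - window, cache_len))

# Example with hypothetical head labels for a 4-head layer and a 10,000-token context:
head_is_retrieval = [True, False, False, True]
per_head_cache_sizes = [len(kv_indices_for_head(r, cache_len=10_000)) for r in head_is_retrieval]
print(per_head_cache_sizes)  # retrieval heads keep 10000 entries; streaming heads keep 260
```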
Supplementary Materials
* StreamingLLM Paper: https://arxiv.org/abs/2309.17453
* StreamingLLM Code: https://github.com/mit-han-lab/streaming-llm
* DuoAttention Paper: https://arxiv.org/abs/2410.10819
* DuoAttention Code: https://github.com/mit-han-lab/duo-attention
Speaker Bio:
Guangxuan Xiao is a third-year Ph.D. candidate at MIT EECS, advised by Prof. Song Han. He focuses on creating efficient algorithms for deep learning, especially for large language models (LLMs). His work has earned widespread attention, receiving over 9,000 GitHub stars and making a tangible impact on industry practices. His key contributions, including SmoothQuant and StreamingLLM, have been widely adopted and integrated into platforms such as NVIDIA's TensorRT-LLM, HuggingFace, and Intel's Neural Compressor.