[Scale ML + MLSys Reading Group] Hymba: A Hybrid-head Architecture for Small Language Models
Speaker: Xin Dong
Topic: Hymba: A Hybrid-head Architecture for Small Language Models
Date: Wednesday, Jan 22
Time: 4:00 PM (EST)
Zoom: https://mit.zoom.us/j/91697262920 (password: mitmlscale)
Abstract
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. The model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: our Hymba-1.5B-Base model surpasses all sub-2B public models and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x higher throughput.
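To make the hybrid-head idea concrete, below is a minimal, illustrative PyTorch sketch of a block in which attention heads and SSM heads process the same tokens in parallel, their normalized outputs are fused, and learnable meta tokens are prepended to the prompt. All module names, dimensions, the averaging fusion, and the toy diagonal SSM are assumptions for illustration only and do not reproduce Hymba's actual implementation (causal masking and the KV-cache optimizations are omitted for brevity).

```python
import torch
import torch.nn as nn


class SimpleDiagonalSSM(nn.Module):
    """Toy diagonal linear state-space scan: h_t = a * h_{t-1} + b * x_t (stand-in for a real SSM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))  # per-channel decay
        self.b = nn.Parameter(torch.ones(dim))
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        a = torch.sigmoid(self.log_a)  # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(x[:, 0])
        states = []
        for t in range(x.size(1)):
            h = a * h + self.b * x[:, t]
            states.append(h)
        return self.out(torch.stack(states, dim=1))


class HybridHeadBlock(nn.Module):
    """Attention heads and SSM heads run in parallel on the same tokens, then are fused."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = SimpleDiagonalSSM(dim)
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ssm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)  # high-resolution recall
        ssm_out = self.ssm(x)                                  # efficient summarization
        # Averaging the normalized head outputs is one simple fusion choice (an assumption here).
        return x + 0.5 * (self.norm_attn(attn_out) + self.norm_ssm(ssm_out))


class TinyHybridLM(nn.Module):
    """Prepends learnable meta tokens to the prompt before the hybrid block."""

    def __init__(self, vocab: int = 1000, dim: int = 64, n_meta: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.meta_tokens = nn.Parameter(torch.randn(1, n_meta, dim) * 0.02)
        self.block = HybridHeadBlock(dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, T)
        x = self.embed(tokens)
        meta = self.meta_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([meta, x], dim=1)          # meta tokens come first
        x = self.block(x)
        return self.head(x[:, meta.size(1):])    # drop meta positions from the logits


if __name__ == "__main__":
    model = TinyHybridLM()
    logits = model(torch.randint(0, 1000, (2, 16)))
    print(logits.shape)  # torch.Size([2, 16, 1000])
```

The sketch only illustrates the parallel (rather than sequentially stacked) arrangement of attention and SSM heads and the role of meta tokens as always-available positions to attend to; the talk covers the actual architecture and its cache optimizations.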
Bio
Xin Dong is a research scientist at NVIDIA Research, interested in designing accurate, efficient, and trustworthy systems for LLMs and foundation models. He received his PhD in Computer Science from Harvard University in 2023, advised by Professor H. T. Kung.