[Scale ML] Heyi Tang: Design and Optimization of Large-Scale Inference Systems of kimi.ai

Speaker: Heyi Tang
Host: Scale ML

Date: Wednesday, March 12
Time: 7:15 PM - 8:15 PM
Location: 45-792 (the main conference room on 7th floor) 
Zoom: https://mit.zoom.us/j/91697262920 (password: mitmlscale) 

Title: Design and Optimization of Large-Scale Inference Systems of kimi.ai 

Abstract: How can we efficiently scale large model inference in production? In this talk, we will start with a fundamental overview of LLM inference systems and the key challenges that emerge at scale. Traditional profiling and optimization techniques often fall short because they overlook the impact of token sequence length. To address this, we have developed a novel token-length-aware profiling method, which reveals significant inefficiencies when short and long sequences are mixed. We will explore these slowdown issues in depth and present our solutions for mitigating them. Finally, we will discuss additional challenges we've encountered and highlight open questions, inviting further research and collaboration.

Bio: Heyi Tang is a Senior Engineer at Moonshot AI, specializing in large-scale inference systems. As part of the architecture team, he contributes to building the infrastructure behind kimi.ai, enabling it to process trillions of tokens daily. With a Ph.D. in Computer Networking, Heyi is interested in scaling AI systems and ensuring their seamless integration into production environments.