Esha Choukse: At-scale Cross-Stack AI Inference Optimizations: From Request Scheduling to Datacenter Cooling
Abstract: With the ubiquitous use cases of modern LLMs, these models are being deployed at an unprecedented scale. This has driven a large-scale expansion of GPU datacenters, which is now running into an energy wall worldwide. This talk will focus on properties of generative LLMs that can be exploited to make their deployment more power efficient. The talk will also introduce POLCA, Splitwise, and DynamoLLM: three techniques to reduce the power consumption of LLM serving.
Bio:
Esha Choukse is a Principal Researcher at Microsoft in the Azure Research – Systems group. Her current research focus is on efficient AI in the cloud, spanning the layers of AI platforms, hardware, and datacenter design and provisioning. In the past, Esha has also worked on sustainability, memory systems, and compression. Esha has a PhD from the University of Texas at Austin and has published several papers at ISCA, ASPLOS, MICRO, HPCA, SC, and NSDI.
https://www.microsoft.com/en-us/research/people/eschouks/
This event will be in person and also broadcast over Zoom.