Enhancing Die-Stacked DRAM Resilience at Scale: The Journey from Research to Industry Standard

Speaker

Sudhanva Gurumurthi
AMD

Host

Mengjia Yan
MIT

Talk Abstract

JEDEC High Bandwidth Memory (HBM)™ is a widely used DRAM technology in AI and HPC SoCs due to its performance and energy efficiency benefits. Reliability, Availability, and Serviceability (RAS) are additional requirements for SoCs deployed at scale to ensure that computing is reliable and to minimize disruptions for long-running workloads such as AI training and scientific computing.

However, improving RAS for HBM is challenging due to a combination of architectural limitations and practical considerations for a standardized solution. In this talk, I shall first provide a primer on DRAM RAS and define the problem statement for HBM RAS. I shall then present the research that was carried out in collaboration with the memory industry to develop an improved HBM resilience architecture. I shall present the data and analyses that drove this work, the specific decisions made as we navigated the space of design options, and the rationale for each decision. The outcome of this effort was a new HBM RAS architecture that was adopted in the third-generation HBM standard (HBM3), has been commercialized by DRAM manufacturers, and is now used in GPUs and AI accelerators across the industry. This RAS architecture is also included in the fourth-generation HBM standard (HBM4), which was announced by JEDEC earlier this year.

Speaker Bio

Sudhanva Gurumurthi is a Fellow at AMD, where he is responsible for research and advanced development in RAS. His work has impacted numerous AMD products, multiple industry standards, and external research in the field. Before joining industry, Sudhanva was an Associate Professor in the Computer Science Department at the University of Virginia. He currently serves as the Editor-in-Chief of IEEE Computer Architecture Letters and on the College of Science and Engineering advisory board at Texas State University. Sudhanva is the recipient of an NSF CAREER Award and a Google Focused Research Award, and has been named to the ISCA Hall of Fame. He received his PhD in Computer Science and Engineering from Penn State in 2005.