When Benchmarks Lie: >50% Error Rates, Misleading Rankings, and Unstable Training [Zoom Talk]
Abstract: "For better or worse, benchmarks shape a field" and this is also true for progress in AI for data systems. Take text-to-SQL: the BIRD leaderboard has ~100 AI agents from groups ranging from Stanford to Google. Can we trust these leaderboards as researchers looking to develop new techniques or practitioners looking to choose high-performing agents?
Unfortunately, we cannot. We show that text-to-SQL leaderboards, including BIRD and Spider 2.0, have >50% error rates! We further show that these errors distort leaderboard rankings: rankings on the noisy data correlate poorly with rankings on clean data, with rank correlations as low as 0.3. We also show that AI agents, when trained with reinforcement learning, can learn incorrect patterns or collapse, depending on the kind of noise. Finally, we show that AI agents still struggle on a range of challenging data benchmarks. Our results highlight the pressing need for high-quality data to push the field forward.
Bio: Daniel is a professor of computer science at UIUC, where he works on everything related to AI agents and data. His lab has recently focused on understanding what AI agent benchmarks really measure and how that affects downstream performance, both for practitioners who want to select high-performing agents and for agent developers. Daniel's work is supported by the Google ML and Systems Junior Faculty Award, Bridgewater AIA Labs, and others.
----
Please reach out to markakis@mit.edu for the Zoom password.