Embodied Intelligence (EI) Joint Seminar Presentation

Speakers

Isha Puri & Shannon Shen
MIT CSAIL

Host

Yoon Kim
MIT CSAIL

There will be a joint presentation this week by two MIT CSAIL PhD candidates.

Title:  A Probabilistic Inference Approach to Inference-Time Scaling of LLMs

Abstract:  Large language models (LLMs) have achieved significant performance gains by scaling up model size and/or training data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling up the computation spent at inference time instead. Existing inference-time scaling methods, which usually rely on reward models, cast the task as a search problem and are therefore vulnerable to reward hacking caused by approximation errors in the reward models. In this work, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimizing for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods achieve a 4-16x better scaling rate than deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, Qwen2.5-Math-1.5B-Instruct surpasses GPT-4o accuracy in only 4 rollouts, and Qwen2.5-Math-7B-Instruct reaches o1-level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature on probabilistic inference with inference-time scaling of LLMs, paving the way for more robust algorithms in future work. Code, videos, and further information are available at https://probabilistic-inference-scaling.github.io/.
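The core resampling step of a particle-based Monte Carlo method can be sketched as follows. This is a generic illustration, not the authors' implementation: the particle strings and reward scores are hypothetical stand-ins for partial LLM generations and reward-model outputs.

```python
import math
import random

def resample_particles(particles, rewards, rng=None):
    """One resampling step of a particle filter: draw a new population of
    the same size, choosing each particle in proportion to the softmax of
    its reward, so promising partial generations are duplicated and weak
    ones tend to be dropped."""
    rng = rng or random.Random(0)
    m = max(rewards)                              # stabilize the exponentials
    weights = [math.exp(r - m) for r in rewards]  # unnormalized softmax
    return rng.choices(particles, weights=weights, k=len(particles))

# Toy illustration: four partial generations with hypothetical reward scores.
particles = ["step A", "step B", "step C", "step D"]
rewards = [0.1, 2.0, 0.2, 1.5]
new_particles = resample_particles(particles, rewards)
```

Because the population is sampled rather than pruned to the single highest-reward candidate, the method explores the typical set of the reward-weighted distribution instead of committing to its (possibly reward-hacked) mode.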

Bio:  Isha Puri is a PhD student at MIT CSAIL co-advised by Professors Yoon Kim and Marzyeh Ghassemi. She graduated from Harvard University in 2023 with a B.A. in Applied Mathematics and Computer Science, where she was an HBS Technology Innovation Fellow. Isha's work has been published at venues such as NeurIPS, EMNLP, TMLR, and ICML. Her current interests lie in studying how language models can be efficiently integrated into real-world workflows via AI agents in a variety of contexts.


Title:  Designing and Evaluating LLM Agents through the lens of Collaborative Effort Scaling

Abstract:  Agents powered by large language models are increasingly used to support end users in real-world tasks such as code generation and data analysis. These complex tasks typically require multi-turn, long-form collaboration between humans and agents. Since human involvement is an innate part of this process, we argue that a primary objective of a helpful agent is to effectively leverage human effort to improve performance, which we call collaborative effort scaling. Unlike existing outcome-based agent benchmarks, we propose two new dimensions for evaluating the collaboration process, inspired by case studies from five domains: (1) scalability - how agents continuously improve with additional human involvement, and (2) feasibility - how much human effort agents can utilize before users stop the interaction. We conduct controlled simulation experiments on tabular analysis and travel planning agents, which reveal limitations in existing agent systems' ability to leverage iterative feedback. Our analysis shows that the evaluation framework can provide helpful insights both for agent developers seeking to optimize collaborative capabilities and for end users selecting appropriate interaction strategies with agents.
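One simple way to make the scalability dimension concrete is to track the marginal performance gain per additional round of human feedback. This is a hypothetical illustration, not a metric from the paper, and the task-success scores below are invented.

```python
def marginal_gains(scores):
    """Per-turn improvement: the difference between consecutive task-success
    scores as the user provides additional rounds of feedback. Gains that
    shrink toward zero indicate the agent has stopped converting extra
    human effort into better performance."""
    return [round(b - a, 2) for a, b in zip(scores, scores[1:])]

# Hypothetical task-success scores after 0..4 rounds of human feedback.
scores = [0.40, 0.55, 0.62, 0.64, 0.64]
gains = marginal_gains(scores)  # [0.15, 0.07, 0.02, 0.0]
```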

Bio:  Shannon Shen is a 3rd-year PhD student at MIT CSAIL, advised by Prof. David Sontag. His research lies at the intersection of NLP and HCI. He currently focuses on understanding human-LLM agent collaboration and building methods to support it, e.g., improving the verifiability of LLM generations. He has also developed impactful document-parsing software that has been downloaded more than 5 million times and has won best paper awards at EMNLP and ACL.