ML Tea: PDDL-Instruct: Enhancing Symbolic Planning Capabilities in LLMs through Logical Chain-of-Thought Instruction Tuning / Incentive-Aware Dynamic Pricing for Constrained Resource Allocation with Strategic Agents

Speakers: Pulkit Verma and Yan Dai

Bio 1 – Pulkit Verma is a Postdoctoral Associate in the Interactive Robotics Group at the Massachusetts Institute of Technology, where he works with Prof. Julie Shah. His research focuses on the safe and reliable behavior of taskable AI agents. He investigates the minimal set of requirements in an AI system that would enable a user to assess and understand the limits of its safe operability. He received his Ph.D. in Computer Science from Arizona State University, where he worked with Prof. Siddharth Srivastava. Before that, he completed his M.Tech. in Computer Science and Engineering at IIT Guwahati with Prof. Pradip K. Das. He received the AAAI/ACM SIGAI Innovative AI Education Award at AAAI's EAAI Symposium in 2025, the Graduate College Completion Fellowship at ASU in 2023, a Post Graduation Scholarship from the Government of India in 2013 and 2014, and the Best Demo Award at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) in 2022.

Bio 2 – Yan Dai is a second-year PhD student in Operations Research, co-advised by Prof. Patrick Jaillet and Prof. Negin Golrezaei. His recent research focuses on tackling EconCS challenges with tools from online learning. He is also interested in bandits, reinforcement learning theory, and optimization for deep learning. He is part of the COLT, ICML, NeurIPS, and ICLR communities, and he won the Best Paper Award at ACM SIGMETRICS 2025.

Abstract 1 – Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, yet their ability to perform structured symbolic planning remains limited, particularly in domains requiring formal representations like the Planning Domain Definition Language (PDDL). In this paper, we present a novel instruction tuning framework designed to enhance LLMs' symbolic planning capabilities through logical chain-of-thought reasoning. Our approach focuses on teaching models to rigorously reason about action applicability, state transitions, and plan validity using explicit logical inference steps. By developing instruction prompts that guide models through the precise logical reasoning required to determine when actions can be applied in a given state, we enable LLMs to self-correct their planning processes through structured reflection. The framework systematically builds verification skills by decomposing the planning process into explicit reasoning chains about precondition satisfaction, effect application, and invariant preservation. Experimental results on multiple planning domains show that models instruction-tuned with our chain-of-thought reasoning framework are significantly better at planning, achieving planning accuracy of up to 94% on standard benchmarks, representing a 66% absolute improvement over baseline models. This work bridges the gap between the general reasoning capabilities of LLMs and the logical precision required for automated planning, offering a promising direction for developing better AI planning systems.
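To make the verification steps in the abstract concrete, here is a minimal sketch (not the authors' code) of the state-transition check that the logical chain-of-thought prompts walk a model through: confirm that an action's preconditions hold in the current state, then apply its add and delete effects. It assumes a STRIPS-style representation, and the predicate and action names are hypothetical.

```python
# Minimal sketch of plan validation via precondition checks and effect application.
# States are sets of ground atoms; actions are (preconditions, add, delete) triples.

State = frozenset  # e.g. frozenset({("on-table", "a"), ("clear", "a")})

def action_applicable(state: State, preconditions: set) -> bool:
    """An action is applicable iff every precondition holds in the state."""
    return preconditions <= state

def apply_action(state: State, preconditions: set, add: set, delete: set) -> State:
    """Return the successor state, or raise if the action is not applicable."""
    if not action_applicable(state, preconditions):
        raise ValueError("precondition violated")
    return State((state - delete) | add)

def validate_plan(state: State, plan: list) -> bool:
    """Check a candidate plan step by step -- the check the reasoning chain verbalizes."""
    for pre, add, delete in plan:
        if not action_applicable(state, pre):
            return False
        state = apply_action(state, pre, add, delete)
    return True

# Toy Blocksworld-style example with hypothetical atoms.
init = State({("on-table", "a"), ("clear", "a"), ("handempty",)})
pickup_a = ({("on-table", "a"), ("clear", "a"), ("handempty",)},   # preconditions
            {("holding", "a")},                                    # add effects
            {("on-table", "a"), ("clear", "a"), ("handempty",)})   # delete effects
print(validate_plan(init, [pickup_a]))  # True
```

In the framework described above, the model is prompted to verbalize exactly these checks as explicit inference steps rather than executing them in code.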

Abstract 2 – Motivated by applications such as cloud platforms allocating GPUs to users or governments deploying mobile health units across competing regions, we study the dynamic allocation of a reusable resource to strategic agents with private valuations. Our objective is to simultaneously (i) maximize social welfare, (ii) satisfy multi-dimensional long-term cost constraints, and (iii) incentivize truthful reporting. We begin by numerically evaluating primal-dual methods widely used in constrained online optimization and find them to be highly fragile in strategic settings -- agents can easily manipulate their reports to distort future dual updates to their own advantage. To address this vulnerability, we develop an incentive-aware framework that makes primal-dual methods robust to strategic behavior. Our design combines epoch-based lazy updates -- where dual variables remain fixed within each epoch -- with randomized exploration rounds that extract approximately truthful signals for learning. Leveraging carefully designed online learning subroutines for the dual updates, which may be of independent interest, our mechanism achieves $\tilde O(\sqrt T)$ social welfare regret, satisfies all cost constraints, and ensures incentive alignment. This matches the performance of non-strategic allocation approaches while being robust to strategic agents.
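The sketch below illustrates the epoch-based lazy-update idea from the abstract in schematic form: dual prices stay fixed within each epoch, a small fraction of rounds are randomized exploration, and duals are updated only at epoch boundaries. This is not the paper's mechanism; the update rule, parameter names, and simulated reports are illustrative assumptions.

```python
# Schematic sketch of epoch-based lazy primal-dual allocation with exploration rounds.
# All parameters and the specific dual-update rule are illustrative, not from the paper.

import random

def run_mechanism(T, epoch_len, costs, budgets, eta=0.1, explore_prob=0.05, seed=0):
    """costs[j]: per-allocation consumption of resource j; budgets[j]: per-round long-run cap."""
    rng = random.Random(seed)
    duals = [0.0 for _ in budgets]          # one dual variable per cost constraint
    spend = [0.0 for _ in budgets]
    welfare = 0.0
    for t in range(T):
        valuation = rng.random()            # stand-in for an agent's (possibly strategic) report
        if rng.random() < explore_prob:
            allocate = rng.random() < 0.5   # exploration round: decision ignores the report
        else:
            price = sum(d * c for d, c in zip(duals, costs))
            allocate = valuation >= price   # allocate iff reported value beats the dual price
        if allocate:
            welfare += valuation
            for j, c in enumerate(costs):
                spend[j] += c
        if (t + 1) % epoch_len == 0:        # lazy update: duals change only at epoch ends
            for j in range(len(duals)):
                avg_violation = spend[j] / (t + 1) - budgets[j]
                duals[j] = max(0.0, duals[j] + eta * avg_violation)
    return welfare, spend, duals

print(run_mechanism(T=10000, epoch_len=100, costs=[1.0], budgets=[0.3]))
```

Freezing the duals within an epoch is what blunts the manipulation described in the abstract: a single misreport can no longer immediately shift the prices an agent will face in the next round.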