As large language models (LLMs) like ChatGPT continue to advance, user expectations keep growing, including how quickly the models can respond to our increasingly intricate prompts, which ask them to tackle ever more challenging problems and tasks.
Conventional LLMs rely on the concept of “autoregressive decoding,” where each item (“token”) in a sequence is predicted based on previously generated outputs. This approach inevitably leads to delays for more complicated prompts, though researchers have tried to mitigate this with projects that leverage the parallelism of multicore computer chips more effectively. For example, speculative decoding uses a fast draft model to propose tokens that are then verified in parallel by a slower, high-quality model. A newer class of methods instead exploits “semantic independence,” identifying syntactic patterns like bullet points and expanding each in parallel. But these methods rely on hand-crafted syntactic heuristics, which are brittle and often fail when responses deviate from expected formats.
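To make the bottleneck concrete, the sketch below shows the one-token-at-a-time loop that autoregressive decoding boils down to. The `model` object and its `next_token_logits` method are illustrative placeholders, not any particular library's API.

```python
# Minimal sketch of greedy autoregressive decoding (illustrative placeholders,
# not a specific library's API).

def autoregressive_decode(model, prompt_tokens, eos_id, max_new_tokens=256):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Each step depends on the *entire* prefix generated so far,
        # so the steps cannot run in parallel with one another.
        logits = model.next_token_logits(tokens)
        next_token = max(range(len(logits)), key=lambda t: logits[t])
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```

Because every iteration waits on the previous one, long responses translate directly into long wait times; speculative decoding and the semantic-independence methods described above attack that dependency chain from different angles.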
These shortcomings inspired researchers at MIT’s Computer Science and Artificial Intelligence Lab (CSAIL) and Google to use a learning-based approach to parallel decoding. Instead of relying on fixed rules, their method trains LLMs to recognize semantic independence—that is, to identify and decode semantically independent chunks of text in parallel.
The result: pasta.
Specifically, the CSAIL team’s Parallel Structure Annotation (PASTA) enables LLMs to generate text in parallel, dramatically accelerating their response times. Unlike previous attempts that relied on rigid, hand-coded rules to identify independent text segments, PASTA teaches LLMs to inherently understand and express these parallelization opportunities within their own responses. This approach — called learned asynchronous decoding — marks a shift toward teaching models to orchestrate their own parallel decoding strategy.
"Traditional LLMs are like a single cook making lasagna, one step at a time," explained Tian Jin, lead author of a new paper on the project that will be presented in July at the International Conference on Machine Learning (ICML) in Vancouver. "PASTA teaches the cook to recognize when different parts of the lasagna can be prepared simultaneously, like mixing a subset of ingredients while the oven preheats, leading to a much faster process overall."
This innovation tackles a fundamental bottleneck in LLM inference, where the sequential nature of decoding often results in underutilized hardware and lengthy wait times for users. Current LLMs can take seconds or even minutes to fulfill user requests, a latency issue that PASTA aims to resolve.
At the heart of PASTA are two main components: PASTA-LANG, an annotation language that allows LLMs to tag semantically independent parts of their responses, and an interpreter that acts on these tags to orchestrate parallel decoding during inference. As Jin explains, you can think of PASTA-LANG as a set of instructions the LLM writes for itself, marking sections of its output that can be worked on simultaneously. The interpreter then reads these instructions and manages the parallel generation of those sections.
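The paper defines PASTA-LANG's actual tags and the interpreter's scheduling policy; the snippet below is only a rough sketch of the idea, with a made-up `<parallel .../>` tag and a generic `decode_fn` standing in for the real annotation language and decoder.

```python
# Rough conceptual sketch: the tag name, attribute, and decode_fn below are
# hypothetical stand-ins, not the actual PASTA-LANG syntax or interpreter.
import re
from concurrent.futures import ThreadPoolExecutor

PARALLEL_TAG = re.compile(r'<parallel prompt="(.*?)"/>')  # hypothetical tag

def interpret(annotated_text, decode_fn):
    """Expand every span the model tagged as independent, in parallel,
    then splice the expansions back into the response."""
    prompts = PARALLEL_TAG.findall(annotated_text)
    with ThreadPoolExecutor() as pool:
        expansions = list(pool.map(decode_fn, prompts))
    result = annotated_text
    for prompt, expansion in zip(prompts, expansions):
        result = result.replace(f'<parallel prompt="{prompt}"/>', expansion, 1)
    return result
```

In a real serving stack the parallel work would be batched onto the same accelerator rather than farmed out to threads, but the control flow is the same: the model writes the plan, and the interpreter executes it.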
The team trained LLMs to generate these PASTA-LANG annotations through a two-stage fine-tuning process. This training not only optimizes for decoding speed, but also roughly maintains, and in some cases improves, the quality of the generated responses. This dual optimization is a significant leap forward, as it enables continued improvements in both speed and quality as more training compute becomes available.
In experiments with PASTA on the AlpacaEval benchmark, the team’s self-parallelizing model showed geometric mean speedups of nearly 2x while response quality changed only slightly (from a gain of 2 percent to a drop of 7 percent). This means users can expect responses nearly twice as fast without a noticeable decrease in accuracy or coherence.
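For readers unfamiliar with the metric, a geometric mean speedup is the n-th root of the product of per-prompt speedups; the figures below are invented purely to illustrate the arithmetic and are not results from the paper.

```python
# Geometric mean of per-prompt speedups; the numbers are made up for
# illustration and are not results from the paper.
from math import prod

speedups = [1.6, 2.1, 1.9, 2.3]
geo_mean = prod(speedups) ** (1 / len(speedups))
print(f"geometric mean speedup: {geo_mean:.2f}x")  # ~1.96x
```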
“It was surprising to see this behavior of having an LLM orchestrate its own inference-time behavior,” Jin says. “It was illuminating — and in a way, magical — to see how throwing more compute at these algorithms yields increasingly sophisticated self-orchestration behavior.”
The research highlights a critical challenge in the field: balancing speed and quality. Prior methods such as Skeleton-of-Thought (SoT) and APAR attempted parallel decoding by looking for manually specified syntactic structures like bullet points or paragraphs. However, these methods were often rigid and imprecise, failing to identify parallelization opportunities when responses deviated even slightly from expected patterns. PASTA's learning-based approach, in contrast, offers a more robust and scalable solution.
"It's about empowering the LLM to be smarter about how it generates content," says Jin, a PhD student at CSAIL. "Instead of us trying to guess where it can work in parallel, we're teaching the LLM to identify those opportunities itself, on the fly."
Looking ahead, the team is optimistic about the broader implications of PASTA. The ability to significantly reduce LLM decoding latency could lead to reduced computational resource requirements, making these powerful AI models more accessible and affordable to a wider range of users and applications.
“We’ve essentially designed a protocol for an LLM to optimize itself,” says Jin. “By improving the efficiency of LLM inference, PASTA could significantly reduce computational resource requirements and improve accessibility of LLMs.”
Jin spearheaded the project alongside his two faculty advisers, MIT professors Michael Carbin and Jonathan Ragan-Kelley. Other paper co-authors include CSAIL’s Ellie Y. Cheng and Zack Ankner, and Google researchers Suvinay Subramanian, Nikunj Saunshi, Blake M. Elias, and Amir Yazdanbakhsh.
The work was supported in part by the Sloan Foundation and SRC JUMP 2.0 (CoCoSys), as well as the Google CoreML Performance Team, Google Research and Google DeepMind.