September 18 '23

Multi-AI collaboration helps reasoning and factual accuracy in large language models

A novel approach allows multiple language models to collaborate, debating over several rounds, to converge on a unified and refined response (Credit: Alex Shipps/MIT CSAIL via Midjourney).

An age-old adage, often introduced to us during our formative years, is designed to nudge us beyond our self-centered, nascent minds: "Two heads are better than one." This proverb encourages collaborative thinking and highlights the potency of shared intellect.

Fast forward to 2023, and we find that this wisdom holds true even in the realm of artificial intelligence: Multiple language models, working in harmony, are better than one.

Recently, a team from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) embodied this ancient wisdom within the frontier of modern technology. They introduced a strategy that leverages multiple AI systems to discuss and argue with each other to converge on a best-possible answer to a given question. This method empowers these expansive language models to heighten their adherence to factual data and refine their decision-making.

The crux of the problem with large language models (LLMs) lies in the inconsistency of their generated responses, leading to potential inaccuracies and flawed reasoning. This new approach lets each agent actively assess every other agent’s responses, and uses this collective feedback to refine its own answer. In technical terms, the process consists of multiple rounds of response generation and critique. Each language model generates an answer to the given question, and then incorporates the feedback from all other agents to update its own response. This iterative cycle culminates in a final output from a majority vote across the models' solutions. It somewhat mirrors the dynamics of a group discussion — where individuals contribute to reach a unified and well-reasoned conclusion.
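The rounds of generation, critique, and voting described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the `Agent` interface, the prompt wording, the tie-breaking rule, and the two stub agents are all assumptions made for the example. A real agent would wrap a call to a language model.

```python
from collections import Counter
from typing import Callable, List

# Assumption: an agent is any function from prompt text to answer text.
Agent = Callable[[str], str]


def build_critique_prompt(question: str, own: str, others: List[str]) -> str:
    """Show an agent its own answer plus every other agent's answer."""
    lines = [
        f"Question: {question}",
        f"Your previous answer: {own}",
        "Other agents answered:",
    ]
    lines += [f"- {answer}" for answer in others]
    lines.append("Using these answers as feedback, give your updated answer.")
    return "\n".join(lines)


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(a.strip() for a in answers).most_common(1)[0][0]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run an initial answer round, then `rounds` critique rounds, then vote."""
    # Initial round: every agent answers independently.
    answers = [agent(question) for agent in agents]
    for _ in range(rounds):
        # Each agent sees all the other agents' latest answers and updates its own.
        answers = [
            agent(build_critique_prompt(question, answers[i],
                                        answers[:i] + answers[i + 1:]))
            for i, agent in enumerate(agents)
        ]
    return majority_vote(answers)


# Hypothetical stand-ins for real language models, used only to run the sketch.
def confident_agent(prompt: str) -> str:
    """Always answers "4", whatever the prompt says."""
    return "4"


def swayable_agent(prompt: str) -> str:
    """Starts with a wrong answer, then adopts the most common peer answer."""
    peers = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
    return majority_vote(peers) if peers else "5"
```

Calling `debate("What is 2 + 2?", [confident_agent, confident_agent, swayable_agent])` runs one independent round and two critique rounds before voting. Because the agents only exchange text, the same loop would work with any model that accepts a prompt and returns a string.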

One real strength of the approach lies in its seamless application to existing black-box models. Because the method operates only on generated text, it can be implemented across various LLMs without needing access to their internal workings. This simplicity, the team says, could help researchers and developers use the tool to improve the consistency and factual accuracy of language model outputs across the board.

"Employing a novel approach, we don't simply rely on a single AI model for answers. Instead, our process enlists a multitude of AI models, each bringing unique insights to tackle a question. Although their initial responses may seem truncated or may contain errors, these models can sharpen and improve their own answers by scrutinizing the responses offered by their counterparts," says Yilun Du, an MIT PhD student in electrical engineering and computer science, affiliate of MIT CSAIL, and lead author on a new paper about the work. "As these AI models engage in discourse and deliberation, they're better equipped to recognize and rectify issues, enhance their problem-solving abilities, and better verify the precision of their responses. Essentially, we're cultivating an environment that compels them to delve deeper into the crux of a problem. This stands in contrast to a single, solitary AI model, which often parrots content found on the internet. Our method, however, actively stimulates the AI models to craft more accurate and comprehensive solutions."

The researchers tested the method on mathematical problem-solving, including grade school and middle/high school math problems, and saw a significant boost in performance through the multi-agent debate process. The language models also became better at evaluating arithmetic expressions accurately, illustrating potential across different domains.

The method can also help address the issue of "hallucinations" that often plague language models. By designing an environment where agents critique each other's responses, the researchers gave the models a stronger incentive to avoid spitting out random information and to prioritize factual accuracy.

Beyond its application to language models, the approach could also be used for integrating diverse models with specialized capabilities. By establishing a decentralized system where multiple agents interact and debate, researchers could potentially extend the same collaborative problem-solving to other modalities like speech, video, or text.

While the methodology yielded encouraging results, the researchers say that existing language models may face challenges with processing very long contexts, and the critique abilities may not be as refined as desired. Furthermore, the multi-agent debate format, inspired by human group interaction, has yet to incorporate the more complex forms of discussion that contribute to intelligent collective decision-making — a crucial area for future exploration, the team says. Advancing the technique could involve a deeper understanding of the computational foundations behind human debates and discussions, and using those models to enhance or complement existing LLMs.

"Not only does this approach offer a pathway to elevate the performance of existing language models, but it also presents an automatic means of self-improvement. By utilizing the debate process as supervised data, language models can enhance their factuality and reasoning autonomously, reducing reliance on human feedback and offering a scalable approach to self-improvement," says Du. "As researchers continue to refine and explore this approach, we can get closer to a future where language models not only mimic human-like language but also exhibit more systematic and reliable thinking, forging a new era of language understanding and application."

"It makes so much sense to use a deliberative process to improve the model's overall output, and it's a big step forward from chain-of-thought prompting," says Anca Dragan, associate professor at the University of California at Berkeley’s Department of Electrical Engineering and Computer Sciences, who was not involved in the work. "I'm excited about where this can go next. Can people better judge the answers coming out of LLMs when they see the deliberation, whether or not it converges? Can people arrive at better answers by themselves deliberating with an LLM? Can a similar idea be used to help a user probe a LLM's answer in order to arrive at a better one?"

Du wrote the paper with three CSAIL affiliates: Shuang Li SM '20, PhD '23; MIT professor of electrical engineering and computer science Antonio Torralba; and MIT professor of computational cognitive science and Center for Brains, Minds, and Machines member Joshua Tenenbaum. Google DeepMind researcher Igor Mordatch was also a co-author.