Graphic: people working on a chatbot that uses Monte Carlo Tree Search.

MCTS meets LLMs: Enabling Complex Reasoning and Strategic Planning


In the dynamic world of artificial intelligence, Large Language Models (LLMs) have emerged as groundbreaking tools, offering exciting possibilities for innovation and research. However, their effectiveness is often hampered by limitations in handling tasks that demand a deeper understanding beyond text or require nuanced common-sense reasoning and extensive world knowledge. Addressing these limitations is crucial for advancing towards the goal of Artificial General Intelligence (AGI).

This blog post introduces a novel framework designed to empower LLMs with more sophisticated decision-making abilities through the strategic use of advanced planning algorithms. We exemplify this approach with a case study on Visual Question Answering (VQA) using Monte Carlo Tree Search (MCTS), demonstrating how the framework enables LLMs to act more autonomously and effectively within an environment.

LLMs as Agents

The term “agents“ might sound familiar to many who have followed the Machine Learning (ML) community for a while, since it was already frequently used in the context of Reinforcement Learning (RL) back when DeepMind introduced AlphaGo and AlphaZero. The agents back then were relatively small neural networks compared to the LLM powerhouses we have today. Thus, the question arises: why not use even more powerful networks in the form of LLMs?

Instructing and Reasoning

An agent executes actions and observes their consequences, i.e., how the world changes and what feedback it receives – which can be thought of as a reward or punishment. Doing so requires two “skills“ from an LLM: understanding instructions and reasoning about the available information. Since we are all familiar with GPT by now, we know exactly what that means: you input text, a.k.a. the prompt, and you get text as output.

Understanding instructions is crucial not only for grasping the task at hand but also for precisely following the instructions written in the prompt. Both recognizing the task and its demands are intuitive for humans but much harder to instill in an ML model.

Left: the standard way of questioning an LLM. Right: eliciting “thought“ behavior as part of the prompt.

Given that some task needs to be done, we want to instruct the model with a simple prompt. But what does simple even mean? In the figure above, on the left side, we can see that a seemingly simple task is answered incorrectly. This prompted the research community to explore prompting techniques that tackle such mistakes without touching the model parameters – which, in the case of proprietary models, might not even be possible.

The approach on the right, called Chain-of-Thought prompting, is one such example: it changes the prompt structure and aims to elicit a more thoughtful response. Note that we still deal with a black-box model that is far from perfect, but it is a good start towards making the LLM behave in an intended way.
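
To make the difference concrete, here is a minimal sketch of the two prompt styles. The `ask_llm` helper is a hypothetical stand-in for whatever chat-completion client you use; only the prompt changes between the two calls.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion client."""
    # Replace this with a real API call; here we just echo for illustration.
    return f"<model output for: {prompt[:40]}...>"

question = ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
            "How many balls does he have now?")

# Standard prompting: the question alone.
standard_answer = ask_llm(question)

# Chain-of-Thought prompting: the same question, plus an instruction that
# elicits intermediate reasoning steps before the final answer.
cot_answer = ask_llm(question + "\nLet's think step by step.")
```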

MCTS as Environment and Tools

Now that we have talked about the agent, we need to define the “world“, or environment, it operates in: the rules and settings of the world the agent interacts with. For finite, well-defined games like chess, this is quite simple and can be hardcoded (we ignore the value estimation process since it is out of scope), as we can check which moves are illegal, who won in the end, etc. To plan strategically, one needs to foresee the possible consequences of actions, preferably tested in parallel, and such simulations – or rollouts, as RL enthusiasts might remember – should be given a reasonable reward.
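
To make this RL framing concrete, here is one way such an environment contract could look in code. The class and method names are our own illustration, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """What the environment hands back after an action: the new state,
    a scalar reward, and whether the episode (e.g. the game) is over."""
    state: object
    reward: float
    terminal: bool

class Environment:
    """Illustrative contract for a hardcodable, well-defined world like chess."""

    def legal_actions(self, state) -> list:
        """Enumerate the moves that the rules allow in this state."""
        ...

    def apply(self, state, action) -> Step:
        """Apply an action and return the resulting state, reward, and
        whether a terminal state (win/loss/draw) was reached."""
        ...
```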

Left: RL formulation of action, state, and reward with the LLM as agent. Environment dynamics are handled by MCTS, which updates the tree based on the LLM’s decisions. Right: the four stages of MCTS that grow the tree, progressing and learning over time.

Repeating this process of exploring different actions and analyzing their consequences iteratively allows the agent to refine its decision-making, much like trial-and-error learning.
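
The four MCTS stages from the figure – selection, expansion, simulation, and backpropagation – can be sketched as follows, reusing the illustrative `Environment` from above. This is the textbook variant with UCT selection and random rollouts, not our exact implementation:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def uct(node, c=1.4):
    """Upper Confidence Bound for Trees: balances exploitation (mean reward)
    against exploration (rarely visited children)."""
    if node.visits == 0:
        return float("inf")
    return (node.total_reward / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def rollout(state, env, depth=10):
    """Simulation: a cheap random playout used as a value estimate."""
    total = 0.0
    for _ in range(depth):
        actions = env.legal_actions(state)
        if not actions:
            break
        step = env.apply(state, random.choice(actions))
        state, total = step.state, total + step.reward
        if step.terminal:
            break
    return total

def mcts_iteration(root, env):
    # 1. Selection: descend to a leaf, always picking the child with best UCT.
    node = root
    while node.children:
        node = max(node.children, key=uct)
    # 2. Expansion: add one child per legal action, then pick one to evaluate.
    actions = env.legal_actions(node.state)
    if actions:
        for a in actions:
            node.children.append(Node(env.apply(node.state, a).state, parent=node))
        node = random.choice(node.children)
    # 3. Simulation: estimate the value of the new node with a rollout.
    reward = rollout(node.state, env)
    # 4. Backpropagation: propagate the reward up to the root.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```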

Additionally, actions might sometimes require functionality or capabilities not directly built into the agent. Hence, researchers have tried to pair LLMs with tools. These tools can take various forms, including web search, API calls, and, more recently, instructing robotic systems through textual commands provided by the LLM itself, thereby bridging the gap between software and hardware.
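
A thin dispatch layer is often all it takes to wire such tools to the LLM: the model emits a tool name plus an argument as text, and plain code executes it. The tool names and bodies below are invented for illustration:

```python
def web_search(query: str) -> str:
    # Placeholder: in practice, call a search API and summarize the hits.
    return f"Top search results for: {query}"

def robot_command(command: str) -> str:
    # Placeholder: in practice, forward the textual command to a controller.
    return f"Robot acknowledged: {command}"

TOOLS = {"web_search": web_search, "robot": robot_command}

def execute_tool(name: str, argument: str) -> str:
    """Run the tool the LLM asked for, or report that it does not exist."""
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](argument)
```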

Case Study WebQA: Tackling a Multi-modal, Multi-hop Question Answering Benchmark

Wow, that was a mouthful! We will first break down what we mean by that – what exactly the task is about and its characteristics – as well as how an LLM can be seen as an agent that tackles it.

Task Description

WebQA is a VQA benchmark: the objective is to answer textual questions by leveraging images as potential contextual cues, which may contain partial or complete information required to answer the overall question.

Multimodal denotes that the input encompasses not only textual data but also visual elements. Multi-hop, on the other hand, signifies the necessity of gathering information from multiple sources – be it text or image – to obtain all the necessary data. Lastly, the predicted answer is expected to be fluent text, which poses a greater challenge than a multiple-choice counterpart: you not only have to come up with a reasonable answer but also one that matches the ground truth as closely as possible.

Example of a WebQA question, showing all possible sources to choose from. The correct ones are indicated by green check marks.

The figure above shows all the elements we just discussed. Given many potential sources, the task involves identifying the correct ones, followed by a process of reasoning and ultimately formulating an answer based on the information gathered.

Why is it hard (and important)?

At first glance, VQA sounds a lot simpler than chess: we look at texts and images and have to answer some questions. As a human, one would first identify relevant sources (retrieval stage) and then combine the knowledge from those sources to answer the question (question-answering stage). While humans can indeed do those tasks relatively well, LLMs struggle for multiple reasons:

  1. Lack of information: Although trained on an impressive amount of text, an LLM probably cannot answer complex questions from internal knowledge alone. Thus, it needs access to an outside database, since in its plain form it can only rely on its “internal“ knowledge.
  2. Understanding images: Compounding the first point, plain LLMs are not capable of interpreting images at all. There are now more capable multimodal models; however, they are still a work in progress. Integrating different modalities is challenging since it raises even more considerations regarding how you train the models, how you evaluate them, etc.

Upon overcoming those challenges, we will have models adept at comprehending both text and images and capable of reasoning over them. With such capabilities, we can harness these powerful machines to automate tasks and provide support across various domains.

Approach

Now, let’s consolidate all aspects and components, bringing the LLM as VQA Agent to life! From a high level, the structure mirrors the classic ML approach: feed some inputs into an algorithm and expect some output.

In our scenario, this black box is the fusion of the LLM acting as an agent within the MCTS environment we defined. Here, MCTS dynamically constructs a tree structure based on the actions undertaken by the LLM, thereby generating new states that either add new information or lead to a conclusive answer to the overarching question. Notably, there can be multiple “Answer“ states, which are called terminal states; the one with the highest total reward is chosen when the algorithm stops.
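
In tree terms, picking the final answer once the search stops could look like this – a small sketch assuming each node additionally carries a `terminal` flag alongside the `children` and `total_reward` fields from the MCTS sketch above:

```python
def best_answer(root):
    """Collect all terminal ("Answer") nodes and return the one with the
    highest total reward; `terminal` and `total_reward` are assumed fields."""
    terminals, stack = [], [root]
    while stack:
        node = stack.pop()
        if getattr(node, "terminal", False):
            terminals.append(node)
        stack.extend(node.children)
    return max(terminals, key=lambda n: n.total_reward, default=None)
```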

The reward itself is determined by the LLM, which evaluates the utility of the action within the current state. While this method is straightforward and cost-effective, it is suboptimal and presents an opportunity for improvement.
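
A minimal sketch of this self-evaluation, with `ask_llm` again standing in for an LLM client as in the earlier sketch; the prompt wording and the 0-to-10 scale are our illustrative choices, not a fixed recipe:

```python
def llm_reward(question: str, state_summary: str, action: str) -> float:
    """Ask the LLM itself to score how useful an action is in this state."""
    prompt = (
        f"Question: {question}\n"
        f"Evidence gathered so far: {state_summary}\n"
        f"Proposed action: {action}\n"
        "On a scale from 0 to 10, how useful is this action for answering "
        "the question? Reply with a single number."
    )
    try:
        return float(ask_llm(prompt).strip()) / 10.0  # normalize to [0, 1]
    except ValueError:
        return 0.0  # fall back if the model does not return a clean number
```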

The retrieval system, so to speak, functions as the toolset for tasks beyond the capabilities of the LLM. Ours comprises a standard vector database, facilitated by an embedder, plus a vision-language model that translates information from the image to the text domain. Answering, conversely, is a purely textual task, requiring no external tool.
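
In code, the rough shape of this toolset might be as follows; `embed`, `caption`, and `is_image` are placeholders for a sentence embedder, a vision-language model, and a simple type check:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a sentence-embedding model mapping text to a vector.
    raise NotImplementedError

def caption(image) -> str:
    # Placeholder: a vision-language model describing the image in text.
    raise NotImplementedError

def is_image(source) -> bool:
    # Placeholder: distinguish image sources from text snippets.
    raise NotImplementedError

class VectorDB:
    def __init__(self):
        self.entries = []  # (embedding, text) pairs

    def add(self, source) -> None:
        # Images are translated to the text domain before indexing.
        text = caption(source) if is_image(source) else source
        self.entries.append((embed(text), text))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Return the k entries most similar to the query (cosine similarity).
        q = embed(query)
        def cosine(v):
            return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        ranked = sorted(self.entries, key=lambda e: cosine(e[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```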

The intuition behind this approach lies in the LLM’s ability to make decisions regarding two key aspects:

1) Assessing the quality of a state, i.e., how relevant the current context is for answering the question.
2) Evaluating whether additional information is required, identifying missing pieces, and formulating queries to address these gaps (see the sketch below).

This incremental nature aligns seamlessly with the multi-hop requirement of WebQA, where multiple pieces of information may be needed to fully address a question.
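
This second decision can be phrased as a single prompt – again a sketch, with `ask_llm` as the hypothetical client from before:

```python
def formulate_query(question: str, evidence: list[str]) -> str:
    """Ask the LLM what is still missing and turn that into a search query."""
    prompt = (
        f"Question: {question}\n"
        "Evidence so far:\n" + "\n".join(f"- {e}" for e in evidence) + "\n"
        "What single piece of information is still missing to answer the "
        "question? Reply with a short search query for it."
    )
    return ask_llm(prompt).strip()
```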

(Simplified) Overview: between the inputs and the output sits our algorithm, using MCTS for the environment dynamics and the LLM as an agent that makes decisions about and executes actions at each decision point, i.e., node.

As components, we have the agent, represented by the LLM, along with additional models as tools that provide relevant evidence. The green and blue colors indicate which model is active at each step. In our core setup, the available actions are “Retrieve“ and “Answer“; at each node, the model can choose a fixed number of actions (in our example, two). The “Answer“ action marks a terminal state, indicating that the tree search stops at that point. As depicted, there are multiple terminal states (“Answer“ nodes) in the tree. The winning node is the terminal state with the highest total reward; the rewards thus serve to estimate the best possible sequence of actions. It is essential to note that this is a simplified case where only two types of actions are considered; however, one has the freedom to design any actions deemed appropriate for the task.

Once again, let’s discuss actions within this framework. Each action leads to a new state. In the case of “Answer“, the model is prompted to provide a final solution to the question based on the evidence collected thus far. For “Retrieve“, the process is more intricate: the reasoner formulates a textual query, which is then sent to the retriever. The retriever returns the highest-scoring evidence in text format. If the selected evidence includes images, a dedicated multimodal model extracts relevant information from those images. Since “Retrieve“ does not lead to a terminal state, the algorithm continues by asking the reasoner to select the next actions from the available set.
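
Putting both actions together, a single expansion step at a node might look like this – reusing the hypothetical `ask_llm`, `formulate_query`, and `VectorDB` helpers from the sketches above:

```python
def expand_node(question: str, evidence: list[str], db: VectorDB):
    """One agent step: the reasoner picks an action and we execute it.
    Returns (action, payload, is_terminal)."""
    action = ask_llm(
        f"Question: {question}\nEvidence: {evidence}\n"
        "Choose the next action: 'Retrieve' or 'Answer'. Reply with one word."
    ).strip()

    if action == "Answer":
        # Terminal state: answer from the evidence collected so far.
        answer = ask_llm(f"Question: {question}\nEvidence: {evidence}\nAnswer:")
        return ("Answer", answer, True)

    # "Retrieve": formulate a query and extend the working memory with the
    # highest-scoring evidence (images arrive already translated to text).
    query = formulate_query(question, evidence)
    return ("Retrieve", evidence + db.retrieve(query), False)
```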

As one can see, the model is free to gather evidence it perceives as lacking, thereby iteratively constructing a working memory. Simultaneously, the model explores multiple potential paths, rendering it more failure-tolerant than, for instance, Chain-of-Thought, where recovery from accumulated mistakes is typically challenging once derailed.

Furthermore, this framework is highly adaptable, allowing for the straightforward definition of new actions, customization of reward structures, and exploration of various parameters such as those governing the tree search. For instance, you can adjust the number of actions selected per step, thereby balancing between speed and exploration.
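
The parameters mentioned above could be collected into a small configuration object; the field names and defaults here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class SearchConfig:
    actions_per_step: int = 2   # branching factor: trades speed for exploration
    max_iterations: int = 50    # MCTS iteration budget before stopping
    exploration_c: float = 1.4  # UCT exploration constant
    top_k_evidence: int = 3     # evidence snippets returned per retrieval
```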

Results and Findings

We examined the effectiveness of our approach by comparing it against methods fine-tuned and explicitly tailored for WebQA. Our approach demonstrates comparable performance while offering greater flexibility and ease of investigation: we can analyze all actions and their associated rewards, enabling us to pinpoint where the LLM might have taken a wrong turn and make adjustments accordingly.

The integration of MCTS with LLMs in VQA represents a novel approach that offers several advantages:

1. Flexibility: Our framework can adapt to various question types and visual contexts, showcasing broad applicability across different VQA scenarios.

2. Extensibility: The architecture of our system is designed to be extended, facilitating seamless integration of additional modules or updates as the field progresses.

3. Robust Decision-Making: Through the use of MCTS, our approach excels at navigating complex decision spaces, enabling more nuanced and contextually appropriate responses. In contrast to CoT, MCTS allows the LLM to correct itself.

4. Generalizability: Unlike models that necessitate extensive fine-tuning on specific datasets, our framework maintains a degree of generalizability, performing well across diverse datasets without extensive dataset-specific optimization.

However, we also realized that, despite our advancements, challenges persist. Error propagation, for instance, occurs when a flawed query yields unreliable sources, potentially leading to erroneous conclusions. Current models also continue to struggle with accurately interpreting instructions. Moreover, hallucinations and overconfidence remain challenges, potentially skewing the accuracy of generated responses.

Broader Picture: Future Uses of LLMs

Our framework’s performance not only showcases the flexibility of LLMs but also underscores their potential to serve as foundational tools for a wide range of tasks. The ability of our framework to adapt and integrate within the scope of VQA is a testament to the versatility of LLMs. The integration of MCTS with LLMs represents a significant step towards harnessing the computational and contextual strengths of these models, demonstrating that their power extends well beyond mere language processing. Such advancements will pave the way for more sophisticated, real-life use cases, such as interactive robots, advanced medical diagnostics, and enhanced virtual assistants.

We anticipate the emergence of more sophisticated algorithms that enhance the applicability of LLMs in production environments. The implications for real-world applications are vast and varied. From improving accessibility technology to creating more immersive educational tools, the potential is boundless.

If you are interested in collaborating on use cases or exploring how LLMs can be used to improve your business, feel free to reach out to us at inovex!

In conclusion, our framework is a precursor to a broader adoption and refinement of LLMs across tasks and modalities. It is a clear indication that as we move forward, the integration of complex algorithms with LLMs will not only be common but also essential for creating versatile, efficient, and effective AI systems. We stand on the brink of a new era in AI, characterized by a profound and intuitive understanding of the world around us, driven by the remarkable advancements in LLMs.