TL;DR:
This article investigates strategies to improve the performance of Large Language Models (LLMs) by combining them and applying quantization methods. Four open-source LLMs (Llama 3.1, Phi-3.5-mini, Mistral v0.3, Zephyr Beta) were evaluated on different task types: multi-label classification, binary sentiment classification, question answering, and summarization. Simpler consensus strategies such as voting and weighting were found to outperform more complex approaches such as multi-agent debate, while also being less computationally intensive and reaching consensus faster.
The results show that combining LLMs can significantly improve performance on certain tasks, such as multi-label classification and summarization. In addition, the effectiveness of the quantization methods AWQ and GPTQ was investigated: AWQ proved more favorable for short predictions, while GPTQ can offer advantages for longer predictions, even though AWQ provides slightly better results overall.
Large Language Models (LLMs) are increasingly employed to solve a variety of tasks, such as categorizing customer feedback, summarizing lengthy research papers, and answering technical support queries, surpassing traditional methods in several areas. While conventional approaches can be resource-intensive and prone to errors, LLMs offer a more efficient alternative. However, despite their impressive performance, there is still potential for further enhancement. To maximize their capabilities, we test different strategies for boosting performance by combining LLMs. We do so by optimizing diverse methods for selecting a representative prediction from the models, without altering the models themselves, and by exploring the effects of quantization.
Evaluation Setup
Which LLMs and Datasets Are Used?
We select four open-source LLMs based on their popularity and size to evaluate the combination. These are the models we are using:
- Meta Llama 3.1 (8B)
- Microsoft Phi-3.5-mini (3.8B)
- Mistral v0.3 (7B)
- Hugging Face Zephyr Beta (7B)
We also want to cover as much ground as possible when evaluating the different strategies for boosting performance. This is why we decided on these four task types and their respective datasets:
- Multi-Label Classification (CARER)
  Sentiment classification on a collection of tweets (sadness, joy, love, anger, fear, surprise)
- Binary Classification (SST-2)
  Sentiment classification on a collection of movie reviews (positive, negative)
- Question Answering (SQuAD)
  Answering questions on Wikipedia articles by providing segments or spans from the reading passage
- Summarization (SAMSum)
  Dialogue summarization on synthetic messenger-like conversations
Which Strategies Are Used for the Combination of LLMs?
To achieve a common consensus within the combination of LLMs, different strategies are tested. Depending on the task type, the approaches can change fundamentally.
Moreover, the strategies differ in their complexity. On the one hand, there are strategies that are straightforward and can reach consensus after one or two rounds. On the other hand, more complex strategies reach an agreement only after more than ten rounds and thus come with a correspondingly increased runtime. Ultimately, the best strategy will be used for the final evaluation of boosting performance by combining LLMs.
Voting
In this strategy, we first hold a majority vote among the models in the combination. If there is a tie, we add up the weights of the models/metrics for identical answers, and the prediction with the highest weight is selected. More information about the assigned weights can be found in the section about Weighting. If there is still a tie, we choose the final prediction randomly. This strategy is applied to the task types Classification and Question Answering.
The following diagram depicts a simple voting process example in the Multi-Label Classification scenario. On the left-hand side, the predictions of the four models, together with their respective weights, are shown. To determine the representative answer, we simply count which class has been predicted the most. In this case, the class Joy received two votes, while the classes Sadness and Fear received only one. Therefore, Joy gets selected as the common consensus.
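To make the tie-breaking logic concrete, here is a minimal Python sketch of the voting procedure. The weights are illustrative; how they are obtained is explained in the Weighting section.

```python
import random
from collections import Counter

def vote(predictions, weights):
    """Majority vote with weight-based tie-breaking."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [p for p, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]                      # clear majority
    # Tie: sum the weights of the models behind each tied answer.
    totals = {p: sum(w for q, w in zip(predictions, weights) if q == p) for p in tied}
    best = max(totals.values())
    finalists = [p for p, s in totals.items() if s == best]
    return random.choice(finalists)         # still tied: pick randomly

# The diagram's example: Joy wins with two of four votes.
print(vote(["joy", "joy", "sadness", "fear"], [0.20, 0.15, 0.25, 0.40]))  # -> joy
```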
Weighting
The Weighting strategy differs from Voting in that there is no initial majority vote. For identical answers, we simply add up the weights of the models/metrics, and the prediction with the highest weight wins. If there is a tie, we choose the final prediction randomly. This strategy is applied to the task types Classification, Question Answering, and Summarization.
We assign weights differently for the various task types. For example, we assign weights to the LLMs themselves for Classification tasks, since the range of possible answers is set, and we want to determine the importance of each model when contributing to the combination. For Question Answering and Summarization tasks, we define and weight metrics, such as Cosine Similarity, that quantitatively capture the provided responses, because the content of the predictions can be arbitrary, and we want to be able to determine which of the given answers is the most representative. We then optimize and normalize the weights on hold-out validation data for each applicable strategy and determine which one yields the best results.
The following diagram shows the same setting as above, however, now using the Weighting strategy. Again, the predictions and weights of the four models are shown. To reach consensus, we now sum up the weights for identical predictions. Here, we find that Sadness yields a weight of 0.25, and even though Joy gets predicted twice, the summed weight value of 0.35 does not exceed the weight value for Fear with 0.40. For this reason, Fear gets selected as the common consensus.
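A minimal sketch of this strategy, using the same illustrative predictions and weights as above:

```python
import random

def weighted_consensus(predictions, weights):
    """Sum the weights of identical answers; the highest total wins."""
    totals = {}
    for p, w in zip(predictions, weights):
        totals[p] = totals.get(p, 0.0) + w
    best = max(totals.values())
    finalists = [p for p, s in totals.items() if s == best]
    return random.choice(finalists)         # tie: pick randomly

# Fear (0.40) beats Joy (0.20 + 0.15 = 0.35) despite fewer votes.
print(weighted_consensus(["joy", "joy", "sadness", "fear"], [0.20, 0.15, 0.25, 0.40]))  # -> fear
```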
Second Run on Ties
The "pure" Voting and Weighting strategies are expanded for Classification tasks with a second prediction round on entries where there is a tie. This means that we requeue the same entries for LLM processing. For Multi-Label Classification, we also distinguish between two input variants: in one, all classes are available to the LLMs in the second prediction round; in the other, only the classes predicted on this entry are available, thus limiting the selection. This strategy is applied to the task type Classification.
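A minimal, hypothetical sketch of requeueing a tied entry; the `model.predict` interface here is an assumption for illustration, not our actual pipeline:

```python
def second_run(entry, models, first_predictions, restrict_classes=True):
    """Requeue one tied entry for a second prediction round.

    With restrict_classes=True, only the classes predicted in round one
    are offered to the models again (the restricted input variant).
    """
    allowed = sorted(set(first_predictions)) if restrict_classes else None
    return [m.predict(entry, classes=allowed) for m in models]  # hypothetical interface
```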
Confidence Filtering
In addition, the strategies can be expanded with confidence filtering. Here, the LLMs don't just give their predictions; they also return a normalized confidence score for each answer, which each model determines itself, indicating how sure it is. Uncertain predictions are then filtered out, and consensus is reached on the reduced set. This strategy is applied to the task types Classification, Question Answering, and Summarization.
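As a sketch, assuming each model returns a normalized confidence score alongside its answer (the threshold and scores below are illustrative; in practice a threshold would be tuned on validation data):

```python
def filter_by_confidence(predictions, weights, confidences, threshold=0.7):
    """Drop predictions below the confidence threshold, then run
    weighted consensus on the remaining subset."""
    kept = [(p, w) for p, w, c in zip(predictions, weights, confidences) if c >= threshold]
    if not kept:                      # every model uncertain: fall back to the full set
        kept = list(zip(predictions, weights))
    totals = {}
    for p, w in kept:                 # sum weights per identical answer
        totals[p] = totals.get(p, 0.0) + w
    return max(totals, key=totals.get)

# Fear is filtered out (confidence 0.4 < 0.7), so Joy wins on the reduced set.
print(filter_by_confidence(
    ["joy", "joy", "sadness", "fear"],
    [0.20, 0.15, 0.25, 0.40],
    [0.9, 0.8, 0.75, 0.4],
))  # -> joy
```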
Multi-Agent Debate (MAD)
In this strategy, the LLMs also give reasons why they think their answer is correct. Then, in a randomly ordered and anonymous manner, all LLMs in the combination exchange their predictions and reasoning, so each model receives three answer-and-reasoning pairs. Based on this new information, the models generate a new prediction and reasoning in the next round (note that they can also decide to stick with their previous answers). This process is repeated until they reach 90% consensus (meaning that on 90% of the entries in a dataset, all models agree on the same answer) or no consensus increase is detected in a subsequent round. This strategy is applied to the task type Classification. The following diagram displays the process described above.
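Beyond the diagram, the debate loop can be sketched as follows. The `model(prompt, peers)` interface is a hypothetical stand-in, and since the 90% criterion applies at dataset level, this per-entry sketch simply stops on full agreement or stagnation:

```python
import random

def multi_agent_debate(models, prompt, max_rounds=10):
    """Sketch of the debate loop for a single entry.

    Each element of `models` is assumed to be a callable
    model(prompt, peers) -> (answer, reasoning).
    """
    answers = [m(prompt, peers=[]) for m in models]          # round 1: independent answers
    for _ in range(max_rounds):
        if len({a for a, _ in answers}) == 1:                # full agreement on this entry
            break
        new_answers = []
        for i, m in enumerate(models):
            peers = answers[:i] + answers[i + 1:]            # the other models' pairs
            random.shuffle(peers)                            # randomly ordered, anonymous
            new_answers.append(m(prompt, peers=peers))       # may revise or keep its answer
        if new_answers == answers:                           # no change: consensus stalled
            break
        answers = new_answers
    return answers
```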
Consensus Results When Combining LLMs
Did We Succeed in Boosting Performance by Combining LLMs?
From our evaluation, we can conclude that boosting performance by combining LLMs highly depends on the task. In tasks like Multi-Label Classification and Summarization, we observed a noticeable improvement when multiple LLMs were used together; on some metrics, the combination even improved significantly over the best individual model. This suggests that, for certain tasks, combining models can lead to better results than relying on just one.
| Task Type | Metric | Best Combination | Best Single Model |
|---|---|---|---|
| Multi-Label Classification | F1-Score | 0.5906 | 0.5777 |
| Binary Classification | F1-Score | 0.9239 | 0.9252 |
| Question Answering | Exact Match | 0.5860 | 0.6341 |
| Summarization | ROUGE-L | 0.3494 | 0.3464 |
The reasons for the decrease in performance in some tasks can be diverse. One possible explanation is that increased noise and complexity, along with potential overfitting, lead to inconsistent outcomes and thus to wrong final answers. Another likely reason is that the models in the combination have different foci, which hinders the better models from performing at their maximum. Furthermore, consensus strategies like Voting can lead to worse performance if LLMs that give the same wrong prediction prevail. To further boost performance, especially in tasks where the combination did not outperform the best single model, there are various promising steps to take.
Possible approaches include covering an even broader spectrum of consensus-building techniques to select optimal responses, utilizing diverse model constellations to enhance response variance and creativity, and experimenting with various shot configurations and prompts to optimize task handling and evaluation outcomes.
Are Less Complex Strategies Beneficial?
Our evaluation reveals that simpler consensus strategies outperform the Multi-Agent Debate approach. Although this comparison covers only the Classification datasets, methods like Voting and Weighting achieve better results. They are also less computationally intensive and reach consensus more quickly.
| Task Type | Metric | Best Strategy | MAD Approach |
|---|---|---|---|
| Multi-Label Classification | F1-Score | 0.5906 | 0.5803 |
| Binary Classification | F1-Score | 0.9239 | 0.9208 |
LLM Quantization Methods
In addition to evaluating the strategies themselves, our goal is to compare two widely used quantization methods for LLMs, AWQ and GPTQ. Based on the results, we make recommendations for choosing a quantization method depending on the scenario and its benefits.
In general, LLM quantization is a technique used to make LLMs faster and more efficient without compromising too much on their accuracy. Essentially, it involves reducing the precision of the parameters/weights the model uses to represent data. Think of it like rounding off decimals to make calculations quicker. This process helps models run faster and use less memory, which is especially important when dealing with large datasets and limited hardware resources. Depending on the task, different quantization methods can be used to balance accuracy and efficiency.
In our evaluation, we use post-training quantization approaches. As the name suggests, post-training quantization is a technique used to quantize a model after it has already been trained.
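To illustrate the "rounding" idea, here is a minimal sketch of plain round-to-nearest 4-bit quantization. Note that this is the naive baseline, not AWQ or GPTQ themselves:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric round-to-nearest quantization to 4-bit integers."""
    scale = np.abs(w).max() / 7          # symmetric int4 range: [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale                      # store 4-bit codes plus one fp scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale  # approximate reconstruction of the weights

w = np.array([0.31, -0.12, 0.88, -0.45], dtype=np.float32)
q, s = quantize_int4(w)
print(q, dequantize(q, s))               # small rounding error vs. the original weights
```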
AWQ
AWQ (Activation-Aware Weight Quantization) makes a model faster and more efficient by reducing the precision of its weights. It does so carefully, following these principles:
- Activation-Aware
  Traditional quantization methods treat weights and activations independently, meaning they don't consider how the activations behave when adjusting the weights. In AWQ, the quantization process is aware of the activation patterns: it looks at how data flows through the model and adjusts the weights accordingly.
- Optimizing Precision
  AWQ doesn't reduce the precision of the weights indiscriminately. Instead, it considers how much precision each weight needs based on the activations it will interact with, per channel. In general, more important weights (with respect to the activations) get "protected" through per-channel scaling so that they retain more precision. This way, the quantization process keeps as much useful information as possible while still benefiting from the speed and memory savings of reduced precision.
Given these properties, AWQ can be especially beneficial in large-scale applications, for example when deploying LLMs on resource-constrained devices or in cloud environments where speed and memory efficiency are critical.
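As a toy illustration of the activation-aware idea (not the actual AWQ implementation, which searches for the best scaling exponent and quantizes in groups), consider the following sketch:

```python
import numpy as np

def awq_style_quantize(W, act_samples, alpha=0.5):
    """Activation-aware per-channel scaling before 4-bit rounding.

    W: (out_features, in_features) weight matrix.
    act_samples: (n_samples, in_features) calibration activations.
    alpha: smoothing exponent; the real method searches for a good value.
    """
    importance = np.abs(act_samples).mean(axis=0)    # per-channel activation magnitude
    s = importance ** alpha
    s = s / s.mean()                                 # normalize the scales
    W_scaled = W * s                                 # scale important channels up ...
    step = np.abs(W_scaled).max() / 7                # shared int4 step for simplicity
    Q = np.clip(np.round(W_scaled / step), -8, 7)
    return (Q * step) / s                            # ... and fold the scaling back out

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(32, 8)) * np.array([5, 1, 1, 1, 1, 1, 1, 1])  # channel 0 is "salient"
W_hat = awq_style_quantize(W, X)
print(np.abs(W - W_hat).mean(axis=0))    # channel 0 typically shows the smallest error
```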
GPTQ
GPTQ (Post-Training Quantization for Generative Pre-trained Transformers) is an optimization technique that uses approximate second-order information to minimize the loss of information during the quantization of weights. Here's a breakdown of how GPTQ works:
- Optimization of Quantization Error
  The core of GPTQ is its focus on reducing the quantization error. When you reduce the precision of a model's weights, you lose some detail, which can lead to inaccuracies. GPTQ minimizes the loss of important information step by step: it carefully decides how to lower precision in a way that causes the least harm to the model's overall performance.
- Batch Quantization
  To optimize efficiency, GPTQ processes weights in batches, handling multiple weights at once and reducing the overall computation time. Furthermore, instead of quantizing the entire model at once, GPTQ applies layer-wise compression. Each layer's weights are analyzed individually, taking their specific weight distributions into account. This tailored approach ensures that each layer is quantized in the most efficient way, preserving the model's performance across layers.
Through these techniques, GPTQ optimizes the quantization process by minimizing error, improving efficiency, and maintaining model accuracy. This makes it a highly effective approach for compressing LLMs without significant performance loss.
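The error-compensation idea can be sketched for a single weight row as follows. The real algorithm additionally works on Cholesky factors of the inverse Hessian and processes weights in batches for efficiency; this simplified loop is only illustrative:

```python
import numpy as np

def gptq_style_quantize_row(w, X, damp=0.01):
    """Quantize one weight row, compensating each rounding error.

    w: (in_features,) weights; X: (n_samples, in_features) calibration inputs.
    """
    d = len(w)
    H = X.T @ X + damp * np.eye(d)          # proxy Hessian of the layer's output error
    Hinv = np.linalg.inv(H)
    w = w.astype(np.float64).copy()
    step = np.abs(w).max() / 7              # shared int4 step for simplicity
    q = np.zeros_like(w)
    for i in range(d):
        q[i] = np.clip(np.round(w[i] / step), -8, 7) * step
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]  # push the error onto not-yet-quantized weights
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
w = rng.normal(size=16)
q = gptq_style_quantize_row(w, X)
print(np.mean((X @ w - X @ q) ** 2))        # output error stays small despite 4-bit weights
```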
Quantization Results When Combining LLMs
Which Quantization Method Is Better?
Overall, AWQ performs slightly better than GPTQ, although performance again depends heavily on the task. AWQ models provide significantly faster inference for short predictions and generally yield better results. GPTQ models, in turn, sometimes offer significantly faster inference for long predictions, but tend to deliver slightly poorer results in most cases.
| Task Type | Metric | Better Method | Significant |
|---|---|---|---|
| Multi-Label Classification | Accuracy | AWQ | ✗ |
| Multi-Label Classification | Ø Inference Time | AWQ | ✓ |
| Binary Classification | Accuracy | GPTQ | ✗ |
| Binary Classification | Ø Inference Time | AWQ | ✓ |
| Question Answering | Ø Performance | AWQ | ✗ |
| Question Answering | Ø Inference Time | GPTQ | ✓ |
| Summarization | Ø Performance | AWQ | ✗ |
| Summarization | Ø Inference Time | GPTQ | ✗ |
Summary and Outlook
All in all, the results show that combining LLMs can be beneficial in certain contexts, effectively boosting performance. Particularly in tasks like Multi-Label Classification and Summarization, the combination outperformed the best single model on several metrics.
We also saw that simpler consensus-building methods can surpass more complex approaches like MAD in terms of resulting performance. The reduced inference time of simpler methods, which can reach consensus within one or two rounds, makes them particularly effective in a wide range of scenarios.
On top of that, the analysis of quantization methods revealed that AWQ models generally offer a better balance between efficiency and performance compared to GPTQ models. We saw that the AWQ approach is particularly advantageous for tasks that require short responses, such as Classification. However, GPTQ may be preferred for tasks needing longer answers, like Summarization.
Moving forward, exploring additional datasets and task types can further validate these findings and potentially uncover new patterns. Enhancing model diversity and experimenting with different quantization techniques could lead to even greater performance improvements. Moreover, user tests in practical applications can provide valuable feedback on the real-world applicability and benefits of these approaches.
Sources
- F. Gilardi, M. Alizadeh, and M. Kubli, 2023: ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks
- Y. Shen et al., 2023: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- A. Amirkhani and A. H. Barshooi, 2022: Consensus in Multi-Agent Systems: A Review
- Y. Du et al., 2023: Improving Factuality and Reasoning in Language Models through Multiagent Debate
- J. Lin et al., 2024: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- E. Frantar et al., 2023: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers