TL;DR:
This article investigates strategies to improve the performance of Large Language Models (LLMs) by combining them and applying quantization methods. Four open-source LLMs (Llama 3.1, Phi-3.5-mini, Mistral v0.3, Zephyr Beta) were evaluated on different task types: multi-label classification, binary sentiment classification, question answering, and summarization. Simpler consensus strategies such as voting and weighting were found to outperform more complex approaches such as multi-agent debate, while also being less computationally intensive and reaching consensus faster.
The results show that combining LLMs can significantly improve performance on certain tasks, such as multi-label classification and summarization. In addition, the effectiveness of the quantization methods AWQ and GPTQ was investigated: AWQ proved more favorable for short predictions, while GPTQ can offer advantages for longer predictions, even though AWQ provides slightly better results overall.
Large Language Models (LLMs) are increasingly employed to solve a variety of tasks, such as categorizing customer feedback, summarizing lengthy research papers, and answering technical support queries, surpassing traditional methods in several areas. While conventional approaches can be resource-intensive and prone to errors, LLMs offer a more efficient alternative. However, despite their impressive performance, there is still potential for further enhancement. To maximize their capabilities, we test different strategies for boosting performance by combining LLMs. We do so by optimizing diverse methods for selecting a representative prediction from the models, without altering the models themselves, and by exploring the effects of quantization.
Evaluation Setup
Which LLMs and Datasets Are Used?
We select four open-source LLMs based on their popularity and size to evaluate the combination. These are the models we are using:
- Meta Llama 3.1 (8B)
- Microsoft Phi-3.5-mini (3.8B)
- Mistral v0.3 (7B)
- Hugging Face Zephyr Beta (7B)
We also want to cover as much ground as possible when evaluating the different strategies for boosting performance. This is why we decided on these four task types and their respective datasets:
- Multi-Label Classification (CARER)
  Sentiment classification on a collection of tweets (sadness, joy, love, anger, fear, surprise)
- Binary Classification (SST-2)
  Sentiment classification on a collection of movie reviews (positive, negative)
- Question Answering (SQuAD)
  Answering questions on Wikipedia articles by providing segments or spans from the reading passage
- Summarization (SAMSum)
  Dialogue summarization on synthetic messenger-like conversations
Which Strategies Are Used for the Combination of LLMs?
To achieve a common consensus within the combination of LLMs, different strategies are tested. Depending on the task type, the approaches can change fundamentally.
Moreover, the strategies differ in their complexity. On the one hand, there are strategies that are straightforward and can reach consensus after one or two rounds. On the other hand, more complex strategies reach an agreement only after more than ten rounds and thus come with a correspondingly increased runtime. Ultimately, the best strategy will be used for the final evaluation of boosting performance by combining LLMs.
Voting
In this strategy, we first hold a majority vote among the models in the combination. If there is a tie, we add up the weights of the models/metrics for identical answers, and the prediction with the highest weight is selected. More information about the assigned weights can be found in the section about Weighting. If there is still a tie, we choose the final prediction randomly. This strategy is applied to the task types Classification and Question Answering.
The following diagram depicts a simple voting process example in the Multi-Label Classification scenario. On the left-hand side, the predictions of the four models, together with their respective weights, are shown. To determine the representative answer, we simply count which class has been predicted the most. In this case, the class Joy received two votes, while the classes Sadness and Fear received only one. Therefore, Joy gets selected as the common consensus.
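To make the tie-breaking logic concrete, here is a minimal Python sketch of the voting procedure. The weights are illustrative; how they are obtained is explained in the Weighting section.

```python
import random
from collections import Counter

def vote(predictions, weights):
    """Majority vote with weight-based tie-breaking."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [p for p, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]                      # clear majority
    # Tie: sum the weights of the models behind each tied answer.
    totals = {p: sum(w for q, w in zip(predictions, weights) if q == p) for p in tied}
    best = max(totals.values())
    finalists = [p for p, s in totals.items() if s == best]
    return random.choice(finalists)         # still tied: pick randomly

# The diagram's example: Joy wins with two of four votes.
print(vote(["joy", "joy", "sadness", "fear"], [0.20, 0.15, 0.25, 0.40]))  # -> joy
```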
Weighting
The Weighting strategy differs from Voting in that there is no initial majority vote. For identical answers, we simply add up the weights of the models/metrics, and the prediction with the highest weight wins. If there is a tie, we choose the final prediction randomly. This strategy is applied to the task types Classification, Question Answering, and Summarization.
We assign weights differently for the various task types. For example, we assign weights to the LLMs themselves for Classification tasks, since the range of possible answers is set, and we want to determine the importance of each model when contributing to the combination. For Question Answering and Summarization tasks, we define and weight metrics, such as Cosine Similarity, that quantitatively capture the provided responses, because the content of the predictions can be arbitrary, and we want to be able to determine which of the given answers is the most representative. We then optimize and normalize the weights on hold-out validation data for each applicable strategy and determine which one yields the best results.
The following diagram shows the same setting as above, however, now using the Weighting strategy. Again, the predictions and weights of the four models are shown. To reach consensus, we now sum up the weights for identical predictions. Here, we find that Sadness yields a weight of 0.25, and even though Joy gets predicted twice, the summed weight value of 0.35 does not exceed the weight value for Fear with 0.40. For this reason, Fear gets selected as the common consensus.
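A minimal sketch of this strategy, using the same illustrative predictions and weights as above:

```python
import random

def weighted_consensus(predictions, weights):
    """Sum the weights of identical answers; the highest total wins."""
    totals = {}
    for p, w in zip(predictions, weights):
        totals[p] = totals.get(p, 0.0) + w
    best = max(totals.values())
    finalists = [p for p, s in totals.items() if s == best]
    return random.choice(finalists)         # tie: pick randomly

# Fear (0.40) beats Joy (0.20 + 0.15 = 0.35) despite fewer votes.
print(weighted_consensus(["joy", "joy", "sadness", "fear"], [0.20, 0.15, 0.25, 0.40]))  # -> fear
```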
Second Run on Ties
The "pure" Voting and Weighting strategies are expanded for Classification tasks with a second prediction round on entries where there is a tie. This means that we requeue the same entries for LLM processing. For Multi-Label Classification, we also distinguish between two input variants: in one, all classes are available to the LLMs in the second prediction round; in the other, only the classes predicted on this entry are available, thus limiting the selection. This strategy is applied to the task type Classification.
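A minimal, hypothetical sketch of requeueing a tied entry; the `model.predict` interface here is an assumption for illustration, not our actual pipeline:

```python
def second_run(entry, models, first_predictions, restrict_classes=True):
    """Requeue one tied entry for a second prediction round.

    With restrict_classes=True, only the classes predicted in round one
    are offered to the models again (the restricted input variant).
    """
    allowed = sorted(set(first_predictions)) if restrict_classes else None
    return [m.predict(entry, classes=allowed) for m in models]  # hypothetical interface
```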
Confidence Filtering
In addition, the strategies can be expanded with confidence filtering. Here, the LLMs don't just give their predictions; they also return a normalized confidence score for each answer, which each model determines itself, indicating how sure it is. Uncertain predictions are then filtered out, and consensus is reached on the reduced set. This strategy is applied to the task types Classification, Question Answering, and Summarization.
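As a sketch, assuming each model returns a normalized confidence score alongside its answer (the threshold and scores below are illustrative; in practice a threshold would be tuned on validation data):

```python
def filter_by_confidence(predictions, weights, confidences, threshold=0.7):
    """Drop predictions below the confidence threshold, then run
    weighted consensus on the remaining subset."""
    kept = [(p, w) for p, w, c in zip(predictions, weights, confidences) if c >= threshold]
    if not kept:                      # every model uncertain: fall back to the full set
        kept = list(zip(predictions, weights))
    totals = {}
    for p, w in kept:                 # sum weights per identical answer
        totals[p] = totals.get(p, 0.0) + w
    return max(totals, key=totals.get)

# Fear is filtered out (confidence 0.4 < 0.7), so Joy wins on the reduced set.
print(filter_by_confidence(
    ["joy", "joy", "sadness", "fear"],
    [0.20, 0.15, 0.25, 0.40],
    [0.9, 0.8, 0.75, 0.4],
))  # -> joy
```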
Multi-Agent Debate (MAD)
In this strategy, the LLMs also give reasons why they think their answer is correct. Then, in a randomly ordered and anonymous manner, all LLMs in the combination exchange their predictions and reasoning, so each model receives three answer-and-reasoning pairs. Based on this new information, the models generate a new prediction and reasoning in the next round (note that they can also decide to stick with their previous answers). This process is repeated until they reach 90% consensus (meaning that on 90% of the entries in a dataset, all models agree on the same answer) or no consensus increase is detected in a subsequent round. This strategy is applied to the task type Classification. The following diagram displays the process described above.
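Beyond the diagram, the debate loop can be sketched as follows. The `model(prompt, peers)` interface is a hypothetical stand-in, and since the 90% criterion applies at dataset level, this per-entry sketch simply stops on full agreement or stagnation:

```python
import random

def multi_agent_debate(models, prompt, max_rounds=10):
    """Sketch of the debate loop for a single entry.

    Each element of `models` is assumed to be a callable
    model(prompt, peers) -> (answer, reasoning).
    """
    answers = [m(prompt, peers=[]) for m in models]          # round 1: independent answers
    for _ in range(max_rounds):
        if len({a for a, _ in answers}) == 1:                # full agreement on this entry
            break
        new_answers = []
        for i, m in enumerate(models):
            peers = answers[:i] + answers[i + 1:]            # the other models' pairs
            random.shuffle(peers)                            # randomly ordered, anonymous
            new_answers.append(m(prompt, peers=peers))       # may revise or keep its answer
        if new_answers == answers:                           # no change: consensus stalled
            break
        answers = new_answers
    return answers
```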
Consensus Results When Combining LLMs
Did We Succeed in Boosting Performance by Combining LLMs?
From our evaluation, we can conclude that boosting performance by combining LLMs highly depends on the task. In tasks like Multi-Label Classification and Summarization, we observed a noticeable improvement when multiple LLMs were used together; on some metrics, the combination even improved significantly over the best individual model. This suggests that, for certain tasks, combining models can lead to better results than relying on just one.
| Task Type | Metric | Best Combination | Best Single Model |
|---|---|---|---|
| Multi-Label Classification | F1-Score | 0.5906 | 0.5777 |
| Binary Classification | F1-Score | 0.9239 | 0.9252 |
| Question Answering | Exact Match | 0.5860 | 0.6341 |
| Summarization | ROUGE-L | 0.3494 | 0.3464 |
The reasons for the decrease in performance in some tasks can be diverse. One possible explanation is that increased noise and complexity, along with potential overfitting, lead to inconsistent outcomes and thus to wrong final answers. Another likely reason is that the models in the combination have different foci, which hinders the better models from performing at their maximum. Furthermore, consensus strategies like Voting can lead to worse performance if LLMs that give the same wrong prediction prevail. To further boost performance, especially in tasks where the combination did not outperform the best single model, there are various promising steps to take.
Possible approaches include covering an even broader spectrum of consensus-building techniques to select optimal responses, utilizing diverse model constellations to enhance response variance and creativity, and experimenting with various shot configurations and prompts to optimize task handling and evaluation outcomes.
Are Less Complex Strategies Beneficial?
Our evaluation reveals that simpler consensus strategies outperform the Multi-Agent Debate approach. Although this comparison covers only the Classification datasets, methods like Voting and Weighting achieve better results. They are also less computationally intensive and reach consensus more quickly.
| Task Type | Metric | Best Strategy | MAD Approach |
|---|---|---|---|
| Multi-Label Classification | F1-Score | 0.5906 | 0.5803 |
| Binary Classification | F1-Score | 0.9239 | 0.9208 |
LLM Quantization Methods
In addition to evaluating the strategies themselves, our goal is to compare two widely used quantization methods for LLMs, AWQ and GPTQ. Based on the results, we make recommendations for choosing a quantization method depending on the scenario and its benefits.
In general, LLM quantization is a technique used to make LLMs faster and more efficient without compromising too much on their accuracy. Essentially, it involves reducing the precision of the parameters/weights the model uses to represent data. Think of it like rounding off decimals to make calculations quicker. This process helps models run faster and use less memory, which is especially important when dealing with large datasets and limited hardware resources. Depending on the task, different quantization methods can be used to balance accuracy and efficiency.
In our evaluation, we use post-training quantization approaches. As the name suggests, post-training quantization is a technique used to quantize a model after it has already been trained.
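To illustrate the "rounding" idea, here is a minimal sketch of plain round-to-nearest 4-bit quantization. Note that this is the naive baseline, not AWQ or GPTQ themselves:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric round-to-nearest quantization to 4-bit integers."""
    scale = np.abs(w).max() / 7          # symmetric int4 range: [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale                      # store 4-bit codes plus one fp scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale  # approximate reconstruction of the weights

w = np.array([0.31, -0.12, 0.88, -0.45], dtype=np.float32)
q, s = quantize_int4(w)
print(q, dequantize(q, s))               # small rounding error vs. the original weights
```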
AWQ
AWQ (Activation-Aware Weight Quantization) makes a model faster and more efficient by reducing the precision of its weights. It does so carefully, following these principles:
- Activation-Aware
  Traditional quantization methods treat weights and activations independently, meaning they don't consider how the activations behave when adjusting the weights. In AWQ, the quantization process is aware of the activation patterns: it looks at how data flows through the model and adjusts the weights accordingly.
- Optimizing Precision
  AWQ doesn't reduce the precision of the weights indiscriminately. Instead, it considers how much precision each weight needs based on the activations it will interact with, per channel. In general, more important weights (with respect to the activations) get "protected" through per-channel scaling so that they retain more precision. This way, the quantization process keeps as much useful information as possible while still benefiting from the speed and memory savings of reduced precision.
Given these properties, AWQ can be especially beneficial in large-scale applications, for example when deploying LLMs on resource-constrained devices or in cloud environments where speed and memory efficiency are critical.
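As a toy illustration of the activation-aware idea (not the actual AWQ implementation, which searches for the best scaling exponent and quantizes in groups), consider the following sketch:

```python
import numpy as np

def awq_style_quantize(W, act_samples, alpha=0.5):
    """Activation-aware per-channel scaling before 4-bit rounding.

    W: (out_features, in_features) weight matrix.
    act_samples: (n_samples, in_features) calibration activations.
    alpha: smoothing exponent; the real method searches for a good value.
    """
    importance = np.abs(act_samples).mean(axis=0)    # per-channel activation magnitude
    s = importance ** alpha
    s = s / s.mean()                                 # normalize the scales
    W_scaled = W * s                                 # scale important channels up ...
    step = np.abs(W_scaled).max() / 7                # shared int4 step for simplicity
    Q = np.clip(np.round(W_scaled / step), -8, 7)
    return (Q * step) / s                            # ... and fold the scaling back out

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(32, 8)) * np.array([5, 1, 1, 1, 1, 1, 1, 1])  # channel 0 is "salient"
W_hat = awq_style_quantize(W, X)
print(np.abs(W - W_hat).mean(axis=0))    # channel 0 typically shows the smallest error
```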
GPTQ
GPTQ (Post-Training Quantization for Generative Pre-trained Transformers) is an optimization technique that uses approximate second-order information to minimize the loss of information during the quantization of weights. Here's a breakdown of how GPTQ works:
- Optimization of Quantization Error
  The core of GPTQ is its focus on reducing the quantization error. When you reduce the precision of a model's weights, you lose some detail, which can lead to inaccuracies. GPTQ minimizes the loss of important information step by step: it carefully decides how to lower precision in a way that causes the least harm to the model's overall performance.
- Batch Quantization
  To optimize efficiency, GPTQ processes weights in batches, handling multiple weights at once and reducing the overall computation time. Furthermore, instead of quantizing the entire model at once, GPTQ applies layer-wise compression. Each layer's weights are analyzed individually, taking their specific weight distributions into account. This tailored approach ensures that each layer is quantized in the most efficient way, preserving the model's performance across layers.
Through these techniques, GPTQ optimizes the quantization process by minimizing error, improving efficiency, and maintaining model accuracy. This makes it a highly effective approach for compressing LLMs without significant performance loss.
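The error-compensation idea can be sketched for a single weight row as follows. The real algorithm additionally works on Cholesky factors of the inverse Hessian and processes weights in batches for efficiency; this simplified loop is only illustrative:

```python
import numpy as np

def gptq_style_quantize_row(w, X, damp=0.01):
    """Quantize one weight row, compensating each rounding error.

    w: (in_features,) weights; X: (n_samples, in_features) calibration inputs.
    """
    d = len(w)
    H = X.T @ X + damp * np.eye(d)          # proxy Hessian of the layer's output error
    Hinv = np.linalg.inv(H)
    w = w.astype(np.float64).copy()
    step = np.abs(w).max() / 7              # shared int4 step for simplicity
    q = np.zeros_like(w)
    for i in range(d):
        q[i] = np.clip(np.round(w[i] / step), -8, 7) * step
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]  # push the error onto not-yet-quantized weights
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
w = rng.normal(size=16)
q = gptq_style_quantize_row(w, X)
print(np.mean((X @ w - X @ q) ** 2))        # output error stays small despite 4-bit weights
```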
Quantization Results When Combining LLMs
Which Quantization Method Is Better?
Overall, AWQ performs slightly better than GPTQ, although performance again depends heavily on the task. AWQ models provide significantly faster inference for short predictions and generally yield better results. GPTQ models, in turn, sometimes offer significantly faster inference for long predictions, but tend to deliver slightly poorer results in most cases.
| Task Type | Metric | Better Method | Significant |
|---|---|---|---|
| Multi-Label Classification | Accuracy | AWQ | ✗ |
| Multi-Label Classification | Ø Inference Time | AWQ | ✓ |
| Binary Classification | Accuracy | GPTQ | ✗ |
| Binary Classification | Ø Inference Time | AWQ | ✓ |
| Question Answering | Ø Performance | AWQ | ✗ |
| Question Answering | Ø Inference Time | GPTQ | ✓ |
| Summarization | Ø Performance | AWQ | ✗ |
| Summarization | Ø Inference Time | GPTQ | ✗ |
Summary and Outlook
All in all, the results show that combining LLMs can be beneficial in certain contexts, effectively boosting performance. Particularly in tasks like Multi-Label Classification and Summarization, the combination outperformed the best single model on several metrics.
We also saw that simpler consensus-building methods can surpass more complex approaches like MAD in terms of resulting performance. The reduced inference time of simpler methods, which can reach consensus within one or two rounds, makes them particularly effective in a wide range of scenarios.
On top of that, the analysis of quantization methods revealed that AWQ models generally offer a better balance between efficiency and performance compared to GPTQ models. We saw that the AWQ approach is particularly advantageous for tasks that require short responses, such as Classification. However, GPTQ may be preferred for tasks needing longer answers, like Summarization.
Moving forward, exploring additional datasets and task types can further validate these findings and potentially uncover new patterns. Enhancing model diversity and experimenting with different quantization techniques could lead to even greater performance improvements. Moreover, user tests in practical applications can provide valuable feedback on the real-world applicability and benefits of these approaches.
Sources
- F. Gilardi, M. Alizadeh, and M. Kubli, 2023: ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks
- Y. Shen et al., 2023: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- A. Amirkhani and A. H. Barshooi, 2022: Consensus in Multi-Agent Systems: A Review
- Y. Du et al., 2023: Improving Factuality and Reasoning in Language Models through Multiagent Debate
- J. Lin et al., 2024: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- E. Frantar et al., 2023: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers