{"id":61342,"date":"2025-06-24T08:37:38","date_gmt":"2025-06-24T06:37:38","guid":{"rendered":"https:\/\/www.inovex.de\/?p=61342"},"modified":"2025-06-24T08:37:38","modified_gmt":"2025-06-24T06:37:38","slug":"combining-and-quantizing-llms-for-enhanced-performance","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/","title":{"rendered":"Combining and Quantizing LLMs for Enhanced Performance"},"content":{"rendered":"<p>Large Language Models (LLMs) are increasingly being used in automated environments to solve a wide range of tasks, such as categorizing customer feedback, summarizing lengthy research papers, and answering technical support queries.<br \/>\nIn this article, we test different strategies for boosting performance by combining LLMs.<br \/>\n<!--more--><\/p>\n<p>LLMs are increasingly employed to solve a variety of tasks, such as categorizing customer feedback, summarizing lengthy research papers, and answering technical support queries, surpassing traditional methods in several areas. While conventional approaches can be resource-intensive and prone to errors, LLMs offer a more efficient alternative. However, despite their impressive performance, there is still potential for further enhancement. To maximize their capabilities, we test different strategies for boosting performance by combining LLMs. 
We do so by optimizing diverse methods to select the representative prediction from the models without altering them, together with exploring the effects of quantization.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Evaluation-Setup\" >Evaluation Setup<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Which-LLMs-and-Datasets-Are-Used\" >Which LLMs and Datasets Are Used?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Which-Strategies-Are-Used-for-the-Combination-of-LLMs\" >Which Strategies Are Used for the Combination of LLMs?<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Voting\" >Voting<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Weighting\" >Weighting<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Second-Run-on-Ties\" >Second Run on Ties<\/a><\/li><li 
class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Confidence-Filtering\" >Confidence Filtering<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Multi-Agent-Debate-MAD\" >Multi-Agent Debate (MAD)<\/a><\/li><\/ul><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Consensus-Results-When-Combining-LLMs\" >Consensus Results When Combining LLMs<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Did-We-Succeed-in-Boosting-Performance-by-Combining-LLMs\" >Did We Succeed in Boosting Performance by Combining LLMs?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Are-Less-Complex-Strategies-Beneficial\" >Are Less Complex Strategies Beneficial?<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#LLM-Quantization-Methods\" >LLM Quantization Methods<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#AWQ\" >AWQ<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-14\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#GPTQ\" >GPTQ<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Quantization-Results-When-Combining-LLMs\" >Quantization Results When Combining LLMs<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Which-Quantization-Method-Is-Better\" >Which Quantization Method Is Better?<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-17\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Summary-and-Outlook\" >Summary and Outlook<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#Sources\" >Sources<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Evaluation-Setup\"><\/span>Evaluation Setup<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3><span class=\"ez-toc-section\" id=\"Which-LLMs-and-Datasets-Are-Used\"><\/span>Which LLMs and Datasets Are Used?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>We select four open-source LLMs, depending on their popularity and their size, to evaluate the combination. 
Here are the models we are using:<\/p>\n<ul>\n<li><strong><a href=\"https:\/\/huggingface.co\/meta-llama\/Llama-3.1-8B-Instruct\">Meta Llama 3.1<\/a> (8B)<\/strong><\/li>\n<li><strong><a href=\"https:\/\/huggingface.co\/microsoft\/Phi-3.5-mini-instruct\">Microsoft Phi-3.5-mini<\/a> (3.8B)<\/strong><\/li>\n<li><strong><a href=\"https:\/\/huggingface.co\/mistralai\/Mistral-7B-Instruct-v0.3\">Mistral v0.3<\/a> (7B)<\/strong><\/li>\n<li><strong><a href=\"https:\/\/huggingface.co\/HuggingFaceH4\/zephyr-7b-beta\">Hugging Face Zephyr Beta<\/a> (7B)<\/strong><\/li>\n<\/ul>\n<p>We also want to cover as much ground as possible by evaluating the different strategies for boosting the resulting performance. This is why we decided on these four different task types and respective data sets:<\/p>\n<ul>\n<li><strong>Multi-Label Classification (<a href=\"https:\/\/huggingface.co\/datasets\/dair-ai\/emotion\">CARER<\/a>)<br \/>\n<\/strong>Sentiment Classification on a collection of Tweets (sadness, joy, love, anger, fear, surprise)<\/li>\n<li><strong>Binary Classification (<a href=\"https:\/\/huggingface.co\/datasets\/stanfordnlp\/sst2\">SST-2<\/a>)<br \/>\n<\/strong>Sentiment Classification on a collection of movie reviews (positive, negative)<\/li>\n<li><strong>Question Answering (<a href=\"https:\/\/huggingface.co\/datasets\/rajpurkar\/squad\">SQuAD<\/a>)<br \/>\n<\/strong>Answering questions on Wikipedia articles by providing segments or spans from the reading passage<\/li>\n<li><strong>Summarization (<a href=\"https:\/\/huggingface.co\/datasets\/Samsung\/samsum\">SAMSum<\/a>)<br \/>\n<\/strong>Dialogue summarization on synthetic messenger-like conversations<\/li>\n<\/ul>\n<h3><span class=\"ez-toc-section\" id=\"Which-Strategies-Are-Used-for-the-Combination-of-LLMs\"><\/span>Which Strategies Are Used for the Combination of LLMs?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>To achieve a common consensus within the combination of LLMs, different strategies are tested. 
Depending on the task type, the approaches can change fundamentally.<br \/>\nMoreover, the strategies differ in their complexity. On the one hand, there are straightforward strategies that can reach consensus after one or two rounds. On the other hand, more complex strategies reach an agreement only after more than ten rounds and thus come with a correspondingly longer runtime. Ultimately, the best strategy is used for the final evaluation of boosting performance by combining LLMs.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Voting\"><\/span>Voting<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>In this strategy, we first hold a majority vote among the models in the combination. If there is a tie, we add up the weights of the models\/metrics for identical answers, and the prediction with the highest summed weight is selected. More information about the assigned weights can be found in the section about <a href=\"#Weighting\">Weighting<\/a>. If there is still a tie, we choose the final prediction randomly. This strategy is applied to the task types Classification and Question Answering.<\/p>\n<p>The following diagram depicts a simple example of the voting process in the Multi-Label Classification scenario. On the left-hand side, the predictions of the four models, together with their respective weights, are shown. To determine the representative answer, we simply count which class has been predicted the most. In this case, the class Joy received two votes, while the classes Sadness and\u00a0Fear received only one each. Therefore,\u00a0Joy gets selected as the common consensus.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-61443\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_voting.png\" alt=\"Diagram displaying four predictions and weights of four models in the combination of LLMs, following the Voting strategy for reaching consensus, where the majority prediction is selected, in the Multi-Label Classification task type. 
The prediction of Joy is made twice, while Sadness and Fear are made only once, therefore resulting in a consensus of Joy.\" width=\"1600\" height=\"400\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_voting.png 1600w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_voting-300x75.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_voting-1024x256.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_voting-768x192.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_voting-1536x384.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_voting-400x100.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_voting-360x90.png 360w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><\/p>\n<div style=\"text-align: center;\">Consensus Strategy Voting for the Multi-Label Classification task, where the majority prediction is selected as common consensus.<\/div>\n<h4><span class=\"ez-toc-section\" id=\"Weighting\"><\/span>Weighting<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>In contrast to the Voting strategy, the Weighting strategy has no initial majority vote. We simply add up the weights of the models\/metrics for identical answers, and the prediction with the highest summed weight wins. If there is a tie, we choose the final prediction randomly. This strategy is applied to the task types Classification, Question Answering, and Summarization.<\/p>\n<p>We assign weights differently for the various task types. For example, we assign weights to the LLMs themselves for Classification tasks, since the range of possible answers is fixed, and we want to determine the importance of each model when contributing to the combination. 
For Question Answering and Summarization tasks, the content of the predictions can be arbitrary, so we instead define and weight metrics, such as Cosine Similarity, that quantitatively compare the provided responses and let us determine which of the given answers is the most representative. We then optimize and normalize the weights on hold-out validation data for each applicable strategy and determine which one yields the best results.<\/p>\n<p>The following diagram shows the same setting as above, now using the Weighting strategy. Again, the predictions and weights of the four models are shown. To reach consensus, we now sum up the weights for identical predictions. Here, we find that Sadness yields a weight of 0.25, and even though\u00a0Joy gets predicted twice, its summed weight of 0.35 does not exceed the weight of 0.40 for\u00a0Fear. For this reason,\u00a0Fear gets selected as the common consensus.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-61447\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_weighting.png\" alt=\"Diagram displaying four predictions and weights of four models in the combination of LLMs, following the Weighting strategy for reaching consensus, where the class with the highest summed weight is selected, in the Multi-Label Classification task type. 
The prediction of Fear holds the highest weight, while Joy and Sadness hold lower summed weight values, therefore resulting in a consensus of Fear.\" width=\"1600\" height=\"400\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_weighting.png 1600w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_weighting-300x75.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_weighting-1024x256.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_weighting-768x192.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_weighting-1536x384.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_weighting-400x100.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_weighting-360x90.png 360w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><\/p>\n<div style=\"text-align: center;\">Consensus Strategy Weighting for the Multi-Label Classification task, where the class with the highest summed weight is selected as common consensus.<\/div>\n<h4><span class=\"ez-toc-section\" id=\"Second-Run-on-Ties\"><\/span>Second Run on Ties<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>The \u201cpure\u201d Voting and Weighting strategies are extended for Classification tasks by a second prediction round on entries where there is a tie. This means that we requeue the same entries for LLM processing. For the Multi-Label Classification, we also make a distinction between two input variants. In one variant, all classes are available to the LLMs in the second prediction round. In the other variant, only the classes predicted on this entry in the first round are available to the LLMs, thus limiting the selection of classes. This strategy is applied to the task type Classification.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Confidence-Filtering\"><\/span>Confidence Filtering<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Alternatively, the strategies can be extended with confidence filtering. 
With this, LLMs don&#8217;t just give their predictions; they also include a confidence level that shows how sure they are about their answers. This means that the LLMs return a normalized confidence score for each of their answers, which each model determines and generates itself. Then, uncertain predictions are filtered out. This way, we reach consensus on a reduced set. This strategy is applied to the task types Classification, Question Answering, and Summarization.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Multi-Agent-Debate-MAD\"><\/span>Multi-Agent Debate (MAD)<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>In this strategy, LLMs also give reasons why they think their answer is correct. Then, in a randomly ordered and anonymous manner, all LLMs in the combination exchange their predictions and reasoning. This means that each model receives three answer-and-reasoning pairs. Afterwards, based on the new info, the models generate a new prediction and reasoning in the next round (note that they can just decide to stick with their previous answers). This process is repeated until they reach a 90% consensus (meaning that on 90% of entries in a dataset, all models agree on the same answer), or no consensus increase is detected in a following round. This strategy is applied to the task type Classification. The following diagram displays the process described above.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-61452\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_mad-1.png\" alt=\"Diagram displaying one model on the left and the other models in the combination of LLMs on the right, following the Multi-Agent Debate strategy for reaching consensus, where the models exchange reasonings for their predictions and generate new answers until all of them agree. 
The diagram shows the flow of the exchange and generation of new predictions in the course of time.\" width=\"1000\" height=\"500\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_mad-1.png 1000w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_mad-1-300x150.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_mad-1-768x384.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_mad-1-400x200.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/consensus_mad-1-360x180.png 360w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/p>\n<div style=\"text-align: center;\">Consensus Strategy Multi-Agent Debate, where the models find a consensus based on their exchanged predictions and reasonings over multiple rounds.<\/div>\n<h2><span class=\"ez-toc-section\" id=\"Consensus-Results-When-Combining-LLMs\"><\/span>Consensus Results When Combining LLMs<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3><span class=\"ez-toc-section\" id=\"Did-We-Succeed-in-Boosting-Performance-by-Combining-LLMs\"><\/span>Did We Succeed in Boosting Performance by Combining LLMs?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>From our evaluation, we can conclude that boosting the performance by combining LLMs highly depends on the task. For example, in tasks like Multi-Label Classification and Summarization, we observed a noticeable improvement when multiple LLMs were used together. In some metrics, we even detected significant improvements compared to the best individual model in the combination. 
All in all, this definitely suggests that combining models can lead to better results in certain tasks compared to relying on just one.<\/p>\n\n<table id=\"tablepress-109\" class=\"tablepress tablepress-id-109\" aria-describedby=\"tablepress-109-description\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\"><div style=\"text-align: center\"><strong>Task Type<\/strong><\/div><\/th><th class=\"column-2\"><div style=\"text-align: center\"><strong>Metric<\/strong><\/div><\/th><th class=\"column-3\"><div style=\"text-align: center\"><strong>Best Combination<\/strong><\/div><\/th><th class=\"column-4\"><div style=\"text-align: center\"><strong>Best Single Model<\/strong><\/div><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"row-2\">\n\t<td class=\"column-1\"><div style=\"text-align: center\">Multi-Label Classification<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center\">F1-Score<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center\"><strong>0.5906<\/strong><\/div><\/td><td class=\"column-4\"><div style=\"text-align: center\"> 0.5777<\/div><\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\"><div style=\"text-align: center\">Binary Classification<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center\">F1-Score<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center\">0.9239<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center\"><strong>0.9252<\/strong><\/div><\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\"><div style=\"text-align: center\">Question Answering<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center\">Exact Match<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center\">0.5860<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center\"><strong>0.6341<\/strong><\/div><\/td>\n<\/tr>\n<tr class=\"row-5\">\n\t<td class=\"column-1\"><div style=\"text-align: center\">Summarization<\/div><\/td><td class=\"column-2\"><div 
style=\"text-align: center\">ROUGE-L<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center\"><strong>0.3494<\/strong><\/div><\/td><td class=\"column-4\"><div style=\"text-align: center\">0.3464<\/div><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span id=\"tablepress-109-description\" class=\"tablepress-table-description tablepress-table-description-id-109\"><div style=\"text-align:center\">Overview of the evaluation results regarding the most important metric for each task type. For the Multi-Label Classification and Summarization, the best combination was reached through the pure Voting strategy; for the Binary Classification and Question Answering it was reached through the pure Weighting strategy.<\/div><\/span>\n\n<p>The reasons for the decrease in performance in some tasks can be diverse. One possible explanation is that increased noise and complexity, together with potential overfitting, lead to inconsistent outcomes and thus to wrong final answers. Another likely reason is that the models in the combination have different strengths, so weaker models can drag down the consensus and keep better models from performing at their maximum. Furthermore, consensus strategies like Voting can lead to worse performance if LLMs that give the same wrong prediction prevail. 
To further boost the performance, especially in tasks where the combination did not outperform the best single model, there are various promising steps to take.<\/p>\n<p>Possible approaches could include covering an even broader spectrum of consensus-building techniques to select optimal responses, utilizing diverse model constellations to enhance response variance and creativity, and experimenting with various shot configurations and prompts to optimize task handling and evaluation outcome.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Are-Less-Complex-Strategies-Beneficial\"><\/span>Are Less Complex Strategies Beneficial?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Our evaluation reveals that simpler consensus strategies outperform the Multi-Agent Debate approach. Although the evaluation in this context focuses solely on Classification datasets, methods like Voting and Weighting achieve better results. Additionally, they are less computationally intensive and reach consensus more quickly.<\/p>\n\n<table id=\"tablepress-110\" class=\"tablepress tablepress-id-110\" aria-describedby=\"tablepress-110-description\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\"><div style=\"text-align: center\"><strong>Task Type<\/strong><\/div><\/th><th class=\"column-2\"><div style=\"text-align: center\"><strong>Metric<\/strong><\/div><\/th><th class=\"column-3\"><div style=\"text-align: center\"><strong>Best Strategy<\/strong><\/div><\/th><th class=\"column-4\"><div style=\"text-align: center\"><strong>MAD Approach<\/strong><\/div><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"row-2\">\n\t<td class=\"column-1\"><div style=\"text-align: center\">Multi-Label Classification<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center\">F1-Score<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center\"><strong>0.5906<\/strong><\/div><\/td><td class=\"column-4\"><div style=\"text-align: center\"> 0.5803<\/div><\/td>\n<\/tr>\n<tr 
class=\"row-3\">\n\t<td class=\"column-1\"><div style=\"text-align: center\">Binary Classification<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center\">F1-Score<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center\"><strong>0.9239<\/strong><\/div><\/td><td class=\"column-4\"><div style=\"text-align: center\"> 0.9208<\/div><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span id=\"tablepress-110-description\" class=\"tablepress-table-description tablepress-table-description-id-110\"><div style=\"text-align:center\">Comparison of the best consensus strategy and the MAD approach in terms of the most important metric for both Classification tasks. For the Multi-Label Classification, the best strategy was the pure Voting strategy; for the Binary Classification it was the pure Weighting strategy.<\/div><\/span>\n\n<h2><span class=\"ez-toc-section\" id=\"LLM-Quantization-Methods\"><\/span>LLM Quantization Methods<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In addition to evaluating the strategies themselves, we apply two widely used quantization methods for LLMs, AWQ and GPTQ. Then, based on the results, we&#8217;ll make recommendations for the quantization methods depending on the scenario and their benefits.<\/p>\n<p>In general, LLM Quantization is a technique used to make LLMs faster and more efficient without compromising too much on their accuracy. Essentially, it involves reducing the precision of the parameters\/weights the model uses to represent data. Think of it like rounding off decimals to make calculations quicker. This process helps models run faster and use less memory, which is especially important when dealing with large datasets and limited hardware resources. 
Different quantization methods can be used depending on the task, trading a small loss in accuracy for higher speed and lower memory usage.<\/p>\n<p>In our evaluation, we use post-training quantization approaches. As the name suggests, post-training quantization is a technique used to quantize a model after it has already been trained, without retraining it.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"AWQ\"><\/span>AWQ<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>AWQ (Activation-aware Weight Quantization) is a method for making a model faster and more memory-efficient by reducing the precision of its weights. It does so carefully, following these principles:<\/p>\n<ul>\n<li><strong>Activation-Aware<br \/>\n<\/strong>Traditional quantization methods treat weights and activations independently, meaning they don&#8217;t consider how the activations behave when adjusting the weights. In AWQ, the quantization process is aware of the activation patterns. Put simply, it looks at how data flows through the model and adjusts the weights accordingly.<\/li>\n<li><strong>Optimizing Precision<br \/>\n<\/strong>AWQ doesn&#8217;t treat all weights as equally important. Instead, it determines for each channel how much precision the weights need, based on the activations they will interact with during processing. In general, more important weights (with regard to activations) get &#8220;protected&#8221; by scaling their channels up before quantization, which reduces their relative quantization error. 
By doing so, the quantization process retains as much useful information as possible while still benefiting from the speed and memory-saving advantages of reduced precision.<\/li>\n<\/ul>\n<p>For these reasons, AWQ can be especially beneficial when deploying LLMs on resource-constrained devices or in cloud environments, where speed and memory efficiency are critical.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"GPTQ\"><\/span>GPTQ<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>GPTQ (introduced in the paper &#8220;GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers&#8221;) is a one-shot weight quantization technique that uses approximate second-order information, rather than gradient-based retraining, to minimize the loss of information during the quantization of weights. Here&#8217;s a breakdown of how GPTQ works:<\/p>\n<ul>\n<li><strong>Optimization of Quantization Error<br \/>\n<\/strong>The core of GPTQ is its focus on reducing the quantization error. When you reduce the precision of a model&#8217;s weights, you lose some details, which can lead to inaccuracies. GPTQ quantizes the weights step by step and adjusts the remaining, not yet quantized weights to compensate for the error introduced so far. In other words, GPTQ carefully decides how to lower precision in a way that causes the least harm to the model&#8217;s overall performance.<\/li>\n<li><strong>Batch Quantization<br \/>\n<\/strong>To optimize efficiency, GPTQ processes weights in batches to handle multiple weights at once, reducing the overall computation time. Furthermore, instead of quantizing the entire model at once, GPTQ applies layer-wise compression. Each layer&#8217;s weights are quantized individually, taking into account their specific weight distributions. This tailored approach helps preserve the model&#8217;s performance across different layers.<\/li>\n<\/ul>\n<p>Through these techniques, GPTQ optimizes the quantization process by minimizing error, improving efficiency, and maintaining model accuracy. 
This makes it a highly effective approach for compressing LLMs without significant performance loss.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Quantization-Results-When-Combining-LLMs\"><\/span>Quantization Results When Combining LLMs<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3><span class=\"ez-toc-section\" id=\"Which-Quantization-Method-Is-Better\"><\/span>Which Quantization Method Is Better?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Overall, AWQ performs slightly better than GPTQ, although again, performance depends heavily on the task. AWQ models provide significantly faster inference for short predictions and generally yield better results. However, GPTQ models sometimes offer significantly faster inference for long predictions, but they tend to deliver slightly poorer overall results in most cases.<br \/>\n\n<table id=\"tablepress-108\" class=\"tablepress tablepress-id-108 tbody-has-connected-cells\" aria-describedby=\"tablepress-108-description\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\"><div style=\"text-align: center\"><strong>Task Type<\/strong><\/div><\/th><th class=\"column-2\"><div style=\"text-align: center\"><strong>Metric<\/strong><\/div><\/th><th class=\"column-3\"><div style=\"text-align: center\"><strong>Better Method<\/strong><\/div><\/th><th class=\"column-4\"><div style=\"text-align: center\"><strong>Significant<\/strong><\/div><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"row-2\">\n\t<td rowspan=\"2\" class=\"column-1\"><div style=\"text-align: center;padding-top:25px;font-weight: bold;color:#061b5a\">Multi-Label Classification<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">Accuracy<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">AWQ<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u2717<\/div><\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td 
class=\"column-2\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u00d8 Inference Time<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">AWQ<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u2713<\/div><\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td rowspan=\"2\" class=\"column-1\"><div style=\"text-align: center;padding-top:25px;font-weight: bold;color:#061b5a\">Binary Classification<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">Accuracy<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">GPTQ<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u2717<\/div><\/td>\n<\/tr>\n<tr class=\"row-5\">\n\t<td class=\"column-2\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u00d8 Inference Time<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">AWQ<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u2713<\/div><\/td>\n<\/tr>\n<tr class=\"row-6\">\n\t<td rowspan=\"2\" class=\"column-1\"><div style=\"text-align: center;padding-top:25px;font-weight: bold;color:#061b5a\">Question Answering<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u00d8 Performance<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">AWQ<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u2717<\/div><\/td>\n<\/tr>\n<tr class=\"row-7\">\n\t<td class=\"column-2\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u00d8 Inference Time<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center;font-weight: 
normal;color:#061b5a\">GPTQ<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u2713<\/div><\/td>\n<\/tr>\n<tr class=\"row-8\">\n\t<td rowspan=\"2\" class=\"column-1\"><div style=\"text-align: center;padding-top:25px;font-weight: bold;color:#061b5a\">Summarization<\/div><\/td><td class=\"column-2\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u00d8 Performance<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">AWQ<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u2717<\/div><\/td>\n<\/tr>\n<tr class=\"row-9\">\n\t<td class=\"column-2\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u00d8 Inference Time<\/div><\/td><td class=\"column-3\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">GPTQ<\/div><\/td><td class=\"column-4\"><div style=\"text-align: center;font-weight: normal;color:#061b5a\">\u2717<\/div><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span id=\"tablepress-108-description\" class=\"tablepress-table-description tablepress-table-description-id-108\"><div style=\"text-align:center\">Comparison of the two quantization methods (AWQ and GPTQ) on the most important metrics for each task type, with an indication of whether the difference was found to be significant.<\/div><\/span>\n<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Summary-and-Outlook\"><\/span>Summary and Outlook<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>All in all, combining LLMs proved beneficial in certain contexts and effectively boosted performance. 
In particular, for Multi-Label Classification and Summarization, the combined models outperformed the best single model on several metrics.<br \/>\nWe also saw that simpler consensus-building methods can often outperform more complex approaches like MAD. Because they reach consensus within a single round, simpler methods also need less inference time, which makes them attractive in a wide range of scenarios.<br \/>\nIn addition, the comparison of quantization methods showed that AWQ models generally offer a better balance between efficiency and performance than GPTQ models. AWQ is particularly advantageous for tasks that require short responses, such as Classification. However, GPTQ may be preferable for tasks that need longer answers, such as Summarization.<\/p>\n<p>Moving forward, exploring additional datasets and task types can further validate these findings and potentially uncover new patterns. Increasing model diversity and experimenting with other quantization techniques could lead to even greater performance improvements. Moreover, user tests in practical applications can provide valuable feedback on the real-world applicability and benefits of these approaches.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Sources\"><\/span>Sources<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li><a href=\"http:\/\/dx.doi.org\/10.1073\/pnas.2305016120\">F. Gilardi, M. Alizadeh, and M. Kubli, 2023: ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2303.17580\">Y. Shen et al., 2023: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face<\/a><\/li>\n<li><a href=\"https:\/\/doi.org\/10.1007\/s10462-021-10097-x\">A. Amirkhani and A. H. Barshooi, 2022: Consensus in Multi-Agent Systems: A Review<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2305.14325\">Y. 
Du et al., 2023: Improving Factuality and Reasoning in Language Models through Multiagent Debate<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2306.00978\">J. Lin et al., 2024: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2210.17323\">E. Frantar et al., 2023: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Large Language Models (LLMs) are increasingly being used in automated environments to solve a wide range of tasks, such as categorizing customer feedback, summarizing lengthy research papers, and answering technical support queries. In this article, we test different strategies for boosting performance by combining LLMs.<\/p>\n","protected":false},"author":326,"featured_media":62669,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[509,511,373,206,817,140,141],"service":[76,431,75],"coauthors":[{"id":326,"display_name":"Robin Pavkovic","user_nicename":"rpavkovic"},{"id":146,"display_name":"Martin Kirchhoff","user_nicename":"mkirchhoff"}],"class_list":["post-61342","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-ai-2","tag-artificial-intelligence-2","tag-chatbot","tag-data-science","tag-inovex","tag-machine-learning","tag-nlp","service-artificial-intelligence","service-data-science","service-nlp"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Combining and Quantizing LLMs for Enhanced Performance - inovex GmbH<\/title>\n<meta name=\"description\" content=\"LLMs are increasingly used to solve a wide range of tasks. 
In this article, we test strategies for boosting performance by combining LLMs.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Combining and Quantizing LLMs for Enhanced Performance - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"LLMs are increasingly used to solve a wide range of tasks. In this article, we test strategies for boosting performance by combining LLMs.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2025-06-24T06:37:38+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Boostin_LLM_Performance.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1500\" \/>\n\t<meta property=\"og:image:height\" content=\"880\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Robin Pavkovic, Martin Kirchhoff\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Boostin_LLM_Performance-1024x601.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Robin Pavkovic\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta 
name=\"twitter:data2\" content=\"12\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Robin Pavkovic, Martin Kirchhoff\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/\"},\"author\":{\"name\":\"Robin Pavkovic\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/374a80c857b486f8f16fbcdc384d3a41\"},\"headline\":\"Combining and Quantizing LLMs for Enhanced Performance\",\"datePublished\":\"2025-06-24T06:37:38+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/\"},\"wordCount\":2380,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/Boostin_LLM_Performance.png\",\"keywords\":[\"Ai\",\"Artificial Intelligence\",\"Chatbot\",\"Data Science\",\"inovex\",\"Machine Learning\",\"nlp\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/\",\"name\":\"Combining and 
Quantizing LLMs for Enhanced Performance - inovex GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/Boostin_LLM_Performance.png\",\"datePublished\":\"2025-06-24T06:37:38+00:00\",\"description\":\"LLMs are increasingly used to solve a wide range of tasks. In this article, we test strategies for boosting performance by combining LLMs.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/Boostin_LLM_Performance.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/Boostin_LLM_Performance.png\",\"width\":1500,\"height\":880,\"caption\":\"Multiple Cards with LLM names swirling around a box. 
Some lay inside it.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/combining-and-quantizing-llms-for-enhanced-performance\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Combining and Quantizing LLMs for Enhanced Performance\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex 
GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/374a80c857b486f8f16fbcdc384d3a41\",\"name\":\"Robin Pavkovic\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/pb-96x96.png923061120235a455546ad57ffdd88ef4\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/pb-96x96.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/pb-96x96.png\",\"caption\":\"Robin Pavkovic\"},\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/rpavkovic\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Combining and Quantizing LLMs for Enhanced Performance - inovex GmbH","description":"LLMs are increasingly used to solve a wide range of tasks. In this article, we test strategies for boosting performance by combining LLMs.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/","og_locale":"de_DE","og_type":"article","og_title":"Combining and Quantizing LLMs for Enhanced Performance - inovex GmbH","og_description":"LLMs are increasingly used to solve a wide range of tasks. 
In this article, we test strategies for boosting performance by combining LLMs.","og_url":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2025-06-24T06:37:38+00:00","og_image":[{"width":1500,"height":880,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/Boostin_LLM_Performance.png","type":"image\/png"}],"author":"Robin Pavkovic, Martin Kirchhoff","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/Boostin_LLM_Performance-1024x601.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Robin Pavkovic","Gesch\u00e4tzte Lesezeit":"12\u00a0Minuten","Written by":"Robin Pavkovic, Martin Kirchhoff"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/"},"author":{"name":"Robin Pavkovic","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/374a80c857b486f8f16fbcdc384d3a41"},"headline":"Combining and Quantizing LLMs for Enhanced Performance","datePublished":"2025-06-24T06:37:38+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/"},"wordCount":2380,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Boostin_LLM_Performance.png","keywords":["Ai","Artificial Intelligence","Chatbot","Data Science","inovex","Machine 
Learning","nlp"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/","url":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/","name":"Combining and Quantizing LLMs for Enhanced Performance - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Boostin_LLM_Performance.png","datePublished":"2025-06-24T06:37:38+00:00","description":"LLMs are increasingly used to solve a wide range of tasks. In this article, we test strategies for boosting performance by combining LLMs.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/Boostin_LLM_Performance.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Boostin_LLM_Performance.png","width":1500,"height":880,"caption":"Multiple Cards with LLM names swirling around a box. 
Some lay inside it."},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/combining-and-quantizing-llms-for-enhanced-performance\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Combining and Quantizing LLMs for Enhanced Performance"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/374a80c857b486f8f16fbcdc384d3a41","name":"Robin 
Pavkovic","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/pb-96x96.png923061120235a455546ad57ffdd88ef4","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/pb-96x96.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/pb-96x96.png","caption":"Robin Pavkovic"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/rpavkovic\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/61342","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/326"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=61342"}],"version-history":[{"count":6,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/61342\/revisions"}],"predecessor-version":[{"id":62844,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/61342\/revisions\/62844"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/62669"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=61342"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=61342"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=61342"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=61342"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}