{"id":55529,"date":"2025-03-28T11:27:30","date_gmt":"2025-03-28T10:27:30","guid":{"rendered":"https:\/\/www.inovex.de\/?p=55529"},"modified":"2025-06-18T13:42:52","modified_gmt":"2025-06-18T11:42:52","slug":"beating-gpt-4-fine-tuning-llama-3-for-software-security","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/","title":{"rendered":"Beating GPT-4: Fine-Tuning Llama-3 for Software Security"},"content":{"rendered":"<p>In this blog post, I will walk you through the process of fine-tuning and evaluating <a href=\"https:\/\/github.com\/meta-llama\/llama-models\/blob\/main\/models\/llama3\/MODEL_CARD.md\">Llama-3<\/a> and <a href=\"https:\/\/huggingface.co\/deepseek-ai\/DeepSeek-Coder-V2-Instruct\">DeepSeek-Coder<\/a> for detecting software vulnerabilities. I will share valuable insights and key takeaways gained from this work.<\/p>\n<p>Large Language Models (LLMs) have become present everywhere on the internet, known for their versatility in performing a wide range of tasks. At inovex, we leverage LLMs for various internal and external projects. 
In my thesis, I explored the potential of LLMs to detect software vulnerabilities and enhance their performance through fine-tuning.<\/p>\n<p><!--more--><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_79_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#The-Problem\" >The Problem<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#Experimental-Evaluation\" >Experimental Evaluation<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#Dataset\" >Dataset<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#Prompt-design\" >Prompt design<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#Results-of-pre-trained-LLMs\" >Results of pre-trained LLMs<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#Fine-Tuning-Llama-3-and-DeepSeek-Coder\" >Fine-Tuning Llama-3 and DeepSeek-Coder<\/a><ul class='ez-toc-list-level-3' ><li 
class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#Fine-Tuned-Results\" >Fine-Tuned Results<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#Performance-per-CWE\" >Performance per CWE<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#Conclusion\" >Conclusion<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#References\" >References<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"The-Problem\"><\/span>The Problem<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Software vulnerabilities can emerge in any project, no matter its size. Detecting them quickly is important, as undetected vulnerabilities can result in significant financial losses and erode user trust. [1]<br \/>\nTo address this issue, static application security testing (SAST) tools are commonly used, which rely on pattern matching, data and control flow analysis, and more. However, traditional SAST tools have limitations in identifying more complex vulnerabilities and cannot adapt to new ones. This is where advanced approaches, such as leveraging machine learning, come into play. These approaches are able to generalize to unseen cases and detect emerging vulnerabilities. 
By fine-tuning LLMs, I aimed to improve the detection of intricate vulnerabilities that conventional tools might miss.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Experimental-Evaluation\"><\/span>Experimental Evaluation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In the following, I will introduce the dataset and the prompts used for the experiments. Figure 1 shows the fine-tuning and evaluation process, offering a structured overview.<\/p>\n<figure id=\"attachment_56869\" aria-describedby=\"caption-attachment-56869\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-56869 size-large\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline-1024x680.png\" alt=\"overview of the llm fine-tuning process\" width=\"640\" height=\"425\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline-1024x680.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline-300x199.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline-768x510.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline-1536x1019.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline-1920x1274.png 1920w, https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline-400x265.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline-360x239.png 360w, https:\/\/www.inovex.de\/wp-content\/uploads\/vulnerability_detection_pipeline.png 2010w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><figcaption id=\"caption-attachment-56869\" class=\"wp-caption-text\">Figure 1: Fine-tuning and evaluation pipeline.<\/figcaption><\/figure>\n<h3><span class=\"ez-toc-section\" id=\"Dataset\"><\/span>Dataset<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>I chose to use the <a 
href=\"https:\/\/osf.io\/d45bw\/\">Draper VDISC<\/a> dataset, as it provides 1.27 million synthetic and real-world function-level samples of C and C++ source code. Each data point is labeled with its respective <a href=\"https:\/\/cwe.mitre.org\/index.html\">Common Weakness Enumeration (CWE)<\/a>, summarized in Table 1.<\/p>\n\n<table id=\"tablepress-100\" class=\"tablepress tablepress-id-100\" aria-describedby=\"tablepress-100-description\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">CWE<\/th><th class=\"column-2\">Frequency<\/th><th class=\"column-3\">Description<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-striping row-hover\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">CWE-120<\/td><td class=\"column-2\">3.70%<\/td><td class=\"column-3\">Buffer Copy without Checking Size of Input (\u201aClassic Buffer Overflow\u2018)<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">CWE-119<\/td><td class=\"column-2\">1.90%<\/td><td class=\"column-3\">Improper Restriction of Operations within the Bounds of a Memory Buffer<\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\">CWE-469<\/td><td class=\"column-2\">0.95%<\/td><td class=\"column-3\">Use of Pointer Subtraction to Determine Size<\/td>\n<\/tr>\n<tr class=\"row-5\">\n\t<td class=\"column-1\">CWE-476<\/td><td class=\"column-2\">0.21%<\/td><td class=\"column-3\">NULL Pointer Dereference<\/td>\n<\/tr>\n<tr class=\"row-6\">\n\t<td class=\"column-1\">CWE-other<\/td><td class=\"column-2\">2.70%<\/td><td class=\"column-3\">Improper Input Validation, Use of Uninitialized Variables, Buffer Access with Incorrect Length Value, etc.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span id=\"tablepress-100-description\" class=\"tablepress-table-description tablepress-table-description-id-100\">Table 1: Types of vulnerabilities in the Draper VDISC dataset and their frequencies across the entire dataset.<\/span>\n<!-- #tablepress-100 from cache -->\n<p>This dataset was chosen as most system-level 
vulnerabilities emerge from C and C++ code. Since both C and C++ allow for manual memory handling, they provide programmers with a high degree of control and flexibility over system resources. However, this flexibility comes at the cost of increased risk of memory management errors, such as buffer overflows, use-after-free vulnerabilities, and dangling pointers.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Prompt-design\"><\/span>Prompt design<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>As prompt design has a great impact on the performance of LLMs, I decided to use well-performing prompts from related works instead of creating new ones. The prompts used are displayed in Table 2. Both prompts ask for a binary response, allowing for easy automatic classification.<\/p>\n\n<table id=\"tablepress-101\" class=\"tablepress tablepress-id-101\" aria-describedby=\"tablepress-101-description\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">Prompt ID<\/th><th class=\"column-2\">Prompt<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-striping row-hover\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">PD-1 [2]<\/td><td class=\"column-2\">Is this code vulnerable? Answer in only Yes or No.<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">PD-2 [3]<\/td><td class=\"column-2\">I want you to act as a vulnerability detection system. Is the following function buggy? Please answer Yes or No.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span id=\"tablepress-101-description\" class=\"tablepress-table-description tablepress-table-description-id-101\">Table 2: Prompts used for the evaluation and fine-tuning.<\/span>\n<!-- #tablepress-101 from cache -->\n<h2><span class=\"ez-toc-section\" id=\"Results-of-pre-trained-LLMs\"><\/span>Results of pre-trained LLMs<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>To compare the effectiveness of the pre-trained and fine-tuned models, I conducted an evaluation of several pre-trained large language models. 
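To make the evaluation setup concrete, here is a minimal sketch of how one function-level sample can be wrapped in prompt PD-1 and prepared as a chat message. The `{"role", "content"}` message shape is the common chat-API convention, and the sample function is a made-up illustration, not taken from the dataset:

```python
# Sketch: wrap a C function sample in prompt PD-1 as a single chat message.
# The message format follows the common {"role", "content"} chat convention;
# the sample function below is illustrative, not taken from Draper VDISC.

PD_1 = "Is this code vulnerable? Answer in only Yes or No."

def build_messages(source_code: str) -> list:
    """Combine the fixed PD-1 prompt with one C/C++ function."""
    return [{"role": "user", "content": PD_1 + "\n\n" + source_code}]

sample = 'void copy(char *dst, char *src) { strcpy(dst, src); }'
messages = build_messages(sample)
```

Any chat-style API accepts messages in this shape; the returned text is then matched against "Yes"/"No" as described below.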
Among these, GPT-4-0613 was used as the baseline for comparison. This approach allowed for a clear assessment of the improvements gained through fine-tuning.<\/p>\n<p>Since most models struggle to provide strictly &#8220;yes&#8221; or &#8220;no&#8221; answers, I initially reviewed some outputs manually and identified specific keywords for pattern-matching the results. Examples of keywords for positive responses include \u201cyes\u201d and \u201cis vulnerable\u201d. Responses that could not be classified automatically using these keywords were then reviewed and classified manually.<\/p>\n\n<table id=\"tablepress-98\" class=\"tablepress tablepress-id-98\" aria-describedby=\"tablepress-98-description\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">Model<\/th><th class=\"column-2\">Prompt<\/th><th class=\"column-3\">Unclassified<\/th><th class=\"column-4\">Precision<\/th><th class=\"column-5\">Recall<\/th><th class=\"column-6\">F1-score<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-striping row-hover\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">Deepseek-coder-6.7b-instruct<\/td><td class=\"column-2\">PD-1<\/td><td class=\"column-3\">143<\/td><td class=\"column-4\">0.040<\/td><td class=\"column-5\">0.378<\/td><td class=\"column-6\">0.072<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">GPT-4-0613<\/td><td class=\"column-2\">PD-1<\/td><td class=\"column-3\">1<\/td><td class=\"column-4\">0.164<\/td><td class=\"column-5\">0.442<\/td><td class=\"column-6\">0.239<\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\">Llama-2-7b-chat<\/td><td class=\"column-2\">PD-1<\/td><td class=\"column-3\">129<\/td><td class=\"column-4\">0.052<\/td><td class=\"column-5\">0.957<\/td><td class=\"column-6\">0.099<\/td>\n<\/tr>\n<tr class=\"row-5\">\n\t<td class=\"column-1\">Llama-2-7b-chat<\/td><td class=\"column-2\">PD-2<\/td><td class=\"column-3\">394<\/td><td class=\"column-4\">0.051<\/td><td class=\"column-5\">1.000<\/td><td 
class=\"column-6\">0.098<\/td>\n<\/tr>\n<tr class=\"row-6\">\n\t<td class=\"column-1\">Llama-3-8B-Instruct<\/td><td class=\"column-2\">PD-1<\/td><td class=\"column-3\">138<\/td><td class=\"column-4\">0.073<\/td><td class=\"column-5\">0.638<\/td><td class=\"column-6\">0.131<\/td>\n<\/tr>\n<tr class=\"row-7\">\n\t<td class=\"column-1\">Mistral-7B-Instruct-v0.2<\/td><td class=\"column-2\">PD-1<\/td><td class=\"column-3\">360<\/td><td class=\"column-4\">0.055<\/td><td class=\"column-5\">0.513<\/td><td class=\"column-6\">0.100<\/td>\n<\/tr>\n<tr class=\"row-8\">\n\t<td class=\"column-1\">Mistral-7B-Instruct-v0.2<\/td><td class=\"column-2\">PD-2<\/td><td class=\"column-3\">941<\/td><td class=\"column-4\">0.103<\/td><td class=\"column-5\">0.600<\/td><td class=\"column-6\">0.177<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span id=\"tablepress-98-description\" class=\"tablepress-table-description tablepress-table-description-id-98\">Table 3: Comparing pre-trained models on vulnerability detection<\/span>\n<!-- #tablepress-98 from cache -->\n<p>The results indicated that the prompt design PD-1 performed greatly better than PD-2. Especially in providing a reasonable and classifiable output.<br \/>\nGPT-4 performed best overall and did a great job in only returning \u201cyes\u201c or \u201cno\u201c with just one sample remaining unclassified.<\/p>\n<p>For the fine-tuning process I chose Llama-3 and DeepSeek-Coder. Llama-3 has the best performance among the open-weight models, and DeepSeek-Coder was mainly trained on source-code, which could help with understanding the given code.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Fine-Tuning-Llama-3-and-DeepSeek-Coder\"><\/span>Fine-Tuning Llama-3 and DeepSeek-Coder<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>For fine-tuning both models, I used the script provided by Phil Schmid in his <a href=\"https:\/\/www.philschmid.de\/fsdp-qlora-llama3\">blog post<\/a> about fine-tuning LLama-3-70B. 
Below, I share insights gained from the experiments that could be useful for further work in this direction. The insights are largely the same for Llama-3 and DeepSeek-Coder, and some may also apply to other tasks where an LLM performs binary classification. They concern the chat template for Llama-3 and the hyperparameters.<\/p>\n<p>Chat template:<br \/>\nPhil Schmid provided an Anthropic-\/Vicuna-like chat template and used it in his work. In my experiments, I discovered that the Llama-3-specific chat template greatly improved performance: since the model was trained with the Llama-3 template, it can make better use of its prior knowledge. The number of unclassified samples from the fine-tuned models dropped from around 20% to zero.<\/p>\n<p>Batch size:<br \/>\nTypically, a larger batch size makes fine-tuning more stable. Here, however, a batch size greater than 4 led to almost no samples being classified as positive. As the dataset is imbalanced, each batch contains more negative samples than positive ones, which biases training towards negative responses.<\/p>\n<p>Training split:<br \/>\nSeveral approaches exist for composing a training dataset. Most sources suggest a 50:50 split [4] to reduce overfitting on the majority class. In my tests, a training dataset in which 25% of the samples were vulnerable worked best: the dataset does not provide enough vulnerable samples for a balanced split, and a 50:50 split led to too many false positives.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Fine-Tuned-Results\"><\/span>Fine-Tuned Results<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Table 4 shows the results from the best fine-tuned models as well as the best pre-trained model. 
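A training split with 25% vulnerable samples, as described above, can be produced with a simple resampling step. A minimal sketch on synthetic data (this is not the actual VDISC preprocessing code, just an illustration of the idea):

```python
import random

# Sketch: downsample negatives so that a chosen fraction (here 25%) of the
# training set is vulnerable. Synthetic data; illustrative only.

def make_split(samples, pos_fraction=0.25, seed=0):
    rng = random.Random(seed)
    pos = [s for s in samples if s["vulnerable"]]
    neg = [s for s in samples if not s["vulnerable"]]
    # pos_fraction=0.25 keeps three negatives per positive
    n_neg = int(len(pos) * (1 - pos_fraction) / pos_fraction)
    subset = pos + rng.sample(neg, min(n_neg, len(neg)))
    rng.shuffle(subset)
    return subset

data = [{"vulnerable": i % 20 == 0} for i in range(1000)]  # 5% positive
split = make_split(data)  # 50 positives + 150 negatives
```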
Fine-tuning the models greatly improves their performance and allows them to outperform GPT-4-0613, with an F1-score almost twice as high.<\/p>\n\n<table id=\"tablepress-102\" class=\"tablepress tablepress-id-102\" aria-describedby=\"tablepress-102-description\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">Model<\/th><th class=\"column-2\">Unclassified<\/th><th class=\"column-3\">Precision<\/th><th class=\"column-4\">Recall<\/th><th class=\"column-5\">F1-score<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-striping row-hover\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">Llama-3<\/td><td class=\"column-2\">0<\/td><td class=\"column-3\">0.391<\/td><td class=\"column-4\">0.457<\/td><td class=\"column-5\">0.422<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">Deepseek-Coder<\/td><td class=\"column-2\">0<\/td><td class=\"column-3\">0.277<\/td><td class=\"column-4\">0.440<\/td><td class=\"column-5\">0.340<\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\">GPT-4-0613<\/td><td class=\"column-2\">1<\/td><td class=\"column-3\">0.164<\/td><td class=\"column-4\">0.442<\/td><td class=\"column-5\">0.239<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span id=\"tablepress-102-description\" class=\"tablepress-table-description tablepress-table-description-id-102\">Table 4: Comparison between fine-tuned models and the best pre-trained model.<\/span>\n<!-- #tablepress-102 from cache -->\n<h3><span class=\"ez-toc-section\" id=\"Performance-per-CWE\"><\/span>Performance per CWE<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>An interesting measurement is the performance per CWE class. Table 5 shows the accuracy of the best-performing fine-tuned Llama-3 model. Its performance is noticeably low for <a href=\"https:\/\/cwe.mitre.org\/data\/definitions\/476.html\">CWE-476<\/a>, a NULL pointer dereference. This kind of vulnerability cannot be detected well from the function alone. 
The model likely needs more context about how the function is used. The performance for the CWE-other class is also considerably low, which could stem from the small number of examples per specific CWE in the training data.<\/p>\n\n<table id=\"tablepress-103\" class=\"tablepress tablepress-id-103\" aria-describedby=\"tablepress-103-description\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">CWE<\/th><th class=\"column-2\">Correct<\/th><th class=\"column-3\">Incorrect<\/th><th class=\"column-4\">Accuracy<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-striping row-hover\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">CWE-119<\/td><td class=\"column-2\">60<\/td><td class=\"column-3\">42<\/td><td class=\"column-4\">0.588<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">CWE-120<\/td><td class=\"column-2\">56<\/td><td class=\"column-3\">36<\/td><td class=\"column-4\">0.609<\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\">CWE-469<\/td><td class=\"column-2\">5<\/td><td class=\"column-3\">3<\/td><td class=\"column-4\">0.625<\/td>\n<\/tr>\n<tr class=\"row-5\">\n\t<td class=\"column-1\">CWE-476<\/td><td class=\"column-2\">5<\/td><td class=\"column-3\">51<\/td><td class=\"column-4\">0.089<\/td>\n<\/tr>\n<tr class=\"row-6\">\n\t<td class=\"column-1\">CWE-other<\/td><td class=\"column-2\">34<\/td><td class=\"column-3\">58<\/td><td class=\"column-4\">0.370<\/td>\n<\/tr>\n<tr class=\"row-7\">\n\t<td class=\"column-1\">Non-vulnerable<\/td><td class=\"column-2\">4401<\/td><td class=\"column-3\">249<\/td><td class=\"column-4\">0.946<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span id=\"tablepress-103-description\" class=\"tablepress-table-description tablepress-table-description-id-103\">Table 5: Performance for each CWE class achieved by the best-performing Llama-3 model.<\/span>\n<!-- #tablepress-103 from cache -->\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In this blog post, 
LLMs were applied to software vulnerability detection, both in their pre-trained and in their fine-tuned state. Fine-tuning greatly helps the model understand the given task and reduces the number of false positives, allowing it to beat GPT-4.<\/p>\n<p>Nevertheless, this experiment shows that there is still a lot of work to be done before LLMs can be used for reliable vulnerability detection. One important step is creating a large training dataset with reliable labels. Another step could include providing additional context to improve the classification of complex vulnerabilities.<\/p>\n<p>A <a title=\"How to Detect Software Vulnerabilities in Source Code Using Machine Learning\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-detect-software-vulnerabilities-in-source-code-using-machine-learning\/\">different work<\/a> by Feras Zaher-Alnaem using abstract syntax trees achieved better results than the fine-tuned LLMs.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"References\"><\/span>References<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>[1] A. Anwar et al., \u201cMeasuring the Cost of Software Vulnerabilities\u201d, 2020.<br \/>\n[2] M. D. Purba et al., \u201cSoftware Vulnerability Detection Using Large Language Models\u201d, 2023.<br \/>\n[3] B. Steenhoek et al., \u201cA Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection\u201d, 2024.<br \/>\n[4] S. Susan and A. Kumar, \u201cThe Balancing Trick: Optimized Sampling of Imbalanced Datasets\u2014A Brief Survey of the Recent State of the Art\u201d, 2021.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post, I will walk you through the process of fine-tuning and evaluating Llama-3 and DeepSeek-Coder for detecting software vulnerabilities. I will share valuable insights and key takeaways gained from this work. 
Large Language Models (LLMs) have become present everywhere on the internet, known for their versatility in performing a wide range of [&hellip;]<\/p>\n","protected":false},"author":417,"featured_media":61562,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[511,140,101],"service":[76,879],"coauthors":[{"id":417,"display_name":"Tim Diercks","user_nicename":"tdiercks"}],"class_list":["post-55529","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-artificial-intelligence-2","tag-machine-learning","tag-security","service-artificial-intelligence","service-security"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Beating GPT-4: Fine-Tuning Llama-3 for Software Security - inovex GmbH<\/title>\n<meta name=\"description\" content=\"Learn how we beat GPT-4 in software security, boosting performance and sharing insights from our process of fine-tuning Llama-3.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Beating GPT-4: Fine-Tuning Llama-3 for Software Security - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"Learn how we beat GPT-4 in software security, boosting performance and sharing insights from our process of fine-tuning Llama-3.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex 
GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2025-03-28T10:27:30+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-18T11:42:52+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Beating_GPT-4.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1500\" \/>\n\t<meta property=\"og:image:height\" content=\"880\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Tim Diercks\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Beating_GPT-4-1024x601.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Tim Diercks\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"6\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Tim Diercks\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/\"},\"author\":{\"name\":\"Tim Diercks\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/person\/9c0e53a5c0f6ac0196dfa1cf2f2dba50\"},\"headline\":\"Beating GPT-4: Fine-Tuning Llama-3 for Software 
Security\",\"datePublished\":\"2025-03-28T10:27:30+00:00\",\"dateModified\":\"2025-06-18T11:42:52+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/\"},\"wordCount\":1127,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/Beating_GPT-4.png\",\"keywords\":[\"Artificial Intelligence\",\"Machine Learning\",\"Security\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/\",\"url\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/\",\"name\":\"Beating GPT-4: Fine-Tuning Llama-3 for Software Security - inovex GmbH\",\"isPartOf\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/Beating_GPT-4.png\",\"datePublished\":\"2025-03-28T10:27:30+00:00\",\"dateModified\":\"2025-06-18T11:42:52+00:00\",\"description\":\"Learn how we beat GPT-4 in software security, boosting performance and sharing insights from our process of fine-tuning 
Llama-3.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#primaryimage\",\"url\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/Beating_GPT-4.png\",\"contentUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/Beating_GPT-4.png\",\"width\":1500,\"height\":880,\"caption\":\"Illustration einer Person, die an einem Laptop arbeitet, umgeben von einem Netzwerk aus Warnsymbolen und dem gro\u00dfen Schriftzug \\\"AI\\\".\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/beating-gpt-4-fine-tuning-llama-3-for-software-security\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.inovex.de\/de\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Beating GPT-4: Fine-Tuning Llama-3 for Software Security\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.inovex.de\/de\/#website\",\"url\":\"https:\/\/www.inovex.de\/de\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.inovex.de\/de\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\",\"name\":\"inovex 
Beating GPT-4: Fine-Tuning Llama-3 for Software Security. Written by Tim Diercks, published 2025-03-28, last modified 2025-06-18.