Large Language Models have become essential for many applications, but when it comes to repairing vulnerabilities there are important considerations to keep in mind.
In this blog post we discuss some of these considerations for repairing vulnerabilities using LLMs. We do this with a concrete example to show that they are not only of a theoretical nature. These effects should also be addressed for other LLMs and in other contexts, especially if you are using a pre-trained LLM.
Large Language Model
We used CodeT5 as a starting point, since it is a specialised LLM for code generation and understanding that has already been used extensively [1]. It is a pre-trained LLM trained on various programming languages, including Java. We then fine-tuned CodeT5 for the specific task of repairing Java deserialisation vulnerabilities.
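As a concrete starting point, the sketch below shows how such a pre-trained checkpoint could be loaded; the checkpoint name Salesforce/codet5-base is an assumption on our side, and any CodeT5 variant with the same encoder-decoder architecture would be loaded the same way.

```python
# Minimal sketch: loading a pre-trained CodeT5 checkpoint from Hugging Face.
# "Salesforce/codet5-base" is an assumed checkpoint name for illustration.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
```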
The Log4j (Log4Shell) vulnerability is a very famous example of this vulnerability type. Deserialisation vulnerabilities are usually rather severe, since an attacker can remotely execute code on the target's machine.
Dataset
For our dataset we used the MoreFixes dataset v1, which left us with a few hundred repair instances after our various preprocessing steps [2].
Defining a Repair
Defining what a repair is is a lot harder than it first seems, due to several problems. The first is the difficulty of checking whether the vulnerability still persists, especially if you do not have tests that can check this easily, or if you cannot even reproduce the vulnerability because you lack the full code and dependencies. One way to approach this is to use the commits that supposedly solved the issue as ground truth, which brings difficulties of its own (e.g., trusting that the vulnerability was actually solved and that nothing else is contained in those commits).
Other considerations for repairing vulnerabilities using LLMs include the context size (e.g., only the line of the vulnerability) and the formatting (e.g., a repair is the change that needs to be made to the original code so that it is no longer vulnerable). Choosing a good context size and formatting is not a trivial matter.
Preprocessing
The phrase "garbage in, garbage out" is probably well known, and it applies especially to LLMs. Due to the large quantities of data they are trained on, it is much harder to manually comb through the data to ensure high quality, and low-quality data can drastically decrease generalisability and overall performance. Since we noticed that, apart from tokenisation, this is often not discussed in much detail in the literature, we analysed some of its effects.
We wanted to see the effects of different preprocessing methods, so we fine-tuned the CodeT5 model multiple times and tested different combinations of processing steps. The model was fine-tuned to predict the diff from a vulnerable file to a repaired one, i.e., not the whole repaired file but only the changed parts. As input it is given only the original code (possibly formatted, depending on the dataset). The image below shows our whole processing pipeline.

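To make this input/output setup more concrete, here is a minimal sketch of a single supervised step under these assumptions; the file names and the maximum sequence lengths are placeholders chosen for illustration, not the exact values we used.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Placeholder training example: the vulnerable file is the model input,
# the (zero-context) unified diff of the fix is the target sequence.
vulnerable_source = open("VulnerableService.java").read()
repair_diff = open("VulnerableService.java.diff").read()

inputs = tokenizer(vulnerable_source, truncation=True, max_length=512,
                   return_tensors="pt")
labels = tokenizer(repair_diff, truncation=True, max_length=256,
                   return_tensors="pt").input_ids

# One supervised step; in practice this runs inside a training loop or a
# Seq2SeqTrainer over the whole fine-tuning dataset.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```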
Formatting
As discussed before, formatting our repair in a certain way may have different effects. For this we looked at different categories of how a repair could be formatted. One option we mentioned before is to use the changes that need to be made to remove the vulnerability. An example of this is the diff format used in GitHub repositories. It is much more efficient than using the entire repaired code, since it only needs to keep track of the changed lines. At the same time, the changes are still precisely defined, unlike in other formats where this can be a problem [3].
We decided to use Unidiff (similar to what is used in git commits) without additional context lines to minimise the context length. Since the context lines are mostly intended to help humans understand a change, they are not needed here, especially because the repaired output can be reconstructed from the original code and the diff alone. This enables us to use the whole vulnerable file as input while having a faster inference time and lower context-size requirements compared to predicting the whole output file. This input/output format is also simple to work with: the output is just a list of Unidiff lines, and it can be applied to the input using git with little effort.
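As an illustration, the sketch below produces such a zero-context unified diff with Python's difflib and applies a predicted diff using git; the file names are placeholders, and git apply needs the --unidiff-zero flag because it otherwise expects context lines around each hunk.

```python
import difflib
import subprocess

# Placeholder file names for illustration.
original = open("VulnerableService.java").read().splitlines(keepends=True)
repaired = open("RepairedService.java").read().splitlines(keepends=True)

# n=0 drops all context lines, keeping only hunk headers and changed lines.
diff = "".join(difflib.unified_diff(
    original, repaired,
    fromfile="a/src/main/java/VulnerableService.java",
    tofile="b/src/main/java/VulnerableService.java",
    n=0,
))

# Applying a (predicted) zero-context diff with git. --unidiff-zero is needed
# because git apply normally expects context lines around each hunk.
with open("predicted.patch", "w") as patch_file:
    patch_file.write(diff)
subprocess.run(["git", "apply", "--unidiff-zero", "predicted.patch"], check=True)
```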
Filtering
Another consideration for repairing vulnerabilities using LLMs is the need to filter out certain elements, either because they are duplicates or, more specifically in our case, because there are multiple different repairs for the same vulnerability. In general keeping several repairs can make sense, since a vulnerability could for instance be repaired using blacklists or whitelists, but for our type of LLM and evaluation it does not, because we have no good way of checking whether a prediction actually fixes the problem (e.g., using tests). Because of this, we decided to keep only one repair per vulnerability. Another solution would have been to compare a predicted repair against all possible repairs and take the best score, but this would bias the evaluation towards vulnerabilities with many different repairs.
Conversely, the same repair can also apply to different vulnerabilities, which makes sense code-wise if the vulnerable code is identical, but it can cause the model to focus mostly on those cases and therefore generalise less. Our goal is not to repair one specific vulnerable code section wherever it appears, but to learn from the different examples what a deserialisation vulnerability looks like and how it is repaired. Additionally, this can be seen as a form of data leakage if some of these instances end up in training and others in testing, since the output is identical and the input is likely to be quite similar in the vulnerable region.
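A minimal sketch of both filtering directions follows, assuming each record is a dictionary with hypothetical keys "input" (the vulnerable code) and "output" (the repair diff): "single value" keeps at most one repair per vulnerable input, and "injective" additionally keeps at most one vulnerable input per repair.

```python
def make_single_value(records):
    """Keep at most one repair per vulnerable input ("single value")."""
    by_input = {}
    for record in records:
        # The first repair seen for a given input is kept, later ones dropped.
        by_input.setdefault(record["input"], record)
    return list(by_input.values())


def make_injective(records):
    """Additionally keep at most one vulnerable input per repair ("injective")."""
    by_output = {}
    for record in make_single_value(records):
        by_output.setdefault(record["output"], record)
    return list(by_output.values())
```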
Splitting
As for splitting, one might argue that it is one of the most important processing steps, since a bad split could cause uneven coverage of different types of vulnerabilities (e.g., only showing repairs that use whitelisting) and thus hurt generalisation. Likely equally important is the dependence on time: we can only learn from vulnerabilities of the present and past, but want our model to also repair future vulnerabilities, which may change over time. This means that a random split would not yield reliable performance metrics, as it would only show how well the model can repair vulnerabilities that are likely similar to those it has already seen, and would therefore say less about generalisation.
One aspect that is important for pre-trained models is that, by the same logic, the test set used for fine-tuning should also come after the knowledge cutoff of the pre-trained model. Luckily, after splitting the dataset by time, the earliest repair in the test set came after the publication date of CodeT5. Such data leakage (from the pre-trained model into the fine-tuning test set) would strongly invalidate the claimed performance, since the model might have already seen, during pre-training, the exact data point it is asked to predict when testing the fine-tuned model.
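A minimal sketch of such a temporal split is shown below; the "commit_date" field, the test fraction, and the exact cutoff date are assumptions for illustration (CodeT5 was published on arXiv in September 2021, arXiv:2109.00859).

```python
from datetime import date

# Assumed knowledge cutoff of the pre-trained model: CodeT5 was published on
# arXiv in September 2021 (arXiv:2109.00859); adjust to the actual checkpoint.
CODET5_PUBLICATION = date(2021, 9, 2)


def temporal_split(records, test_fraction=0.2):
    """Chronological split: oldest records for training, newest for testing."""
    ordered = sorted(records, key=lambda record: record["commit_date"])
    split_index = int(len(ordered) * (1 - test_fraction))
    train, test = ordered[:split_index], ordered[split_index:]

    # Sanity check against leakage from pre-training into the test set.
    assert all(record["commit_date"] > CODET5_PUBLICATION for record in test), \
        "test set overlaps with the pre-training period of CodeT5"
    return train, test
```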
Evaluation
As hinted at before, evaluating the fine-tuned LLMs is not trivial, so we used three different metrics. The first is the percentage of perfect predictions (PP), where the prediction has to match the ground truth exactly. This is the standard metric used in the literature, but it can be complemented with the others to gain further insights.
The next metric is the list length prediction (LLP): the number of Unidiff lines in the output normalised by the number of lines in the ground truth. It helps to check whether the output has approximately the right length and thus to identify certain problems the model is having. It can take any non-negative value.
Lastly, the perfect entry prediction (PEP) measures the percentage of output lines that are predicted correctly and at the right position. Using this metric, especially in conjunction with PP, we can see more accurately whether we are close to getting fully correct predictions or whether we might have problems such as data leakage. If PEP and PP are very similar, most of the PEP score comes from predictions that are fully correct, so all other predictions must have a very low PEP value. On the other hand, if PP is small compared to PEP, the model is more likely to have actually learned something.
One could also look at the distribution of PEP values, since it shows how performance is spread across the test set. If there is a peak at the lower end and another at the upper end, this is a clear indication that there might be some data leakage and, in any case, that the model did not generalise properly.
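Under these definitions, the three metrics could be computed roughly as follows for a single prediction, where both the prediction and the ground truth are lists of Unidiff lines; normalising PEP by the ground-truth length is our reading of the definition, not necessarily the exact evaluation code.

```python
def perfect_prediction(pred, truth):
    """PP: 1 if the predicted diff matches the ground truth exactly, else 0."""
    return int(pred == truth)


def list_length_prediction(pred, truth):
    """LLP: number of predicted Unidiff lines normalised by the ground-truth length."""
    return len(pred) / len(truth)


def perfect_entry_prediction(pred, truth):
    """PEP: fraction of lines predicted correctly and at the right position."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)
```

Averaging PP over the test set gives the percentage of perfect predictions, and the per-example PEP scores can be plotted directly to inspect their distribution.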
Results
We fine-tuned and tested on 16 different datasets that we created. In the following, they are named using a subscript with semicolons separating the properties. The number between 1 and 3 stands for the formatting step, where 1 means no formatting, 2 means only linting, and 3 means linting and simplifying. This is followed by "sv", "ts" or "in", which stand for "single value", "temporal split" and "injective" respectively.
We can see that performance mostly increases when using linting, and even more when using linting and simplifying. "Single value" also mostly benefits performance, which makes sense since it removes the need to guess which of the theoretically correct repairs counts as the right one. The temporal split massively reduces performance, which most likely stems from the reduction of data leakage. Lastly, making the dataset injective also reduces performance, which likely comes from the fact that it removes "lazy behaviour": put very simply, the model learns that some repairs occur often and can easily score higher by generating those repairs more frequently.
Conclusion
In conclusion, we found that the processing of the dataset is hugely important: not only to increase performance using "single value" when there is no other way of dealing with multiple outputs for the same input, but also to obtain more credible results by reducing potential data leakage and by fine-tuning the model in a way that matches its use case more accurately.
This shows that the importance of dataset processing needs to be discussed and addressed more, as it seems to have received less and less attention. These considerations for repairing vulnerabilities using LLMs also do not seem to be bound to this specific task or context, and should therefore be addressed in other contexts as well.
Let it be noted that this should be done on other datasets as well, especially those whose output is easily verifiable, in order to get more conclusive results. Additionally, previous research could be revisited to analyse whether these methods lead to more reliable performance figures.
References
[1] Yue Wang et al. "CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation". arXiv preprint arXiv:2109.00859 (2021).
[2] J. Akhoundali, S. Rahim Nouri, K. F. D. Rietveld and O. Gadyatskaya. "MoreFixes: Largest CVE dataset with fixes". Zenodo, May 17, 2024. doi: 10.5281/zenodo.11199120.
[3] Zimin Chen, Steve Kommrusch, and Martin Monperrus. "Neural Transfer Learning for Repairing Security Vulnerabilities in C Code". IEEE Transactions on Software Engineering (2022). url: https://arxiv.org/pdf/2104.08308