Analysis of software code preprocessing methods to improve the effectiveness of using large language models in vulnerability detection tasks
- Authors: Charugin V.V.¹, Charugin V.V.¹, Stavtsev A.V.¹, Chesalin A.N.¹
 - Affiliations: ¹ MIREA – Russian Technological University
 - Issue: Vol 12, No 3 (2025)
 - Pages: 67-79
 - Section: SYSTEM ANALYSIS, INFORMATION MANAGEMENT AND PROCESSING, STATISTICS
 - URL: https://journals.eco-vector.com/2313-223X/article/view/695701
 - DOI: https://doi.org/10.33693/2313-223X-2025-12-3-67-79
 - EDN: https://elibrary.ru/BCEAHN
 - ID: 695701
 
Abstract
As software systems grow in scale and complexity, the need for intelligent vulnerability detection methods increases. One such method uses large language models trained on source code, which can analyze and classify vulnerable code segments at early stages of development. The effectiveness of these models depends on how the code is represented and how the input data is prepared, so preprocessing methods can significantly affect a model's accuracy and robustness. Purpose of the study: to analyze the impact of various code preprocessing methods on the accuracy and robustness of large language models (CodeBERT, GraphCodeBERT, UniXcoder) in vulnerability detection tasks. The analysis uses source code changes extracted from commits associated with vulnerabilities documented in the CVE database. Research methodology: an experimental analysis of the effectiveness and robustness of CodeBERT, GraphCodeBERT, and UniXcoder in the vulnerability classification task, with model performance assessed using the Accuracy and F1 score metrics. Research results: estimates of the effectiveness of different code preprocessing methods when applying large language models to vulnerability classification tasks.
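For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how Accuracy and F1 score are conventionally computed for a binary vulnerable/non-vulnerable classifier. It is not taken from the article: the function names, the label encoding (1 = vulnerable, 0 = not vulnerable), and the sample labels are assumptions made for illustration only.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and/or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical labels: 1 = vulnerable change, 0 = non-vulnerable change.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print("accuracy:", accuracy(y_true, y_pred))
    print("f1:", f1_score(y_true, y_pred))
```

Accuracy alone can look good on imbalanced data where most samples are non-vulnerable, which is one common reason to report F1 alongside it.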
About the authors
Valery Charugin
MIREA – Russian Technological University
Author for correspondence.
Email: charugin_v@mirea.ru
ORCID iD: 0009-0003-4950-7726
SPIN-code: 4080-4997
Lecturer, Department of Computer and Information Security, Institute of Artificial Intelligence
Russian Federation, Moscow

Valentin Charugin
MIREA – Russian Technological University
Email: charugin@mirea.ru
ORCID iD: 0009-0001-1450-0714
SPIN-code: 7264-9403
Lecturer, Department of Computer and Information Security, Institute of Artificial Intelligence
Russian Federation, Moscow

Alexey Stavtsev
MIREA – Russian Technological University
Email: stavcev@mirea.ru
SPIN-code: 4948-2180
Cand. Sci. (Phys.-Math.), Associate Professor, Department of Computer and Information Security, Institute of Artificial Intelligence
Russian Federation, Moscow

Alexander Chesalin
MIREA – Russian Technological University
Email: chesalin_an@mail.ru
ORCID iD: 0000-0002-1154-6151
SPIN-code: 4334-5520
Cand. Sci. (Eng.), Associate Professor, Head, Department of Computer and Information Security, Institute of Artificial Intelligence
Russian Federation, Moscow

References
- Charugin V.V., Chesalin A.N. Analysis and formation of network traffic datasets for computer attack detection. International Journal of Open Information Technologies. 2023. Vol. 11. No. 6. (In Rus.)
 - Busko N.A., Fedorchenko E.V., Kotenko I.V. Automatic evaluation of exploits based on deep learning methods. Ontology of designing. 2024. (In Rus.)
 - Li Y., Li X., Wu H. et al. Everything you wanted to know about LLM-based vulnerability detection but were afraid to ask. 2025. doi: 10.48550/arXiv.2504.13474.
 - Liu C., Chen X., Li X. et al. Making vulnerability prediction more practical: prediction, categorization, and localization. Information and Software Technology. 2024. Vol. 171.
 - Drozdov V.A., Yakovlev O.V. Application of large language models for vulnerability analysis. Scientific aspect, № 6-2024 – Inform. Technologies. 2024. (In Rus.)
 - Charugin V.V., Charugin V.V., Chesalin A.N., Ushkova N.N. Constructor of natural language processing blocks and its application to log structuring in information security. International Journal of Open Information Technologies. 2024. Vol. 12. No. 9. (In Rus.)
 - Ridoy S.Z., Shaon M.S.H., Cuzzocrea A. et al. EnStack: An ensemble stacking framework of large language models for enhanced vulnerability detection in source code. 2024. doi: 10.48550/arXiv.2411.1656.
 - Sultan M.F., Karim T., Shaon M.S.H. et al. A combined feature embedding tools for multi-class software defect and identification. 2024. doi: 10.48550/arXiv.2411.17621.
 - Feng Z., Guo D., Tang D. et al. CodeBERT: A pre-trained model for programming and natural languages. 2020. doi: 10.48550/arXiv.2002.08155.
 - Guo D., Ren S., Lu S. et al. GraphCodeBERT: Pre-training code representations with data flow. 2020. doi: 10.48550/arXiv.2009.08366.
 - Guo D., Lu S., Duan N. et al. UniXcoder: Unified cross-modal pre-training for code representation. 2022. doi: 10.48550/arXiv.2203.03850.
 - Karthik K., Moharir M., Jeevan S. et al. Temporal analysis and Common Weakness Enumeration (CWE) code prediction for software vulnerabilities using machine learning. In: 8th International Conference on Computational System and Information Technology for Sustainable Solutions. 2024.
 - Li Z., Zou D., Xu S. et al. VulDeePecker: A deep learning-based system for vulnerability detection. 2018. doi: 10.48550/arXiv.1801.01681.
 - Zheng T., Liu H., Xu H. et al. Few-VulD: A few-shot learning framework for software vulnerability detection. Computers & Security. 2024. Vol. 144.
 - Bhandari G.P., Naseer A., Moonen L. CVEfixes: Automated collection of vulnerabilities and their fixes from open-source software. 2021. doi: 10.48550/arXiv.2107.08760.
 - Pereira D.G., Afonso A., Medeiros F.M. Overview of Friedman's test and post-hoc analysis. Communications in Statistics – Simulation and Computation. 2015. Vol. 44.
 - Pohlert T. PMCMR: Calculate pairwise multiple comparisons of mean rank sums. 2016. doi: 10.32614/CRAN.package.PMCMR.