This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Deep Learning Techniques for Differentiating Human-Written and AI-Generated Source Codes
Citations: 0
Authors: 3
Year: 2026
Abstract
The rapid growth of AI applications capable of writing or rewriting Java software poses a significant problem for educators and software development teams, who can no longer reliably tell whether code was written by a human or generated by an AI. To address this, the study aims to build a robust classifier that recognizes AI-generated Java code. The data sample was balanced, comprising twenty thousand samples from the GPTGCJ Java dataset, which includes both human-written and AI-generated programs. After preprocessing and preparation of the data, four feature extraction methods were used to represent different properties of the code: Abstract Syntax Tree (AST) features, token-based features, TF-IDF representations, and code embeddings. Each feature type was then tested with three models: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and XGBoost. To obtain sound and reliable results, all twelve feature-model combinations were evaluated using both a percentage split and k-fold cross-validation. Among all combinations tried, the CNN model with code embedding features achieved the best accuracy, 94.97%, and the most balanced performance. The findings demonstrate the importance of choosing an appropriate feature extraction method and model when analyzing AI-generated code. The proposed methodology is a practical and efficient way of detecting AI-generated Java code in both academic and professional settings.
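The evaluation design described in the abstract (four feature types crossed with three models, scored with k-fold cross-validation) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names `tokenize`, `tfidf`, and `k_fold_indices`, the regular expression used for tokenization, the smoothed IDF formula, the two Java snippets, and the choice of 5 folds over 10 samples are all assumptions made for illustration. The abstract does not state the value of k, the split percentage, or how TF-IDF was computed.

```python
import math
import re
from collections import Counter
from itertools import product

# Hypothetical tokenizer: identifiers/keywords, integer literals, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def tokenize(code):
    """Split a Java snippet into a flat list of lexical tokens."""
    return TOKEN_RE.findall(code)


def tfidf(docs):
    """Return one {token: weight} dict per tokenized document.

    Uses term frequency times a smoothed inverse document frequency,
    ln((1 + n) / (1 + df)) + 1 (one common variant, assumed here).
    """
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            tok: (c / total) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, c in counts.items()
        })
    return vectors


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# The twelve feature-model combinations named in the abstract (labels are ours).
FEATURES = ["ast", "token", "tfidf", "embedding"]
MODELS = ["ann", "cnn", "xgboost"]
COMBINATIONS = list(product(FEATURES, MODELS))

# Illustrative snippets only; these are not taken from the GPTGCJ dataset.
snippets = [
    "int sum = a + b;",
    "for (int i = 0; i < n; i++) { sum += i; }",
]
vectors = tfidf([tokenize(s) for s in snippets])
folds = list(k_fold_indices(10, 5))
```

In a real experiment each fold's training indices would be used to fit one of the twelve combinations and the held-out indices to score it, with samples shuffled (and ideally stratified by class) before splitting. The contiguous folds here are kept deliberately simple.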