

Code smells are poorly designed code structures indicating that the code may need to be refactored. Recognizing code smells in practice is complex, and researchers strive to develop automatic code smell detectors. An obstacle to developing these solutions is the datasets’ limitations. Manually labeled datasets were collected to investigate the developers’ perceptions of code smells. They are characterized by a high label disagreement that hurts the performance of Machine Learning (ML) models trained using them. Furthermore, all large, manually labeled datasets are developed for Java. We recently created a novel dataset for C# to alleviate these issues. This paper evaluates ML code smell detection approaches on our novel dataset. We consider two feature representations to train ML models: (1) code metrics and (2) CodeT5 embeddings. This study is the first to consider the CodeT5 state-of-the-art neural source code embedding for code smell detection in C#. To prove the effectiveness of ML, we consider multiple metrics-based heuristics as alternatives. In our experiments, the best-performing approach was the ML classifier trained on code metrics (F-measure of 0.87 for Long Method and 0.91 for Large Class detection). However, the performance improvement over CodeT5 features is negligible if we consider the advantages of automatically inferring features. Finally, our ML model surpassed less experienced annotators and nearly matched the most experienced annotator, suggesting it can assist less experienced developers under tight deadlines. To the best of our knowledge, this is the first study to compare the performance of automatic smell detectors against human performance. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
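As a rough illustration of the evaluation described in the abstract, the sketch below applies a metrics-based heuristic for Long Method to a handful of methods and scores it with the F-measure against manual labels. Everything in it is an assumption made for illustration: the `MethodMetrics` type, the thresholds, the method names, and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass
class MethodMetrics:
    """Hypothetical container for two common method-level code metrics."""
    name: str
    lines_of_code: int
    cyclomatic_complexity: int


def is_long_method(m: MethodMetrics, loc_threshold: int = 50, cc_threshold: int = 10) -> bool:
    """Flag a method as Long Method if either metric exceeds its threshold.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return m.lines_of_code > loc_threshold or m.cyclomatic_complexity > cc_threshold


def f_measure(y_true: list[bool], y_pred: list[bool]) -> float:
    """Compute the F-measure (harmonic mean of precision and recall)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): metrics for five methods and their manual smell labels.
methods = [
    MethodMetrics("ParseConfig", 120, 14),
    MethodMetrics("GetName", 3, 1),
    MethodMetrics("Render", 60, 4),
    MethodMetrics("Validate", 40, 12),
    MethodMetrics("Sync", 45, 8),
]
labels = [True, False, False, True, True]

predictions = [is_long_method(m) for m in methods]
score = f_measure(labels, predictions)
print(f"F-measure of the heuristic on the toy data: {score:.2f}")
```

An ML classifier trained on code metrics or on embeddings would replace `is_long_method` with a learned decision function; the scoring step would stay the same.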
| Indexing category | Terms |
|---|---|
| Engineering controlled terms | Computer software selection and evaluation; Encoding (symbols); Large datasets; Machine learning; Odors |
| Engineering uncontrolled terms | Code embedding; Code metrics; Code smell; Embeddings; Labeled dataset; Machine learning models; Machine-learning; Performance; Software Quality; Source code metrics |
| Engineering main heading | Embeddings |

| Funding sponsor | Funding number | Acronym |
|---|---|---|
| Science Fund of the Republic of Serbia | 6521051 | |
| Science Fund of the Republic of Serbia | 451-03-47/2023-01/200156 | |

This research was supported by the Science Fund of the Republic of Serbia, Grant No. 6521051, AI-Clean CaDET, and by the Ministry of Science, Technological Development and Innovation through project no. 451-03-47/2023-01/200156 “Innovative scientific and artistic research from the FTS (activity) domain”.
Slivka, J.; Department of Computing and Control Engineering, Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovića 6, Novi Sad, Serbia