

Code smells are structures in code that may indicate maintainability issues. They are challenging to define, and software engineers detect them differently. An AI code smell detector could mitigate this problem, but developing one requires a standardized benchmark dataset. Existing datasets suffer from (1) annotation subjectivity, (2) lack of ground-truth consensus among annotators, and (3) reproducibility issues. This paper aims to develop a systematic manual code smell annotation procedure that addresses these issues. We tailored the prescriptive natural language processing annotation methodology to code smell detection: (1) we cross-validate annotations to mitigate subjectivity, (2) we develop clear annotation guidelines to reach ground-truth consensus, and (3) we follow literature recommendations for reproducibility and open-source our tools and dataset. We extracted the annotation guidelines from existing empirical code smell research. The annotators refined the guidelines and their understanding of the task through a proof-of-concept annotation round, which included retrospective discussion and disagreement resolution, and then performed the full annotation. We confirmed that ground-truth consensus was reached by measuring annotation consistency. Our contributions are the proposed annotation procedure, a novel code smell dataset of open-source C# projects, the annotators' experience report, and the open-sourced supporting tool. © 2023 Elsevier B.V.
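As an illustration of what "measuring annotation consistency" can look like in practice, the following minimal sketch computes Cohen's kappa for two annotators who labeled the same code snippets. The abstract does not say which agreement measure the authors used, so the choice of Cohen's kappa, the function name `cohen_kappa`, and the `"smell"`/`"clean"` labels are all assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    Hypothetical helper for illustration; not taken from the paper's tooling.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")

    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Assumed example labels for four code snippets.
annotator_1 = ["smell", "smell", "clean", "clean"]
annotator_2 = ["smell", "clean", "clean", "clean"]
print(cohen_kappa(annotator_1, annotator_2))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. With more than two annotators or ordinal severity labels, a different coefficient would likely be more appropriate.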
| Term type | Terms |
|---|---|
| Engineering controlled terms | Natural language processing systems; Odors; Open source software; Open systems |
| Engineering uncontrolled terms | Benchmark datasets; Code smell; Ground truth; Manual annotation; Manual codes; Natural languages; Novel dataset; Open-source; Reproducibilities; Software Quality |
| Engineering main heading | Computer software selection and evaluation |
| Funding sponsor | Funding number | Acronym |
|---|---|---|
| Ministry of Science, Technological Development and Innovation | 451-03-47/2023-01/200156 | |
| Science Fund of the Republic of Serbia | 6521051 | |
This research was supported by the Science Fund of the Republic of Serbia, Grant No. 6521051, AI-Clean CaDET, and the Ministry of Science, Technological Development and Innovation through project no. 451-03-47/2023-01/200156 “Innovative scientific and artistic research from the FTS (activity) domain.” Our funders had no involvement in the study design, collection, analysis, and interpretation of the data, writing of the report, or the decision to submit the article for publication.
Slivka, J.; Department of Computing and Control Engineering, Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia;