Code Smells Quantification: A Case Study on Large Open Source Research Codebase

Swapnil Singh Chauhan, University of Texas at El Paso


Research software has opened up new pathways of discovery in many diverse disciplines. It is developed under unique budgetary and schedule constraints, often by an untrained, transient workforce of graduate students and postdocs. As a result, the quality of the software hinders its sustainability beyond the immediate research goals. More importantly, the prevalent reward structures favor contributions in the form of research articles and systematically undervalue research code contributions. Consequently, researchers and funding agencies do not allocate appropriate effort or resources to the development, sustenance, and dissemination of research codebases. At the same time, there is no uniform methodology to quantify codebase sustainability. Current methods adopt metrics with fixed thresholds that often do not appropriately characterize codebase quality and sustainability. In this thesis, we conduct a case study to investigate this phenomenon. We analyze a large-scale research codebase over a five-year period and compare its quality characteristics to those of a reference codebase developed by Google engineers over the same period. The case study suggests that the quality of both research and professional codebases tends to degrade over time, but the decline is much more prominent in research codebases. Similarly, the study found that the number of code quality violations as a percentage of the codebase is significantly higher for the research codebase. The study also reveals quality characteristics that are unique to professional codebases. For example, professionals tend to design software with smaller code units, possibly in anticipation that such units will grow in size over time. In contrast, the units of research codebases are significantly larger and become increasingly difficult to maintain over time.
The results of this thesis were published in the 2019 IEEE/ACM 14th International Workshop on Software Engineering for Science (SE4Science). This thesis is organized as follows. Chapter 1 provides background on code quality, presents a number of code quality metrics, and introduces the two codebases used in the study. Chapter 2 presents related work pertaining to software engineers’ perceptions of software quality, the evolution and impact of code quality metrics, and methods to quantify code quality, along with a review of key methodologies and tools to identify and quantify code quality. Chapter 3 presents the case study design and introduces the quantification metrics developed for the study. Chapter 4 presents the results, analysis, and key findings. We conclude the thesis and outline future work in Chapter 5.

Subject Area

Computer science

Recommended Citation

Chauhan, Swapnil Singh, "Code Smells Quantification: A Case Study on Large Open Source Research Codebase" (2019). ETD Collection for University of Texas, El Paso. AAI13857380.