Date of Award
Doctor of Philosophy
Mahmud Shahriar Hossain
The rapid growth of software systems demands meticulous planning and maintenance to accommodate the evolution of the code base over extended periods. Without maintenance, software systems become more complex, decline in quality, and eventually become unsustainable. Software engineers who perform maintenance strive to improve code quality and minimize code smells in a timely manner. Several techniques have been used to assess code quality or detect code smells as part of software maintenance. Most of these techniques are based on heuristics, which build detection rules from a few metrics. These approaches achieve reasonable accuracy but do not work in cross-project evaluation. Recent efforts to devise automatic Machine Learning (ML) based techniques for code quality or code smell detection have achieved unsatisfactory results so far. Reasons include the use of small datasets, few input features, within-project classification, or a lack of user-friendly tools for data collection.
This dissertation explores modern techniques in Mining Software Repositories (MSR) and the use of machine learning approaches to identify code smells, code quality, and issue labels. The mining process is optimized through phase-by-phase caching and efficient data retrieval from open-source platforms. To identify code quality attributes, traditional machine learning approaches were applied to a large set of metrics. To identify code smells, both traditional and neural network-based ML techniques were used. A deep learning-based technique is proposed to classify the labels of reported issues.
The first contribution of this dissertation is the development of a novel mining tool for extracting software artifacts. The proposed tool, ModelMine, is capable of mining software repositories, issues, and files from open-source platforms. A synthesized dataset containing code quality, issue, and code smell data is created. The second contribution is the application of ML approaches to classify code quality attributes and a comparison of their performance. The evaluation results indicate that Random Forest (RF) significantly improves accuracy without generating the false negatives or false positives that can result in false alarms in code quality classification. The third contribution is the investigation of unexpected side effects (code smells and technical debt) in software repositories. The analysis revealed that the quality of handwritten code is affected by higher levels of technical debt and code smells. The results also show that neural network-based ML approaches perform better than traditional ML approaches in classifying code smells. The fourth contribution of this research is the development of a deep learning approach for issue label classification. The results show that the proposed approach outperforms existing research in classifying issue labels.
In conclusion, this dissertation presents a comprehensive study of modern techniques in the MSR field and of machine learning approaches to identify code quality, code smells, and issue labels. The results of this research have practical implications for software quality assurance and issue management. They also provide a foundation for machine learning approaches in software maintenance activities. To further improve the performance of the ML models, I will incorporate a larger dataset by enhancing the ModelMine tool. The future research plan includes developing a methodology that uses NLP techniques to extract insights from textual data associated with software code and investigating the use of VR-based data visualizations for software maintenance.
Received from ProQuest
Sayed Mohsin Reza
Reza, Sayed Mohsin, "Analyzing Software Maintenance Through Machine Learning and Mining Software Repositories Approaches" (2023). Open Access Theses & Dissertations. 3844.