This section presents the results of the software defect prediction models, beginning with the original dataset, comparing the retention of highly correlated features with their removal, without any attempt to adjust for the class imbalance.
4.1. Performance of Machine Learning Algorithms on Original Dataset
Table 6 presents the results obtained from the original dataset while keeping the majority of the features, and
Table 7 shows the performance of the SDP models on the original dataset with reduced features, obtained by removing highly correlated features.
Both tables show the project name, the machine learning technique applied, and the evaluation metrics of accuracy, precision, recall, F1-score, and AUC. Bold numbers indicate the highest value of each evaluation metric for each of the selected projects.
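The evaluation metrics listed above can be computed per project with scikit-learn; the following is a minimal sketch, assuming y_test, y_pred, and y_score (predicted probability of the defective class) are available for a trained classifier, rather than the exact code used in this study.

```python
# Minimal sketch (assumed variable names): computing the five reported metrics
# for one project's test split with scikit-learn.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_test, y_pred, y_score):
    """y_pred: predicted labels; y_score: predicted probability of the defective class."""
    return {
        "accuracy":  accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall":    recall_score(y_test, y_pred, zero_division=0),
        "f1":        f1_score(y_test, y_pred, zero_division=0),
        "auc":       roc_auc_score(y_test, y_score),
    }
```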
When all the features were retained in the datasets, the accuracy of the models ranged between 67% and 96%. The maximum precision was 67%, obtained when the KNN algorithm was applied to the jm1 project. The jm1 project had 4574 instances, making it larger than all the other datasets used in this study except the cross-project (CP) dataset. The CP project had precision values ranging from 21% to 34%, depending on the algorithm applied. Across all the projects, precision varied widely, from zero to a maximum of 67%. The highest recall across these selected projects was 40%, achieved when the SVM algorithm was applied to the CP project. The maximum F1-score was 28%, with zero or very low scores for the majority of the algorithms. The highest AUC score was 60%. Overall, the ANN and RF algorithms performed relatively better than the other algorithms on the selected datasets.
The AUC score of most of the algorithms was around 50%, apart from a few exceptions with the random forest or ANN algorithms. This does not give us confidence that the models are fully reliable, as the majority of the predictions are probably biased toward the non-defective class. Applying feature selection to the datasets by removing the highly correlated features, as shown in
Table 7, did not demonstrate a significant difference from the results obtained in
Table 6, where all the features were retained. With the reduced features, the accuracy ranged from 72% to 95%, whereas when the majority of the features were kept, it ranged from 67% to 96%; that is, the lower bound of accuracy was slightly higher for the reduced-feature datasets.
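The removal of highly correlated features referred to here can be sketched as follows; the correlation threshold of 0.9 is an illustrative assumption rather than the setting used in this study.

```python
# Minimal sketch: drop one feature from every pair of highly correlated
# independent features. The 0.9 threshold is an assumption, not the paper's setting.
import numpy as np
import pandas as pd

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = X.corr().abs()
    # Upper triangle (excluding the diagonal) so each feature pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```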
The AUC score was slightly better for the pc1 project with reduced features when the random forest algorithm was applied, with an AUC of 63% and an accuracy of 95%. In terms of precision, except for the random forest model on the cm1 project, where every instance predicted as defective was truly defective, the score varied for the other projects, ranging from zero to 50%. The maximum recall on the reduced-feature datasets was 29%, compared to 40% when all the features were retained.
On the reduced-feature datasets, the maximum F1-score was 31%, obtained when the random forest algorithm was applied to the pc1 project. With all features, the highest F1-score was 38%, obtained by the random forest algorithm on the kc2 project, whereas the same algorithm scored zero on the pc1 project. On the reduced-feature datasets, the random forest algorithm performed better than the other algorithms, with relatively higher accuracy and AUC scores across all projects. To interpret this random forest model on the original dataset with reduced features, the LIME technique was applied to a single instance of the pc1 project to illustrate the local explanation, while SHAP was applied to the same project to provide the global explanation.
Figure 3 demonstrates the local and global interpretation of the pc1 project when the random forest algorithm was applied to the dataset with reduced features.
Figure 3a demonstrates the local interpretation by the LIME technique for the selected instance. The LIME model predicted the outcome of this instance as 0, with a 77% probability of the instance being non-defective and a 23% probability of it being defective. The orange color shows the contribution towards being defective, and the blue represents the contribution towards being non-defective. The right-hand side of the figure shows that the values of loc (lines of code), l (length), LOComment, and IOBlank contribute towards the module being defective, whereas the cyclomatic complexity, essential complexity, and effort values push the model towards predicting this instance as non-defective. Since the total feature contribution towards the non-defective class is higher, this instance is classified as 0, or non-defective.
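A local LIME explanation of this kind can be produced as in the following sketch, assuming a fitted random forest rf and train/test splits X_train and X_test for the pc1 project (the variable names and the number of displayed features are assumptions):

```python
# Minimal sketch (assumed names rf, X_train, X_test): a LIME local explanation
# for a single pc1 test instance classified by the random forest model.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=X_train.columns.tolist(),
    class_names=["non-defective", "defective"],
    mode="classification",
)
exp = explainer.explain_instance(X_test.values[0], rf.predict_proba, num_features=7)
print(exp.as_list())      # per-feature contributions for this single instance
# exp.show_in_notebook()  # renders the probability bars as in Figure 3a
```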
Figure 3b shows the global explanation of the same project when SHAP was applied. This graph shows that IOBlank makes the largest contribution to the model, followed by loc, effort, length, and the other features. The blue and red color distribution shows that there was no bias towards predicting defective versus non-defective modules.
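A global SHAP summary of this kind can be generated as sketched below for the same tree-based model; the variable names rf and X_test are assumptions.

```python
# Minimal sketch (assumed names rf, X_test): a SHAP global explanation of the same
# random forest model; the summary plot ranks features by mean absolute SHAP value.
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
# Older SHAP versions return one array per class for a binary classifier;
# index the defective class before plotting if that is the case.
sv = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(sv, X_test)
```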
4.2. Performance of Machine Learning Algorithms on Balanced Dataset
Table 8 presents the results obtained from the dataset after the oversampling technique SMOTE was applied while keeping the majority of the features, and
Table 9 shows the performance of the SDP models on the balanced dataset where highly correlated features were removed.
Both tables show the project name, the machine learning technique applied, and the evaluation metrics of accuracy, precision, recall, F1-score, and AUC, as reported for the model performance on the original datasets. When all the features were retained in the datasets, the accuracy of the models ranged between 61% and 96%. The maximum precision was 47%, obtained when the KNN algorithm was applied to the kc2 project. The maximum recall was 71%, which is much higher than the recall found on the original dataset. The F1-score ranged from 16% to 53%, whereas on the original dataset the maximum F1-score was 28%, with zero or very low scores for the majority of the algorithms. The highest AUC score on the balanced dataset was 77%, for the pc1 project when the ANN algorithm was applied, whereas the maximum AUC score on the all-feature original dataset was 60%. The performance of the algorithms varied across projects with different numbers of instances.
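The SMOTE oversampling referred to above can be applied with imbalanced-learn as in the following sketch, assuming training splits X_train and y_train; as a general precaution, oversampling is applied to the training data only so that no synthetic samples reach the test set.

```python
# Minimal sketch (assumed names X_train, y_train): oversample the defective class
# with SMOTE on the training split only, so no synthetic samples reach the test set.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```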
Applying feature selection to the datasets by removing the highly correlated features, as shown in
Table 9, did not demonstrate a significant difference from the results obtained in
Table 8, where all the features were retained. The accuracy score ranged from 56% to 92%. However, when the majority of the features were kept, the accuracy ranged from 61% to 96%.
The minimum AUC score was 53% and the highest was 71%. The minimum precision score was 18% and the maximum was 38%. On the original dataset, several algorithms showed zero precision, meaning that the models' ability to correctly predict defective instances was not satisfactory on the original dataset.
Recall showed all positive values, ranging from 15% to 71%. The F1-score also showed all positive values, unlike on the original dataset, where a few of the algorithms returned zero scores for these projects.
Table 10 demonstrates the final evaluation by summarizing the content obtained from
Table 8 and
Table 9. We considered accuracy and the AUC score before selecting a model for applying the model-agnostic techniques. Although a high AUC score does not guarantee high values for precision, recall, and the F1-score, we note from this table that when the AUC was high, the accuracy, precision, recall, and F1-score were within an acceptable, non-zero range on the balanced dataset.
In
Table 10, the projects are listed in order of dataset size, where the number of instances available in the cleaned datasets is as shown in
Table 5. Bold numbers show the highest accuracy and AUC scores for each project, whereas the yellow-highlighted fields represent the model with the highest AUC score for the selected project. Since the underlying data were imbalanced, rather than prioritizing the accuracy score, we considered the classifier with the highest AUC score as the best-performing model for the given project. For instance, for the cm1 project with all features, a higher accuracy of 89% was observed for the ANN classifier compared to 64% for the KNN classifier. However, we considered the KNN classifier the best-performing model, as its AUC score of 68% was the highest among all the classifiers, because our primary goal was to identify defective modules rather than only detect non-defective ones.
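The selection rule described here, preferring the highest AUC over the highest accuracy, amounts to a simple maximization over the per-project scores; the sketch below uses the cm1 KNN/ANN figures quoted above, with the ANN AUC shown as an illustrative placeholder rather than a value from Table 10.

```python
# Minimal sketch: choose the best-performing classifier for one project by AUC,
# not accuracy. KNN/ANN accuracies and the KNN AUC follow the cm1 example above;
# the ANN AUC here is an illustrative placeholder, not a reported value.
results = {
    "KNN": {"accuracy": 0.64, "auc": 0.68},
    "ANN": {"accuracy": 0.89, "auc": 0.60},  # placeholder AUC
}
best = max(results, key=lambda name: results[name]["auc"])
print(best)  # KNN is selected: highest AUC despite ANN's higher accuracy
```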
In both cases, all features and reduced features, the KNN algorithm performed well for the relatively smaller projects, namely kc2 and cm1. It is also worth noting that the KNN model achieved a higher accuracy of 75% in the cross-project model, surpassing the 74% average accuracy of the independent projects. On the other hand, the AUC score of 56% obtained by the same classifier can be considered relatively poor. Taking the overall AUC and accuracy into consideration, the ANN model outperformed the other classifiers in the cross-project models.
The top half of this
Table 10 reports the accuracy and AUC score of each project for the selected machine learning algorithms. These algorithms were applied to the full-feature datasets, except for the LocCodeAndComment feature. The bottom half of the table follows the same strategy, except that the SDP models were created on the reduced-feature datasets obtained after removing the highly correlated independent features. The row denoted "Average" reports the average accuracy and AUC scores of the independent projects for each of the SVM, KNN, RF, and ANN models; these averages are compared with the cross-project (CP) SDP model scores.
The SVM model performed better on the kc1 project, whose dataset contains 718 instances after the filters were applied. Compared to the other datasets, this one is considered mid-sized (
Table 5).
In terms of the cross-project datasets, the ANN model developed on the reduced features achieved a slightly higher AUC score of 61%, compared to 58% with all features. For the pc1 project, which has a mid-sized sample, the ANN classifier showed the higher accuracy, with values of 95% and 92% for the all-feature and reduced-feature datasets, respectively. Although the same ANN model with all features achieved the highest AUC of 77% on the pc1 dataset, the AUC dropped to 68% on the reduced-feature dataset.
Figure 4a shows the SHAP global explanation of the cross-project defect prediction model built with the ANN classifier on the balanced dataset with all features. The feature with the highest mean SHAP value for this project is branchcount, followed by IOBlank. Although this model includes all the features, the impact of the attribute effort ("e") ranks lowest compared to the other source code and size metrics. For the same ANN classifier with reduced features, SHAP was applied to provide the global explanation, as shown in
Figure 4b. This figure illustrates that, in the SDP model constructed with the ANN classifier, the "loc" attribute makes the highest contribution to the prediction outcome, followed by cyclomatic complexity and intelligence. "Effort" is a predictor with more importance than program length and lines of code, but less than the program complexity-related features. Additionally, the "project" field is shown to have the lowest importance, indicating that the project source, which encodes the cross-project information, does not significantly influence this model.
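For the ANN classifier, which is not tree-based, a model-agnostic SHAP explainer can be used; the paper does not state which explainer was employed, so the following is a sketch with the KernelExplainer, assuming an ann model exposing predict_proba and the splits X_train and X_test.

```python
# Minimal sketch (assumed names ann, X_train, X_test): a global SHAP explanation
# for a non-tree model such as the ANN classifier. KernelExplainer is model-agnostic
# but slow, so a small background sample and a subset of test rows are used.
import shap

background = shap.sample(X_train, 100)               # background data for the expectation
explainer = shap.KernelExplainer(ann.predict_proba, background)
shap_values = explainer.shap_values(X_test[:50])
# Older SHAP versions return one array per class; index the defective class if so.
sv = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(sv, X_test[:50])
```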
Figure 5 demonstrates the interpretation of a local observation of the same ANN-based model using LIME. The LIME model predicted this record as non-defective with a confidence level of 71%. In this figure, the blue color represents feature contributions towards the instance being predicted as non-defective, whereas the orange color represents contributions towards it being defective. The effort attribute was one of the important contributors in this prediction; a value of less than −0.61 pushes the prediction towards a value of 0 (non-defective). Similarly, the cyclomatic complexity of the code was less than −64, and intelligence also contributed to the prediction of non-defectiveness. It appears that the attribute "intelligence", which measures the amount of intelligence content in the program, was lower than the given threshold of −75 required to be considered defective. This local prediction explains that the selected module had values below the set thresholds for effort, cyclomatic complexity, and intelligence to be considered defective. At the same time, the same prediction shows a 29% probability of this module being defective, as loc, l, and LOComment are higher than, and ev(g) is lower than, the given thresholds. A test lead can better assess the outcome by observing the impact of each attribute on this prediction. This aligns with the logical expectation that software that requires less effort and is not very complex may have relatively fewer bugs than a complex system.
The interpretability of the kc1 project, which has a mid-sized sample (based on the number of instances available), is examined in
Figure 6 and
Figure 7. Both the LIME and SHAP interpretations show the importance of the "effort" feature for this algorithm. The local prediction using the LIME method explained the observation with 60% confidence that this module is not likely to be defective. In both interpretations, the "loc" feature is less important than the effort metric. We observe that there can be slight differences between these explanations: the global model-agnostic method provides generic information, while the LIME method offers insight into individual predictions.