Effectively Combining Risk Evaluation Metrics for Precise Fault Localization
Abstract
1. Introduction
- Two positively correlated maximal risk evaluation metrics are already good at locating software faults. Their combination might outweigh their individual fault localization effectiveness.
- Two negatively correlated maximal risk evaluation metrics are good at locating different sorts of faults. Their combined performance might outweigh their individual fault localization effectiveness.
- Two positively correlated non-maximal risk evaluation metrics are not good at locating software faults, yet their combination might increase the overall fault localization effectiveness.
- Two negatively correlated non-maximal risk evaluation metrics are not good at locating different sorts of faults, yet combining them might increase overall fault localization effectiveness.
- Two positively or negatively correlated maximal and non-maximal risk evaluation metrics, where one is good at locating software faults and the other is not, might complement each other, and their combination might outweigh their individual fault localization effectiveness.
- An empirical study is performed to explore the performance and combination of different risk evaluation metrics based on whether two risk evaluation metrics are maximal or non-maximal, correlated positively or negatively, and the magnitude of the correlation. We then evaluate which combination is effective for fault localization.
- Experimental results demonstrate that negatively correlated risk evaluation metrics are best combined for effective fault localization, especially those with high negative correlation values, because they can outperform both the existing risk evaluation metrics and the individual metrics that make up the combination.
- This is the first study to use the degree of correlation among metrics of the same fault localization family, spectrum-based fault localization, to suggest the ideal techniques to combine.
2. Spectrum-Based Fault Localization
3. Combining the Risk Evaluation Metrics
3.1. Selection of Suspiciousness Metrics for Combination
- (1) Both risk evaluation metrics are maximal.
- (2) Both risk evaluation metrics are non-maximal.
- (3) One risk evaluation metric is maximal, and the other is non-maximal.
- (1) Both risk evaluation metrics have high (H) correlation.
- (2) Both risk evaluation metrics have moderate (M) correlation.
- (3) Both risk evaluation metrics have low (L) correlation.
- (4) Both risk evaluation metrics have negligible (N) correlation.
- (5) Both risk evaluation metrics have neutral (U) correlation.
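To make the grouping criteria above concrete, the sketch below bins a Pearson coefficient into these five categories. The cut-off values (0.70, 0.50, 0.30, 0.10) and the function name `correlation_category` are illustrative assumptions, not values taken from the paper.

```python
def correlation_category(r: float) -> str:
    """Bin a Pearson correlation coefficient into a strength category.

    The cut-off values below are illustrative assumptions, not the
    exact ranges used in the paper.
    """
    magnitude = abs(r)
    if magnitude >= 0.70:
        return "H"   # high
    if magnitude >= 0.50:
        return "M"   # moderate
    if magnitude >= 0.30:
        return "L"   # low
    if magnitude >= 0.10:
        return "N"   # negligible
    return "U"       # neutral

# Example: describe a metric pair by the sign and strength of its correlation.
r = -0.65
print(correlation_category(r), "negative" if r < 0 else "positive")
```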
Algorithm 1 Metrics Selection.
Require: risk evaluation metrics (I), faulty and non-faulty programs (F)
  I ▹ list of metrics
  F ▹ list of the 30 selected faulty and non-faulty programs
  for each i ∈ I do
    for each f ∈ F do
      compute wasted efforts for each statement
    end for
  end for
  for all metric pairs in I do
    compute the Pearson correlation (r)
  end for
  for all r do
    compare the correlation of each pair, and group in twos accordingly
    extract the possible combinations as suggested above
    group the possible combinations
  end for
  return grouped risk evaluation metrics
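A minimal sketch of this selection step in Python, assuming each metric's per-fault wasted-effort values are already available. The dictionary `wasted_effort`, its sample numbers, and the pairing loop are illustrative; SciPy's `pearsonr` stands in for the Pearson computation named in the algorithm.

```python
from itertools import combinations
from scipy.stats import pearsonr

# wasted_effort[m] = wasted-effort values of metric m over the sampled faults
# (illustrative numbers; in the study these come from running each metric on F).
wasted_effort = {
    "OP2":    [3, 1, 7, 2, 9],
    "Wong1":  [5, 2, 8, 3, 11],
    "Rogot1": [12, 9, 4, 10, 2],
}

pairs = []
for (m1, v1), (m2, v2) in combinations(wasted_effort.items(), 2):
    r, _ = pearsonr(v1, v2)          # correlation of the two metrics' wasted efforts
    pairs.append((m1, m2, r))

# Group pairs by sign and magnitude of r (see the categories in Section 3.1).
for m1, m2, r in sorted(pairs, key=lambda p: p[2]):
    sign = "negative" if r < 0 else "positive"
    print(f"{m1} & {m2}: r = {r:+.3f} ({sign})")
```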
3.2. Combination Method
Algorithm 2 Metrics Combination.
Require: paired risk evaluation metrics (I), faulty and non-faulty programs (F)
  I ▹ list of paired metrics
  F ▹ list of the faulty programs
  for each pair (a, b) ∈ I do
    for each f ∈ F do
      compute suspiciousness scores for each statement using metric ‘a’
      compute suspiciousness scores for each statement using metric ‘b’
    end for
  end for
  for all computed suspiciousness scores do
    normalize using MinMaxScaler()
    A ▹ normalized metric ‘a’
    B ▹ normalized metric ‘b’
  end for
  for each f ∈ F do
    for each statement do
      combine A and B ▹ Final Suspiciousness score
    end for
  end for
  return Final Rank
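A minimal sketch of the combination step for one faulty program, using scikit-learn's MinMaxScaler as named in the algorithm. The function `combine_metrics` and its sample inputs are illustrative, and averaging the two normalized scores is an assumption: the algorithm above only states that a final suspiciousness score is formed from the normalized values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def combine_metrics(scores_a, scores_b):
    """Combine two suspiciousness score vectors for one faulty program.

    scores_a, scores_b: per-statement scores from metrics 'a' and 'b'.
    Returns statement indices ranked from most to least suspicious, plus scores.
    """
    a = np.asarray(scores_a, dtype=float).reshape(-1, 1)
    b = np.asarray(scores_b, dtype=float).reshape(-1, 1)

    # Normalize each metric to [0, 1] so neither dominates the other.
    a_norm = MinMaxScaler().fit_transform(a).ravel()
    b_norm = MinMaxScaler().fit_transform(b).ravel()

    # Assumed combination operator: the mean of the two normalized scores.
    final = (a_norm + b_norm) / 2.0

    # Final rank: statements sorted by descending combined suspiciousness.
    return np.argsort(-final), final

rank, final = combine_metrics([0.9, 0.2, 0.4], [10.0, 3.0, 8.0])
print(rank)    # [0 2 1]
```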
4. Experimental Design
4.1. Fault Localization
4.2. Dataset
- Small: Projects with ≤10 KLOC, such as Flex, Sed, Grep, and Gzip, with an average of 3037 executed statements and 92 faults.
- Medium: Projects with >10 KLOC and ≤50 KLOC, such as Lang and Time, with an average of 5725 executed statements and 92 faults.
- Large: Projects with >50 KLOC, such as Math, Closure, and Chart, with an average of 14,333 executed statements and 265 faults.
4.3. Research Questions
- RQ1. Which combined metrics perform the best among the combinations?
- RQ2. How do the best-performing combined risk evaluation metrics compare against the performance of standalone maximal and non-maximal risk evaluation metrics? To answer this research question, we select the best-performing risk evaluation metrics among the combined (maximal and maximal, non-maximal and maximal, and non-maximal and non-maximal) risk evaluation metrics by comparing all the combined risk evaluation metrics with each other. If a combined risk evaluation metric outperforms all others in two or more categories of the partitioned dataset, it is considered better than the other risk evaluation metrics, and we select it to represent its group of combinations. This resulted in 6 combined risk evaluation metrics compared with 14 maximal and 16 non-maximal risk evaluation metrics; in total, we compared 36 metrics.
- RQ3. Is there any statistical performance difference between the combined and standalone maximal and non-maximal risk evaluation metrics? This research question statistically analyses the overall performance differences between the combined and existing risk evaluation metrics. We combine all the datasets for this experiment and set it to iterate fifteen times, using each risk evaluation metric to compute the average wasted effort for each fault (see Section 4.4.3 for details). The experiment is instrumented to automatically exclude five faults per iteration so that each risk evaluation metric obtains a different wasted-effort value in each iteration. We then use the Wilcoxon signed-rank test to assess the statistical differences and Cliff’s delta to measure the effect sizes on the scores computed above, examining the significance of the performance difference between the combined and existing risk evaluation metrics (a sketch of this protocol appears below).
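A minimal sketch of the RQ3 protocol described above, assuming the per-fault wasted-effort values of a combined and an existing metric are held in two parallel arrays. The random data, the fixed seed, and the way the five excluded faults are drawn are illustrative only.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Illustrative per-fault wasted-effort values for a combined and an existing metric
# (449 faults in total across the three dataset partitions).
awe_combined = rng.integers(1, 500, size=449).astype(float)
awe_existing = rng.integers(1, 500, size=449).astype(float)

combined_means, existing_means = [], []
for _ in range(15):                                  # fifteen iterations
    excluded = rng.choice(len(awe_combined), size=5, replace=False)
    keep = np.ones(len(awe_combined), dtype=bool)
    keep[excluded] = False                           # drop five faults this iteration
    combined_means.append(awe_combined[keep].mean())
    existing_means.append(awe_existing[keep].mean())

# Paired test on the fifteen average-wasted-effort scores per metric.
stat, p = wilcoxon(combined_means, existing_means)
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p:.4f}")
```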
4.4. Evaluation Metrics
4.4.1. Exam Score
4.4.2. acc@n
4.4.3. Average Wasted Effort (AWE)
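A minimal sketch of these three evaluation metrics under their commonly used definitions (Exam as the percentage of statements examined before reaching the fault, acc@n as the number of faults ranked within the top n, and AWE as the mean number of non-faulty statements ranked above the fault). Tie handling (Section 4.5) is omitted, and the function names are illustrative.

```python
def exam_score(rank_of_fault: int, total_statements: int) -> float:
    """Exam: percentage of statements a developer examines to reach the fault
    (standard definition; assumes a 1-based rank)."""
    return 100.0 * rank_of_fault / total_statements

def acc_at_n(ranks_of_faults, n: int) -> int:
    """acc@n: number of faults whose faulty statement is ranked within the top n."""
    return sum(1 for r in ranks_of_faults if r <= n)

def average_wasted_effort(ranks_of_faults) -> float:
    """AWE: mean number of non-faulty statements inspected before each fault."""
    wasted = [r - 1 for r in ranks_of_faults]    # statements ranked above the fault
    return sum(wasted) / len(wasted)

ranks = [1, 4, 12, 3]                            # illustrative fault ranks
print(exam_score(4, 3037), acc_at_n(ranks, 5), average_wasted_effort(ranks))
```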
4.5. Tie Breaking
4.6. Statistical Tests
4.6.1. Wilcoxon Signed-Rank Test
4.6.2. Cliff’s Delta
- |δ| < 0.147 (negligible): there is no difference in the performance of the two risk evaluation metrics, and they are essentially the same.
- 0.147 ≤ |δ| < 0.33 (small): there is a small difference in the performance of the two risk evaluation metrics.
- 0.33 ≤ |δ| < 0.474 (medium): there is a medium difference in the performance of the two risk evaluation metrics.
- |δ| ≥ 0.474 (large): there is a large difference in the performance of the two risk evaluation metrics.
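A minimal sketch of Cliff's delta and the interpretation above; the threshold labels follow the commonly used Romano et al. convention cited in the references, and the function names are illustrative.

```python
def cliffs_delta(xs, ys) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all pairs from the two samples."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def effect_size_label(delta: float) -> str:
    """Interpretation thresholds as proposed by Romano et al."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

d = cliffs_delta([12, 9, 15, 7], [20, 18, 25, 16])
print(d, effect_size_label(d))    # -1.0 large
```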
5. Results
5.1. RQ1: Which Combined Metrics Perform the Best among the Combinations?
- Negative Correlation
- Positive Correlation
- Answer to RQ1:
5.2. RQ2: How Do the Best Performing Combined Risk Evaluation Metrics Compare against the Performance of Standalone Maximal and Non-Maximal Risk Evaluation Metrics?
- Small Programs: The small-program benchmarks assess the performance of the combined risk evaluation metrics against the existing ones. This analysis again shows that some existing risk evaluation metrics, such as OP2, are optimal in small programs: OP2 is more effective than all the combined risk evaluation metrics in terms of Exam and wasted effort. Furthermore, the combined risk evaluation metric comprising OP2 and Wong1 performs like OP2 and better than Wong1. Apart from the combination of two positively correlated maximal risk evaluation metrics, all other combinations outperformed the individual risk evaluation metrics that form them in small programs.
- Medium Programs: In medium programs, the combined risk evaluation metric comprising Jaccard and Harmonic-Mean outperformed all other combined and existing risk evaluation metrics in terms of Exam, and it also outperformed the standalone risk evaluation metrics used to form it. In terms of wasted effort, OP2 is more effective than all other risk evaluation metrics. Apart from two highly correlated maximal risk evaluation metrics and two low negatively correlated non-maximal risk evaluation metrics, all other combinations outperformed the standalone risk evaluation metrics that form them.
- Large Programs: The combined risk evaluation metric comprising two lowly correlated non-maximal risk evaluation metrics, Rogot1 and Barinel, outperformed all other studied risk evaluation metrics in large programs in terms of Exam and wasted effort. Except for the combination of two negligibly negatively correlated non-maximal risk evaluation metrics in terms of Exam, all other combined risk evaluation metrics outperformed the individual existing risk evaluation metrics that form them. Therefore, two lowly correlated non-maximal risk evaluation metrics are suitable for large programs.
- Answer to RQ2:
5.3. RQ3: Is There Any Statistical Performance Difference between the Combined and Standalone Maximal and Non-Maximal Risk Evaluation Metrics?
- Answer to RQ3:
5.4. Discussion
- Two positively correlated maximal metrics are already good at locating software faults. Their combination cannot outweigh their individual performance and does not enhance fault localization.
- Two negatively correlated maximal risk evaluation metrics are good at locating different sorts of faults. Their combined performance can outweigh their individual fault localization effectiveness, provided their degree of correlation is moderate or high.
- Two positively correlated non-maximal metrics are not good at locating software faults, yet their combination can increase the overall fault localization effectiveness, provided their degree of correlation is low or neutral.
- Two negatively correlated non-maximal metrics are not good at locating software faults, and their combination cannot improve fault localization effectiveness.
- Two negatively correlated maximal and non-maximal metrics, where one is good at locating software faults and the other is not, can complement each other provided they have low correlation, and their combination can outweigh their individual fault localization effectiveness.
6. Threats to Validity
- Construct validity: The threat to construct validity relates to the program granularity used for spectrum-based fault localization. In this study, we localize faults at the statement level. Since this is the smallest possible granularity and captures program behaviour at a very fine level, this threat to construct validity is minimized. Furthermore, this is in line with existing work: many previous studies have localized faults at statement granularity.
- Internal validity: The threat to internal validity relates to the evaluation metrics used to compare different risk evaluation metrics. One technique might be better than another for a particular evaluation metric. Therefore, we use three evaluation metrics that concern different aspects: the Exam score, Average Wasted Effort, and acc@n. Since each metric concerns a different aspect and has been used in previous studies [12,13,16,44], this threat is reasonably mitigated.
- External validity: The evaluation of risk evaluation metrics in this study depends on the dataset of subject programs used and may not be generalizable. Indeed, the results and findings of many fault localization studies are not directly generalizable. Another threat to external validity is our method of determining which risk evaluation metrics are suitable for combination. We initially computed performance correlations between different risk evaluation metrics on 30 randomly selected faults from the SIR repository and Defects4J; a different set of randomly selected faults from a different dataset could well yield different correlation results. Nonetheless, this study suggests the metrics best suited for combination for effective fault localization.
7. Conclusions
- Finding 1: The strongest negative correlation obtained between two maximal risk evaluation metrics is −0.129. Their combination outperformed the two risk evaluation metrics it combines but did not outperform the best existing risk evaluation metric in this study, OP2. This suggests that two maximal risk evaluation metrics are very likely to yield effective fault localization when their negative correlation is strong, with a magnitude of at least 0.70.
- Finding 2: In practice, combining a maximal and a non-maximal risk evaluation metric with moderate or high correlation power, whether positive or negative, can only outperform one of the two individual risk evaluation metrics, not both. For effective fault localization, it is best to combine maximal and non-maximal metrics with low correlation power.
- Finding 3: Combining two non-maximal risk evaluation metrics with high or moderate correlation power, whether positive or negative, cannot outperform the individual risk evaluation metrics. In contrast, two non-maximal risk evaluation metrics with low or negligible positive correlation power can be considered for effective fault localization.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Jones, J.A.; Harrold, M.J.; Stasko, J. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering, ICSE 2002, Orlando, FL, USA, 25 May 2002; pp. 467–477. [Google Scholar]
- Abreu, R.; Zoeteweij, P.; Golsteijn, R.; Van Gemund, A.J.C. A practical evaluation of spectrum-based fault localization. J. Syst. Softw. 2009, 82, 1780–1792. [Google Scholar] [CrossRef]
- Wong, C.P.; Santiesteban, P.; Kästner, C.; Le Goues, C. VarFix: Balancing edit expressiveness and search effectiveness in automated program repair. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 23–28 August 2021; pp. 354–366. [Google Scholar]
- Ye, H.; Martinez, M.; Durieux, T.; Monperrus, M. A comprehensive study of automatic program repair on the QuixBugs benchmark. J. Syst. Softw. 2021, 171, 110825. [Google Scholar] [CrossRef]
- Zou, D.; Liang, J.; Xiong, Y.; Ernst, M.D.; Zhang, L. An Empirical Study of Fault Localization Families and Their Combinations. IEEE Trans. Softw. Eng. 2019, 47, 332–347. [Google Scholar] [CrossRef] [Green Version]
- Srivastava, S. A Study on Spectrum Based Fault Localization Techniques. J. Comput. Eng. Inf. Technol. 2021, 4, 2. [Google Scholar]
- Ghosh, D.; Singh, J. Spectrum-based multi-fault localization using Chaotic Genetic Algorithm. Inf. Softw. Technol. 2021, 133, 106512. [Google Scholar] [CrossRef]
- Jiang, J.; Wang, R.; Xiong, Y.; Chen, X.; Zhang, L. Combining spectrum-based fault localization and statistical debugging: An empirical study. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, 11–15 November 2019; pp. 502–514. [Google Scholar]
- Dallmeier, V.; Lindig, C.; Zeller, A. Lightweight defect localization for java. In European Conference on Object-Oriented Programming; Springer: Berlin/Heidelberg, Germany, 2005; pp. 528–550. [Google Scholar]
- Ajibode, A.; Shu, T.; Said, K.; Ding, Z. A Fault Localization Method Based on Metrics Combination. Mathematics 2022, 10, 2425. [Google Scholar] [CrossRef]
- Wong, C.P.; Xiong, Y.; Zhang, H.; Hao, D.; Zhang, L.; Mei, H. Boosting bug-report-oriented fault localization with segmentation and stack-trace analysis. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, 29 September–3 October 2014; pp. 181–190. [Google Scholar]
- Yoo, S. Evolving human competitive spectra-based fault localisation techniques. In International Symposium on Search Based Software Engineering; Springer: Berlin/Heidelberg, Germany, 2012; pp. 244–258. [Google Scholar]
- Ajibode, A.A.; Shu, T.; Ding, Z. Evolving Suspiciousness Metrics From Hybrid Data Set for Boosting a Spectrum Based Fault Localization. IEEE Access 2020, 8, 198451–198467. [Google Scholar] [CrossRef]
- Wu, T.; Dong, Y.; Lau, M.F.; Ng, S.; Chen, T.Y.; Jiang, M. Performance Analysis of Maximal Risk Evaluation Formulas for Spectrum-Based Fault Localization. Appl. Sci. 2020, 10, 398. [Google Scholar] [CrossRef] [Green Version]
- Heiden, S.; Grunske, L.; Kehrer, T.; Keller, F.; Van Hoorn, A.; Filieri, A.; Lo, D. An evaluation of pure spectrum-based fault localization techniques for large-scale software systems. Softw. Pract. Exp. 2019, 49, 1197–1224. [Google Scholar] [CrossRef]
- Wong, W.E.; Debroy, V.; Gao, R.; Li, Y. The DStar method for effective software fault localization. IEEE Trans. Reliab. 2013, 63, 290–308. [Google Scholar] [CrossRef]
- Naish, L.; Lee, H.J.; Ramamohanarao, K. A model for spectra-based software diagnosis. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2011, 20, 1–32. [Google Scholar] [CrossRef]
- Liblit, B.; Naik, M.; Zheng, A.X.; Aiken, A.; Jordan, M.I. Scalable statistical bug isolation. Acm Sigplan Not. 2005, 40, 15–26. [Google Scholar] [CrossRef] [Green Version]
- Chen, M.Y.; Kiciman, E.; Fratkin, E.; Fox, A.; Brewer, E. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of the International Conference on Dependable Systems and Networks, Washington, DC, USA, 23–26 June 2002; pp. 595–604. [Google Scholar]
- Abreu, R.; Zoeteweij, P.; Van Gemund, A.J.C. An evaluation of similarity coefficients for software fault localization. In Proceedings of the 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC’06), Riverside, CA, USA, 18–20 December 2006; pp. 39–46. [Google Scholar]
- Wong, W.E.; Qi, Y.; Zhao, L.; Cai, K.Y. Effective fault localization using code coverage. In Proceedings of the 31st Annual International Computer Software and Applications Conference (COMPSAC 2007), Beijing, China, 24–27 July 2007; Volume 1, pp. 449–456. [Google Scholar]
- Yoo, S.; Xie, X.; Kuo, F.C.; Chen, T.Y.; Harman, M. Human competitiveness of genetic programming in spectrum-based fault localisation: Theoretical and empirical analysis. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2017, 26, 1–30. [Google Scholar] [CrossRef]
- Wang, S.; Lo, D.; Jiang, L.; Lau, H.C. Search-based fault localization. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), Lawrence, KS, USA, 6–10 November 2011; pp. 556–559. [Google Scholar]
- Xuan, J.; Monperrus, M. Learning to Combine Multiple Ranking Metrics for Fault Localization. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, 6 December 2014; pp. 191–200. [Google Scholar] [CrossRef] [Green Version]
- Zhang, X.Y.; Jiang, M. SPICA: A Methodology for Reviewing and Analysing Fault Localisation Techniques. In Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), Luxembourg, 27 September–1 October 2021; pp. 366–377. [Google Scholar]
- Xie, X.; Chen, T.Y.; Kuo, F.C.; Xu, B. A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2013, 22, 1–40. [Google Scholar] [CrossRef]
- Xie, X.; Kuo, F.C.; Chen, T.Y.; Yoo, S.; Harman, M. Provably optimal and human-competitive results in sbse for spectrum based fault localisation. In International Symposium on Search Based Software Engineering; Springer: Berlin/Heidelberg, Germany, 2013; pp. 224–238. [Google Scholar]
- Le, T.D.B.; Lo, D.; Le Goues, C.; Grunske, L. A learning-to-rank based fault localization approach using likely invariants. In Proceedings of the 25th International Symposium on Software Testing and Analysis, Saarbrücken, Germany, 18–20 July 2016; pp. 177–188. [Google Scholar]
- Ernst, M.D.; Cockrell, J.; Griswold, W.G.; Notkin, D. Dynamically discovering likely program invariants to support program evolution. IEEE Trans. Softw. Eng. 2001, 27, 99–123. [Google Scholar] [CrossRef]
- Just, R.; Jalali, D.; Ernst, M.D. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, San Jose, CA, USA, 21–25 July 2014; pp. 437–440. [Google Scholar]
- Sohn, J.; Yoo, S. Fluccs: Using code and change metrics to improve fault localization. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, Santa Barbara, CA, USA, 10–14 July 2017; ACM: New York, NY, USA, 2017; pp. 273–283. [Google Scholar]
- Kim, Y.; Mun, S.; Yoo, S.; Kim, M. Precise learn-to-rank fault localization using dynamic and static features of target programs. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2019, 28, 1–34. [Google Scholar] [CrossRef]
- Jones, J.A.; Harrold, M.J. Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, Long Beach, CA, USA, 7–11 November 2005; Association for Computing Machinery: New York, NY, USA, 2005. ASE ’05. pp. 273–282. [Google Scholar] [CrossRef] [Green Version]
- Ochiai, A. Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bull. Jpn. Soc. Sci. Fish. 1957, 22, 526–530. [Google Scholar] [CrossRef] [Green Version]
- Choi, S.S.; Cha, S.H.; Tappert, C.C. A survey of binary similarity and distance measures. J. Syst. Cybern. Informatics 2010, 8, 43–48. [Google Scholar]
- Van de Vijver, F.J.; Leung, K. Methods and Data Analysis for Cross-Cultural Research; Cambridge University Press: Cambridge, UK, 2021; Volume 116. [Google Scholar]
- Golagha, M.; Pretschner, A.; Briand, L.C. Can we predict the quality of spectrum-based fault localization? In Proceedings of the 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST), Porto, Portugal, 24–28 October 2020; pp. 4–15. [Google Scholar]
- Zheng, W.; Hu, D.; Wang, J. Fault localization analysis based on deep neural network. Math. Probl. Eng. 2016, 2016. [Google Scholar] [CrossRef] [Green Version]
- Lo, D.; Jiang, L.; Budi, A. Comprehensive evaluation of association measures for fault localization. In Proceedings of the 2010 IEEE International Conference on Software Maintenance, Timisoara, Romania, 12–18 September 2010; pp. 1–10. [Google Scholar]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Rothermel, G.; Elbaum, S.; Kinneer, A.; Do, H. Software-artifact infrastructure repository. 2006. Available online: http://sir.unl.edu/portal (accessed on 10 December 2020).
- Zhang, X.; Gupta, N.; Gupta, R. Locating faults through automated predicate switching. In Proceedings of the 28th International Conference on Software Engineering, Shanghai, China, 20–28 May 2006; pp. 272–281. [Google Scholar]
- Laghari, G.; Murgia, A.; Demeyer, S. Fine-Tuning Spectrum Based Fault Localisation with Frequent Method Item Sets. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, Singapore, 3–7 September 2016; Association for Computing Machinery: New York, NY, USA, 2016. ASE 2016. pp. 274–285. [Google Scholar] [CrossRef]
- Pearson, S.; Campos, J.; Just, R.; Fraser, G.; Abreu, R.; Ernst, M.D.; Pang, D.; Keller, B. Evaluating and improving fault localization. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), Buenos Aires, Argentina, 20–28 May 2017; pp. 609–620. [Google Scholar]
- Just, R.; Parnin, C.; Drosos, I.; Ernst, M.D. Comparing developer-provided to user-provided tests for fault localization and automated program repair. In Proceedings of the ISSTA 2018, Proceedings of the 2018 International Symposium on Software Testing and Analysis, Amsterdam, The Netherlands, 16–21 July 2018; pp. 287–297.
- Chen, Z.; Kommrusch, S.J.; Tufano, M.; Pouchet, L.; Poshyvanyk, D.; Monperrus, M. SEQUENCER: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Trans. Softw. Eng. 2019, 47, 1943–1959. [Google Scholar] [CrossRef] [Green Version]
- Zhang, Z.; Chan, W.K.; Tse, T.H.; Jiang, B.; Wang, X. Capturing propagation of infected program states. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Amsterdam, The Netherlands, 24–28 August 2009; pp. 43–52. [Google Scholar]
- Debroy, V.; Wong, W.E.; Xu, X.; Choi, B. A grouping-based strategy to improve the effectiveness of fault localization techniques. In Proceedings of the 2010 10th International Conference on Quality Software, Zhangjiajie, China, 14–15 July 2010; pp. 13–22. [Google Scholar]
- Keller, F.; Grunske, L.; Heiden, S.; Filieri, A.; van Hoorn, A.; Lo, D. A critical evaluation of spectrum-based fault localization techniques on a large-scale software system. In Proceedings of the 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), Prague, Czech Republic, 25–29 July 2017; pp. 114–125. [Google Scholar]
- de Souza, H.A.; Chaim, M.L.; Kon, F. Spectrum-based software fault localization: A survey of techniques, advances, and challenges. arXiv 2016, arXiv:1607.04347. [Google Scholar]
- Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Cliff, N. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 1993, 114, 494. [Google Scholar] [CrossRef]
- Romano, J.; Kromrey, J.D.; Coraggio, J.; Skowronek, J.; Devine, L. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen’s d indices the most appropriate choices. In Proceedings of the Annual Meeting of the Southern Association for Institutional Research, Arlington, USA, 14–17 October 2006; pp. 1–51. [Google Scholar]
- Zhang, L.; Yan, L.; Zhang, Z.; Zhang, J.; Chan, W.; Zheng, Z. A theoretical analysis on cloning the failed test cases to improve spectrum-based fault localization. J. Syst. Softw. 2017, 129, 35–57. [Google Scholar] [CrossRef]
- Ribeiro, H.L.; de Araujo, P.R.; Chaim, M.L.; de Souza, H.A.; Kon, F. Evaluating data-flow coverage in spectrum-based fault localization. In Proceedings of the 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Porto de Galinhas, Brazil, 19–20 September 2019; pp. 1–11. [Google Scholar]
- Vancsics, B.; Szatmári, A.; Beszédes, Á. Relationship between the effectiveness of spectrum-based fault localization and bug-fix types in javascript programs. In Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), London, ON, Canada, 18–21 February 2020; pp. 308–319. [Google Scholar]
Name | Definition |
---|---|
Tarantula [33] | $\dfrac{e_f/(e_f+n_f)}{e_f/(e_f+n_f)+e_p/(e_p+n_p)}$ |
Ample [9] | $\left|\dfrac{e_f}{e_f+n_f}-\dfrac{e_p}{e_p+n_p}\right|$ |
Ochiai1 [34] | $\dfrac{e_f}{\sqrt{(e_f+n_f)(e_f+e_p)}}$ |
Jaccard [19] | $\dfrac{e_f}{e_f+n_f+e_p}$ |
Ochiai2 [34] | $\dfrac{e_f\,n_p}{\sqrt{(e_f+e_p)(n_f+n_p)(e_f+n_p)(e_p+n_f)}}$ |
Kulczynski1 [35] | $\dfrac{e_f}{n_f+e_p}$ |
OP2 [17] | $e_f-\dfrac{e_p}{e_p+n_p+1}$ |
D*2 [11] | $\dfrac{e_f^2}{e_p+n_f}$ |
GP02 [12] | $2(e_f+\sqrt{n_p})+\sqrt{e_p}$ |
GP03 [12] | $\sqrt{\left|e_f^2-\sqrt{e_p}\right|}$ |
GP19 [12] | $e_f\sqrt{\left|e_p-e_f+n_f-n_p\right|}$ |
Wong1 [21] | $e_f$ |
Wong2 [21] | $e_f-e_p$ |
Wong3 [21] | $e_f-h$, where $h=e_p$ if $e_p\le 2$; $h=2+0.1(e_p-2)$ if $2<e_p\le 10$; $h=2.8+0.001(e_p-10)$ if $e_p>10$ |

Here, $e_f$ and $e_p$ denote the numbers of failed and passed test cases that execute a statement, while $n_f$ and $n_p$ denote the numbers of failed and passed test cases that do not execute it.
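For illustration, three of these formulas expressed as Python functions of the spectrum counts; the parameter names (ef, ep, nf, np_) and function names are assumptions for this sketch, not the paper's notation.

```python
import math

def op2(ef: int, ep: int, nf: int, np_: int) -> float:
    """OP2: ef - ep / (ep + np + 1)."""
    return ef - ep / (ep + np_ + 1)

def ochiai1(ef: int, ep: int, nf: int, np_: int) -> float:
    """Ochiai1: ef / sqrt((ef + nf) * (ef + ep))."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def tarantula(ef: int, ep: int, nf: int, np_: int) -> float:
    """Tarantula: failed-execution ratio over (failed ratio + passed ratio)."""
    fail = ef / (ef + nf) if ef + nf else 0.0
    pas = ep / (ep + np_) if ep + np_ else 0.0
    return fail / (fail + pas) if fail + pas else 0.0

# Example spectrum of one statement: 3 failing and 1 passing test execute it,
# 0 failing and 6 passing tests do not.
print(op2(3, 1, 0, 6), ochiai1(3, 1, 0, 6), tarantula(3, 1, 0, 6))
```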
OP2 | Tarantula | Ochiai1 | Ample | Jaccard | D*2 | GP02 | Dice | Rogers-Tanimoto | SEM3 | Barinnel | Hamman | SEM1 | SEM2 | Rogot1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Ochiai2 | 0.005 | ||||||||||||||
D*2 | |||||||||||||||
GP02 | 0.005 | ||||||||||||||
GP03 | 0.439 | ||||||||||||||
GP19 | 0.537 | −0.002 | |||||||||||||
Wong1 | 0.999 | ||||||||||||||
Wong2 | −0.119 | −0.129 | |||||||||||||
Wong3 | −0.119 | −0.129 | |||||||||||||
Euclid | −0.371 | −0.007 | |||||||||||||
Dice | 0.860 | ||||||||||||||
Rogers-Tanimoto | −0.456 | ||||||||||||||
SEM3 | 0.860 | ||||||||||||||
Russel&Rao | 0.999 | ||||||||||||||
Barinnel | 0.633 | 0.468 | |||||||||||||
M1 | 0.005 | ||||||||||||||
Harmonic-Mean | 0.472 | ||||||||||||||
Scott | −0.414 | ||||||||||||||
SEM1 | −0.361 | ||||||||||||||
Rogot1 | 0.468 | ||||||||||||||
M2 | 0.005 | ||||||||||||||
Cohen | 0.597 | ||||||||||||||
Fleiss | −0.985 |
GP19 | Rogers-Tanimoto | Wong1 | Russel&Rao | GP19 (n) | Rogers-Tanimoto (n) | Wong1 (n) | Russel&Rao (n) | Randomly | Purposely | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 2 | 0 | 0 | 4.899 | 0.111 | 2.000 | 0.200 | 1.000 | 0.000 | 1.000 | 1.000 | 2.000 | 1.000 | |
8 | 2 | 0 | 0 | 4.899 | 0.111 | 2.000 | 0.200 | 1.000 | 0.000 | 1.000 | 1.000 | 2.000 | 1.000 | |
4 | 2 | 4 | 0 | 2.828 | 0.429 | 2.000 | 0.200 | 0.577 | 0.572 | 1.000 | 1.000 | 2.000 | 1.149 | |
1 | 0 | 7 | 2 | 0.000 | 0.538 | 0.000 | 0.000 | 0.000 | 0.768 | 0.000 | 0.000 | 0.000 | 0.768 | |
3 | 2 | 5 | 0 | 4.000 | 0.538 | 2.000 | 0.200 | 0.816 | 0.768 | 1.000 | 1.000 | 2.000 | 1.584 | |
2 | 2 | 6 | 0 | 4.899 | 0.667 | 2.000 | 0.200 | 1.000 | 1.000 | 1.000 | 1.000 | 2.000 | 2.000 | |
4 | 0 | 4 | 2 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.250 | |
4 | 0 | 4 | 2 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.250 | |
3 | 0 | 5 | 2 | 0.000 | 0.333 | 0.000 | 0.000 | 0.000 | 0.399 | 0.000 | 0.000 | 0.000 | 0.399 | |
1 | 0 | 7 | 2 | 0.000 | 0.538 | 0.000 | 0.000 | 0.000 | 0.768 | 0.000 | 0.000 | 0.000 | 0.768 | |
0 | 0 | 8 | 2 | 0.000 | 0.667 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 1.000 | |
8 | 2 | 0 | 0 | 4.899 | 0.111 | 2.000 | 0.200 | 1.000 | 0.000 | 1.000 | 1.000 | 2.000 | 1.000 |
Negative | Positive | |||||
---|---|---|---|---|---|---|
Group | Metric Combination | Combined Metrics | Correlation | Acronyms | Combined | Correlation |
1 | Tarantula and Wong2 | High | OP2 and Wong1 | High | ||
Tarantula and Wong3 | Moderate | GP19 and Dstar | Moderate | |||
Ample and Wong2 | Low | Dstar and GP03 | Low | |||
Ample and Wong3 | Negligible | Ochiai2 and GP02 | Negligible | |||
2 | Wong3 and SEM1 | High | OP2 and Russel | High | ||
Rogers and Tarantula | Moderate | Tarantula and Barinel | Moderate | |||
Tarantula and Euclid | Low | Jaccard and Harmonic | Low | |||
GP19 and Rogers-Tanimoto | Negligible | Dstar and M1 | Negligible | |||
3 | Fleiss and SEM1 | High | Dice and SEM3 | High | ||
Scott and SEM1 | Moderate | SEM2 and Cohen | Moderate | |||
Rogers and SEM1 | Low | Rogot1 and Barinel | Low | |||
Euclid and SEM3 | Negligible | Hamman and M2 | Negligible |
Small programs (3037 LOC) | Medium programs (5725 LOC) | Large programs (14,333 LOC) | |||||
---|---|---|---|---|---|---|---|
Group | Combined Metrics | Exam | AWE | Exam | AWE | Exam | AWE |
Negative | 4.51 | 205 | 8.43 | 326 | 12.1 | 2163 | |
4.51 | 205 | 8.43 | 326 | 12.1 | 2163 | ||
6.88 | 305 | 9.56 | 351 | 15.82 | 2780 | ||
6.88 | 305 | 9.56 | 351 | 15.82 | 2780 | ||
23.78 | 750 | 25.18 | 886 | 35.96 | 5357 | ||
7.20 | 235 | 8.72 | 322 | 14.45 | 2521 | ||
5.31 | 220 | 7.03 | 318 | 11.31 | 2057 | ||
7.91 | 281 | 11.44 | 396 | 17.95 | 3032 | ||
13.71 | 405 | 12.56 | 560 | 13.52 | 2493 | ||
25.41 | 799 | 13.38 | 578 | 15.45 | 2818 | ||
25.14 | 785 | 32.06 | 1020 | 41.64 | 5631 | ||
4.62 | 212 | 10.14 | 397 | 11.92 | 2300 | ||
Positive | 4.13 | 130 | 7.05 | 389 | 8.32 | 1989 | |
15.12 | 1104 | 9.14 | 562 | 10.1 | 2877 | ||
18.31 | 1214 | 20.93 | 646 | 20.48 | 3140 | ||
12.02 | 905 | 6.69 | 319 | 10.46 | 2020 | ||
4.61 | 150 | 7.05 | 389 | 8.32 | 1989 | ||
4.52 | 149 | 6.50 | 511 | 7.04 | 2532 | ||
4.18 | 133 | 6.20 | 379 | 7.70 | 2023 | ||
20.45 | 708 | 35.39 | 870 | 42.19 | 5243 | ||
4.95 | 194 | 6.31 | 510 | 7.11 | 2528 | ||
5.02 | 207 | 7.48 | 374 | 8.83 | 2034 | ||
4.95 | 211 | 6.63 | 344 | 6.65 | 1545 | ||
5.11 | 224 | 7.44 | 322 | 11.79 | 2207 |
Small programs (3037 LOC) | Medium programs (5725 LOC) | Large programs (14,333 LOC) | |||||
---|---|---|---|---|---|---|---|
Group | Combined Metrics | Exam | AWE | Exam | AWE | Exam | AWE |
Combined | 4.51 | 205 | 8.43 | 326 | 12.1 | 2163 | |
5.31 | 220 | 7.03 | 318 | 11.31 | 2057 | ||
4.62 | 212 | 10.14 | 397 | 11.92 | 2300 | ||
4.13 | 130 | 7.05 | 389 | 8.32 | 1989 | ||
4.18 | 133 | 6.20 | 379 | 7.70 | 2023 | ||
4.95 | 211 | 6.63 | 344 | 6.65 | 1545 | ||
Maximal | OP2 | 4.12 | 130 | 7.20 | 190 | 8.67 | 1999 |
Tarantula | 12.97 | 417 | 14.45 | 363 | 18.75 | 2190 | |
Ochiai1 | 12.21 | 402 | 14.45 | 361 | 18.77 | 2159 | |
Ochiai2 | 6.80 | 238 | 6.85 | 364 | 8.77 | 2203 | |
Ample | 8.47 | 298 | 9.06 | 320 | 9.46 | 1558 | |
Jaccard | 5.82 | 156 | 7.26 | 390 | 8.42 | 2158 | |
DStar | 29.77 | 1098 | 14.05 | 440 | 13.64 | 2625 | |
GP02 | 24.05 | 758 | 20.18 | 343 | 23.44 | 1698 | |
GP03 | 12.43 | 345 | 42.57 | 503 | 34.12 | 2170 | |
GP19 | 4.99 | 159 | 10.85 | 401 | 12.08 | 2389 | |
Wong1 | 8.63 | 368 | 12.3 | 621 | 14.29 | 3325 | |
Wong2 | 15.92 | 498 | 39.26 | 605 | 45.62 | 2945 | |
Wong3 | 15.92 | 498 | 39.26 | 605 | 45.62 | 2945 | |
Kulczynski1 | 30.12 | 1111 | 14.27 | 441 | 13.72 | 2626 | |
Non-maximal | SEM1 | 30.29 | 990 | 16.60 | 479 | 19.55 | 2679 |
Euclid | 15.92 | 498 | 39.26 | 605 | 45.62 | 2945 | |
Fleiss | 11.90 | 403 | 16.25 | 254 | 27.11 | 2157 | |
Scott | 21.96 | 711 | 33.23 | 463 | 41.67 | 2757 | |
Rogers-Tanimoto | 15.92 | 498 | 39.26 | 605 | 45.62 | 2945 | |
SEM3 | 6.70 | 238 | 8.67 | 363 | 11.04 | 2214 | |
Russel&Rao | 8.63 | 368 | 12.30 | 621 | 14.29 | 3325 | |
Barinnel | 12.97 | 417 | 8.44 | 361 | 9.37 | 2159 | |
Harmonic-Mean | 6.56 | 230 | 8.83 | 402 | 11.60 | 2059 | |
M2 | 5.28 | 176 | 8.72 | 389 | 10.03 | 2390 | |
Dice | 5.89 | 190 | 8.26 | 380 | 9.42 | 2298 | |
Cohen | 6.13 | 213 | 9.79 | 199 | 11.23 | 2080 | |
Rogot1 | 35.72 | 1244 | 16.31 | 520 | 14.43 | 2178 | |
Hamman | 15.92 | 498 | 39.26 | 605 | 45.62 | 2945 | |
SEM2 | 6.56 | 230 | 9.61 | 380 | 11.51 | 2032 | |
M1 | 15.89 | 498 | 16.16 | 254 | 10.89 | 2290 |
Small program (%) | Medium program (%) | Large program (%) | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Metrics | acc@1 | acc@3 | acc@5 | acc@10 | acc@1 | acc@3 | acc@5 | acc@10 | acc@1 | acc@3 | acc@5 | acc@10 | |
Combined | 16 | 24 | 30 | 41 | 18 | 30 | 43 | 52 | 15 | 25 | 28 | 37 | |
10 | 12 | 12 | 20 | 18 | 30 | 43 | 53 | 15 | 25 | 29 | 37 | ||
16 | 15 | 35 | 28 | 22 | 40 | 51 | 58 | 21 | 30 | 33 | 41 | ||
16 | 24 | 30 | 43 | 21 | 34 | 42 | 50 | 15 | 23 | 26 | 34 | ||
16 | 24 | 30 | 43 | 18 | 35 | 45 | 53 | 16 | 25 | 29 | 37 | ||
12 | 18 | 20 | 28 | 17 | 34 | 45 | 54 | 16 | 25 | 29 | 38 | ||
Maximal | OP2 | 16 | 24 | 30 | 43 | 17 | 30 | 40 | 49 | 16 | 23 | 26 | 34 |
Tarantula | 6 | 8 | 9 | 10 | 17 | 30 | 43 | 52 | 16 | 23 | 28 | 36 | |
Ochiai1 | 6 | 6 | 8 | 10 | 17 | 30 | 44 | 54 | 16 | 24 | 28 | 36 | |
Ochiai2 | 16 | 21 | 24 | 36 | 16 | 30 | 43 | 52 | 16 | 24 | 28 | 36 | |
Ample | 17 | 24 | 30 | 41 | 17 | 29 | 37 | 46 | 15 | 21 | 24 | 31 | |
Jaccard | 16 | 24 | 30 | 41 | 15 | 32 | 44 | 55 | 17 | 25 | 29 | 37 | |
DStar | 2 | 8 | 10 | 18 | 13 | 28 | 37 | 45 | 14 | 20 | 24 | 30 | |
GP02 | 0 | 1 | 1 | 3 | 6 | 15 | 18 | 21 | 3 | 4 | 4 | 7 | |
GP03 | 13 | 21 | 25 | 32 | 4 | 6 | 11 | 13 | 1 | 4 | 4 | 5 | |
GP19 | 17 | 25 | 26 | 35 | 15 | 22 | 32 | 39 | 11 | 17 | 19 | 24 | |
Wong1 | 0 | 1 | 1 | 2 | 2 | 9 | 12 | 16 | 1 | 3 | 4 | 7 | |
Wong2 | 16 | 21 | 25 | 30 | 9 | 11 | 17 | 22 | 8 | 13 | 14 | 18 | |
Wong3 | 16 | 21 | 25 | 30 | 9 | 11 | 17 | 22 | 8 | 13 | 14 | 18 | |
Kulczynski1 | 2 | 8 | 10 | 18 | 13 | 27 | 38 | 45 | 14 | 21 | 25 | 30 | |
Non-maximal | SEM1 | 0 | 0 | 0 | 1 | 1 | 3 | 11 | 15 | 0 | 2 | 4 | 5 |
Euclid | 10 | 18 | 23 | 28 | 9 | 11 | 17 | 22 | 8 | 13 | 14 | 18 | |
Fleiss | 10 | 18 | 24 | 29 | 16 | 29 | 43 | 51 | 17 | 24 | 26 | 36 | |
Scott | 5 | 5 | 7 | 9 | 12 | 17 | 26 | 34 | 11 | 18 | 19 | 23 | |
Rogers-Tanimoto | 10 | 24 | 29 | 30 | 9 | 11 | 17 | 22 | 8 | 13 | 14 | 18 | |
SEM3 | 11 | 23 | 29 | 30 | 15 | 33 | 43 | 54 | 16 | 24 | 28 | 37 | |
Russel&Rao | 0 | 1 | 1 | 2 | 2 | 9 | 12 | 16 | 1 | 3 | 4 | 7 | |
Barinel | 6 | 8 | 9 | 10 | 17 | 30 | 44 | 54 | 16 | 24 | 28 | 36 | |
Harmonic-Mean | 11 | 24 | 30 | 29 | 17 | 29 | 41 | 51 | 16 | 23 | 26 | 35 | |
M2 | 12 | 25 | 31 | 31 | 17 | 30 | 40 | 49 | 16 | 23 | 26 | 34 | |
Dice | 10 | 20 | 25 | 29 | 15 | 32 | 44 | 55 | 17 | 25 | 29 | 37 | |
Cohen | 9 | 19 | 25 | 29 | 15 | 30 | 43 | 54 | 17 | 24 | 28 | 36 | |
Rogot1 | 2 | 9 | 11 | 20 | 13 | 28 | 37 | 44 | 13 | 20 | 24 | 30 | |
Hamman | 12 | 24 | 29 | 28 | 9 | 11 | 17 | 22 | 8 | 13 | 14 | 18 | |
SEM2 | 13 | 24 | 30 | 27 | 17 | 29 | 41 | 51 | 16 | 23 | 26 | 35 | |
M1 | 5 | 8 | 5 | 30 | 16 | 24 | 29 | 40 | 12 | 17 | 25 | 28 |
Maximal and Maximal | Non-Maximal and Maximal | Non-Maximal and Non-Maximal | |||||
---|---|---|---|---|---|---|---|
Metrics | |||||||
Maximal | OP2 | 1.00 | 0.58 | 0.78 | 0.58 | 1.00 | −0.10 |
> | > | > | > | > | < | ||
Tarantula | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Ochiai1 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Ochiai2 | −1.00 | −1.00 | −1.00 | −1.00 | 0.15 | −1.00 | |
< | < | < | < | > | < | ||
Ample | 0.94 | 0.85 | 0.88 | 0.85 | 1.00 | −0.44 | |
> | > | > | > | > | < | ||
Jaccard | −1.00 | −1.00 | −1.00 | −1.00 | 1.00 | −1.00 | |
< | < | < | < | > | < | ||
DStar | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
GP02 | −1.00 | −1.00 | −1.00 | −1.00 | 1.00 | −1.00 | |
< | < | < | < | > | < | ||
GP03 | −1.00 | −1.00 | −1.00 | −1.00 | −0.65 | −1.00 | |
< | < | < | < | < | < | ||
GP19 | −1.00 | −1.00 | −1.00 | −1.00 | −0.95 | −1.00 | |
< | < | < | < | < | < | ||
Wong1 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Wong2 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Wong3 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Kulczynski1 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Non-maximal | SEM1 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 |
< | < | < | < | < | < | ||
Euclid | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Fleiss | −1.00 | −1.00 | −1.00 | −1.00 | 1.00 | −1.00 | |
< | < | < | < | > | < | ||
Scott | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Rogers-Tanimoto | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
SEM3 | −1.00 | −1.00 | −1.00 | −1.00 | 1.00 | −1.00 | |
< | < | < | < | > | < | ||
Russel&Rao | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Barinnel | −1.00 | −1.00 | −1.00 | −1.00 | 1.00 | −1.00 | |
< | < | < | < | < | < | ||
Harmonic-Mean | −0.92 | −0.82 | −0.85 | −0.82 | 1.00 | −0.40 | |
< | < | < | < | > | < | ||
M2 | −1.00 | −1.00 | −1.00 | −1.00 | 1.00 | −1.00 | |
< | > | < | < | < | < | ||
Dice | −0.94 | −0.84 | −0.88 | −0.85 | 0.96 | −0.44 | |
< | < | < | < | > | < | ||
Cohen | 0.15 | 0.10 | 0.12 | 0.10 | 0.45 | −0.88 | |
> | > | > | > | > | < | ||
Rogot1 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
Hamman | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < | ||
SEM2 | 0.20 | −0.73 | −0.75 | −0.70 | 0.32 | −0.31 | |
> | < | < | < | > | < | ||
M1 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | −1.00 | |
< | < | < | < | < | < |