Locating Source Code Bugs in Software Information Systems Using Information Retrieval Techniques
Abstract
1. Introduction
- How does segmenting the source code into body and comments affect the bug localization process?
- Does our developed POS method improve the bug localization process?
- To what extent could the synonyms of bug reports and comments improve the bug localization process?
- Does our developed approach improve the accuracy of bug localization?
2. Theoretical Background
2.1. Bug Report
2.2. Bug Localization Techniques
- Static techniques: These depend on the structure of the source code and help the developer locate individual program elements such as classes and methods. The static approach can be performed at any development phase. It relies mainly on IR techniques that represent the source code as a textual corpus, which is searched using the bug report as the query.
2.3. Preprocessing Techniques
- Stop Words Removal: very frequent words such as “the”, “a”, “and”, and “or” add noise because they do not discriminate documents from one another, so they should be removed.
- Text Normalization: splitting text into tokens; for example, an identifier named “FirstVariable” is tokenized into “First” and “Variable” according to camel-case notation.
- Stemming: removing prefixes and suffixes to reduce each term to its root [13].
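The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing code used in the study: the stop-word list is a tiny hypothetical sample, and the function names `split_identifier` and `preprocess` are introduced here only for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}


def split_identifier(token):
    """Split a camel-case or snake_case identifier into lowercase words."""
    # Acronym runs (e.g. "XML"), capitalized or lowercase words, and digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", token)
    return [part.lower() for part in parts]


def preprocess(text):
    """Tokenize text, split identifiers, and drop stop words."""
    raw_tokens = re.findall(r"[A-Za-z0-9_]+", text)
    words = []
    for token in raw_tokens:
        words.extend(split_identifier(token))
    return [word for word in words if word not in STOP_WORDS]


print(preprocess("The FirstVariable is null"))
```

Splitting identifiers before stop-word removal means that a stop word embedded in an identifier (for example, the "Of" in a hypothetical `indexOfItem`) is also dropped; whether that is desirable depends on the corpus.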
2.4. Information Retrieval Models
- Vector Space Model (VSM): represents queries and documents as vectors of term weights. The weight is usually the TF-IDF value of the corresponding term. Term frequency (TF) is the number of occurrences of the term in the document, while inverse document frequency (IDF) decreases as the number of documents in the corpus that contain the term increases. The higher the TF and IDF of a term, the more significant the term is and the higher the weight it is assigned [12].
- Smoothed Unigram Model (SUM): a probabilistic model that ranks documents by the probability that each document generates all of the query terms. The model is smoothed so that a document lacking a query term is not assigned zero probability, hence the name [12].
- Latent Semantic Indexing Model (LSI): applies singular value decomposition (SVD) to the term–document matrix to reduce the dimensionality of the space spanned by the documents; it is sometimes called latent semantic analysis (LSA) [12].
- Latent Dirichlet Allocation Model (LDA): a probabilistic topic model that represents documents and queries by their topics. Each topic is represented as a vector of terms, and each document as a vector of topics [12].
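To make the VSM description concrete, the following sketch ranks tokenized source files against a tokenized bug report using TF-IDF weights and cosine similarity. It is an illustration under stated assumptions rather than the model configuration used in the paper: the IDF variant `log(N / df) + 1` is one common choice picked for the example, and the file names and tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return TF-IDF weight dicts for tokenized docs, plus the IDF table."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant; rarer terms receive larger values.
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs, names):
    """Return document names ordered from most to least similar to the query."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the corpus are ignored.
    q = {term: tf * idf[term] for term, tf in Counter(query).items() if term in idf}
    scored = [(cosine(q, vec), name) for vec, name in zip(vectors, names)]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical corpus of three preprocessed source files and one bug report.
names = ["ConsoleView.java", "FileReader.java", "Logger.java"]
docs = [["console", "view", "print"], ["file", "reader", "open"], ["console", "log"]]
print(rank(["console", "view", "crash"], docs, names))
```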
3. Previous Works
Spectrum-Based Bug Localization (Dynamic Approaches)
4. Methodology
4.1. Data Sampling
- AspectJ 1.0, aspect-oriented programming (AOP): An aspect-oriented extension of the Java programming language. AOP is a paradigm that aims to improve modularity by separating cross-cutting concerns (aspects) and adding behavior to existing code without modifying that code.
- Eclipse 3.1, integrated development environment (IDE): A widely used IDE for developing software applications and systems. Eclipse is written in Java and is mainly used to develop Java applications. Through plugins, it also supports development in other languages, including Ada, ABAP, C, C++, C#, Clojure, COBOL, D, Erlang, Fortran, Groovy, Haskell, JavaScript, and Julia.
4.2. Converting Software and Bug Reports into XML Format
4.3. Source Code Segmentation
4.3.1. Extracting Software Artifacts
4.3.2. Natural Language Preprocessing
- Stop words removal: these words obscure the meaning of documents and are not valuable as index terms, so stop words such as articles and pronouns are removed. Moreover, Java keywords such as break, for, char, and default are needed only to run the program and carry no information relevant to the bug report, so we remove them as well.
- Tokenization: every compound word in the documents is divided into its components; for example, “ConsoleView” yields “console” and “view” after tokenization. Tokenization increases the number of relevant, valuable terms. It supports the indexing process and thus enables more matches between the bug report and the documents containing the tokenized terms; moreover, synonym generation tools cannot generate synonyms for non-tokenized terms.
- if the word ends in “ing”, remove the “ing”; for example, “processing” becomes “process” after stemming.
- if the word ends in “ly”, remove the “ly”; for example, “friendly” becomes “friend” after stemming.
- if the word ends in “ed”, remove the “ed”; for example, “happened” becomes “happen” after stemming.
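The three suffix rules, together with Java keyword removal, can be sketched as follows. This is a simplified illustration, not the stemmer used in the study: the keyword set is a small sample of Java's reserved words, and the minimum-length guard (which keeps short words such as “red” or “ring” intact) is an assumption added for the example.

```python
# Small sample of Java reserved words (illustrative, not the full list).
JAVA_KEYWORDS = {"break", "for", "char", "default", "class", "int", "return"}

# Suffixes from the three rules above, checked in the order listed.
SUFFIXES = ("ing", "ly", "ed")


def simple_stem(word):
    """Strip one of the listed suffixes, leaving at least three characters."""
    for suffix in SUFFIXES:
        # Assumed guard: do not stem very short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def filter_and_stem(tokens):
    """Drop Java keywords, then stem the remaining tokens."""
    return [simple_stem(token) for token in tokens if token not in JAVA_KEYWORDS]


print(filter_and_stem(["for", "processing", "friendly", "happened", "char"]))
```

Rule-based suffix stripping of this kind over-stems some words (for example, “bring” would lose its “ing”), which is why production systems usually rely on an established stemming algorithm.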
4.3.3. Natural Language Understanding (Part-of-Speech and Synonyms’ Generation)
- Rule#1: if the bug description or summary contains a noun followed by the suffix “.java”, then the preceding noun is highly recommended as the buggy file name.
- Rule#2: if the bug description or summary contains a noun followed by “()”, then the preceding noun is highly recommended as the name of the method that causes the bug.
- Rule#3: if the bug description or summary contains the words “fatal error” + pronouns, then the following noun is highly recommended as the buggy file name.
- Rule#4: if the bug description or summary contains the symbol “[” + noun + “]”, then the noun within the brackets is highly recommended as the buggy file/method name.
- Rule#5: if the bug description or summary contains noun + symbol “#” + noun, then this noun is highly recommended as the buggy service name.
- Rule#6: if the bug description or summary contains a noun followed by the symbol “/”, then this is highly recommended as the path of the buggy source file.
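A rough approximation of Rules 1, 2, 4, and 6 can be written with regular expressions. This sketch makes a strong simplifying assumption: it treats any identifier-like token as the "noun", whereas the rules above rely on part-of-speech tagging to confirm that. Rules 3 and 5 are omitted because they depend more heavily on the tagger's output. The names `RULES` and `extract_hints` and the sample text are hypothetical.

```python
import re

# Identifier-like token standing in for a POS-tagged noun (an assumption).
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

RULES = {
    # Rule 1: noun followed by the ".java" suffix -> candidate file name.
    "file": re.compile(rf"\b({IDENT})\.java\b", re.IGNORECASE),
    # Rule 2: noun followed by "()" -> candidate method name.
    "method": re.compile(rf"\b({IDENT})\(\)"),
    # Rule 4: noun enclosed in square brackets -> candidate file/method name.
    "bracketed": re.compile(rf"\[({IDENT})\]"),
    # Rule 6: tokens joined by "/" -> candidate source path.
    "path": re.compile(r"((?:[\w.-]+/)+[\w.-]+)"),
}


def extract_hints(text):
    """Return, per rule label, the candidate names found in a bug report."""
    return {label: pattern.findall(text) for label, pattern in RULES.items()}


report = "Crash in ConsoleView.java after refresh() is called"
print(extract_hints(report))
```

Candidates extracted this way could then be given extra weight when ranking files, but how the study combines them with the retrieval score is not shown in this sketch.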
4.3.4. Ranking and Retrieval Model
4.3.5. Data Analysis and Interpretation
4.4. Experiments and Results
4.5. Evaluation and Discussion
5. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Hanandeh, F.; Saifan, A.A.; Akour, M.; Al-Hussein, N.K.; Shatnawi, K.Z. Evaluating Maintainability of Open Source Software: A Case Study. Int. J. Open Source Softw. Process. (IJOSSP) 2017, 8, 1–20.
- Tantithamthavorn, C.; Abebe, S.L.; Hassan, A.E.; Ihara, A.; Matsumoto, K. The Impact of IR-based Classifier Configuration on the Performance and the Effort of Method-Level Bug Localization. Inf. Softw. Technol. 2018, 102, 160–174.
- Zhou, J.; Zhang, H.; Lo, D. Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports. In Proceedings of the 2012 34th International Conference on Software Engineering (ICSE), Zurich, Switzerland, 2–9 June 2012; pp. 14–24.
- Khatiwada, S.; Tushev, M.; Mahmoud, A. Just enough semantics: An information theoretic approach for ir-based software bug localization. Inf. Softw. Technol. 2018, 93, 45–57.
- Aljawarneh, S.A.; Alawneh, A.; Jaradat, R. Cloud security engineering: Early stages of SDLC. Future Gener. Comput. Syst. 2017, 74, 385–392.
- Dilshener, T.; Wermelinger, M.; Yu, Y. Locating bugs without looking back. Autom. Softw. Eng. 2018, 25, 383–434.
- Huang, Y.; Huang, S.; Chen, H.; Chen, X.; Zheng, Z.; Luo, X.; Jia, N.; Hu, X.; Zhou, X. Towards automatically generating block comments for code snippets. Inf. Softw. Technol. 2020, 127, 106373.
- Newman, C.D.; AlSuhaibani, R.S.; Decker, M.J.; Peruma, A.; Kaushik, D.; Mkaouer, M.W.; Hill, E. On the generation, structure, and semantics of grammar patterns in source code identifiers. J. Syst. Softw. 2020, 170, 110740.
- Moreno, L.; Treadway, J.J.; Marcus, A.; Shen, W. On the use of stack traces to improve text retrieval-based bug localization. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, 29 September–3 October 2014; pp. 151–160.
- Saha, R.K.; Lease, M.; Khurshid, S.; Perry, D.E. Improving bug localization using structured information retrieval. In Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), Silicon Valley, CA, USA, 11–15 November 2013; pp. 345–355.
- Davies, S.; Roper, M.; Wood, M. Using bug report similarity to enhance bug localisation. In Proceedings of the 2012 19th Working Conference on Reverse Engineering, Kingston, ON, Canada, 15–18 October 2012; pp. 125–134.
- Wang, S.; Liu, T.; Tan, L. Automatically learning semantic features for defect prediction. In Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), Austin, TX, USA, 14–22 May 2016; pp. 297–308.
- Rahman, S.; Ganguly, K.K.; Sakib, K. An improved bug localization using structured information retrieval and version history. In Proceedings of the 2015 18th International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, 21–23 December 2015; pp. 190–195.
- Chakraborty, S.; Li, Y.; Irvine, M.; Saha, R.; Ray, B. Entropy Guided Spectrum Based Bug Localization Using Statistical Language Model. arXiv 2018, arXiv:1802.06947.
- Sisman, B.; Kak, A.C. Incorporating version histories in information retrieval based bug localization. In Proceedings of the 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), Zurich, Switzerland, 2–3 June 2012; pp. 50–59.
- Beard, M. Extending bug localization using information retrieval and code clone location techniques. In Proceedings of the 2011 18th Working Conference on Reverse Engineering, Limerick, Ireland, 17–20 October 2011; pp. 425–428.
- Gharibi, R.; Rasekh, A.H.; Sadreddini, M.H. Locating relevant source files for bug reports using textual analysis. In Proceedings of the 2017 International Symposium on Computer Science and Software Engineering Conference (CSSE), Shiraz, Iran, 25–27 October 2017; pp. 67–72.
- Wong, C.P.; Xiong, Y.; Zhang, H.; Hao, D.; Zhang, L.; Mei, H. Boosting bug-report-oriented fault localization with segmentation and stack-trace analysis. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, 29 September–3 October 2014; pp. 181–190.
- Youm, K.C.; Ahn, J.; Lee, E. Improved bug localization based on code change histories and bug reports. Inf. Softw. Technol. 2017, 82, 177–192.
- Davies, S.; Roper, M. Bug localisation through diverse sources of information. In Proceedings of the 2013 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Pasadena, CA, USA, 4–7 November 2013; pp. 126–131.
- Alduailij, M.; Al-Duailej, M. Performance evaluation of information retrieval models in bug localization on the method level. In Proceedings of the 2015 International Conference on Collaboration Technologies and Systems (CTS), Atlanta, GA, USA, 1–5 June 2015; pp. 305–313.
- Lukins, S.K.; Kraft, N.A.; Etzkorn, L.H. Source code retrieval for bug localization using latent dirichlet allocation. In Proceedings of the 2008 15th Working Conference on Reverse Engineering, Antwerp, Belgium, 15–18 October 2008; pp. 155–164.
- Uneno, Y.; Mizuno, O.; Choi, E. Using a Distributed Representation of Words in Localizing Relevant Files for Bug Reports. In Proceedings of the 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), Vienna, Austria, 1–3 August 2016; pp. 183–190.
- Lam, A.N.; Nguyen, A.T.; Nguyen, H.A.; Nguyen, T.N. Bug localization with combination of deep learning and information retrieval. In Proceedings of the 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC), Buenos Aires, Argentina, 22–23 May 2017; pp. 218–229.
- Xiao, Y.; Keung, J.; Bennin, K.E.; Mi, Q. Machine translation-based bug localization technique for bridging lexical gap. Inf. Softw. Technol. 2018, 99, 58–61.
- Xiao, Y.; Keung, J.; Bennin, K.E.; Mi, Q. Improving bug localization with word embedding and enhanced convolutional neural networks. Inf. Softw. Technol. 2019, 105, 17–29.
- Lam, A.N.; Nguyen, A.T.; Nguyen, H.A.; Nguyen, T.N. Combining deep learning with information retrieval to localize buggy files for bug reports (n). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 476–481.
- Dao, T.; Zhang, L.; Meng, N. How does execution information help with information-retrieval based bug localization? In Proceedings of the 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC), Buenos Aires, Argentina, 22–23 May 2017; pp. 241–250.
- Malhotra, R.; Aggarwal, S.; Girdhar, R.; Chugh, R. Bug localization in software using NSGA-II. In Proceedings of the 2018 IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 28–29 April 2018; pp. 428–433.
- Zou, J.; Xu, L.; Yang, M.; Zhang, X.; Zeng, J.; Hirokawa, S. Automated duplicate bug report detection using multi-factor analysis. IEICE Trans. Inf. Syst. 2016, 99, 1762–1775.
- Gupta, A.; Suri, B.; Kumar, V.; Misra, S.; Blažauskas, T.; Damaševičius, R. Software Code Smell Prediction Model Using Shannon, Rényi and Tsallis Entropies. Entropy 2018, 20, 372.
- Kumari, M.; Misra, A.; Misra, S.; Fernandez Sanz, L.; Damasevicius, R.; Singh, V.B. Quantitative Quality Evaluation of Software Products by Considering Summary and Comments Entropy of a Reported Bug. Entropy 2019, 21, 91.
- Khurma, R.A.; Alsawalqah, H.; Aljarah, I.; Elaziz, M.A.; Damaševičius, R. An Enhanced Evolutionary Software Defect Prediction Method Using Island Moth Flame Optimization. Mathematics 2021, 9, 1722.
- Saifan, A.A.; Obeidat, L. Feature Location Enhancement Based on Source Code Augmentation with Synonyms of Terms. Softw. Pract. Exp. 2021, 51, 235–259.
- Hanna, S.; Alawneh, A. An Approach of Web Service Quality Attributes Specification. Commun. IBIMA 2010, 2010, 13. Available online: http://www.ibimapublishing.com/journals/CIBIMA/cibima.html (accessed on 15 September 2022).
- Hanna, S.; Alawneh, A.A. An ontology for the quality attributes of web services. Knowledge Management and Innovation in Advancing Economies: Analyses and Solutions. In Proceedings of the 13th International Business Information Management Association Conference, Marrakech, Morocco, 9–10 November 2009; Volume 3, pp. 1348–1358.
- Al-Shawakfa, E. A Rule-based Approach to Understand Questions in Arabic Question Answering. Jordanian J. Comput. Inf. Technol. 2016, 2, 210–231.
- Alazzam, I. Using Information Retrieval to Improve Integration Testing. Ph.D. Thesis, North Dakota State University, Fargo, ND, USA, 2012.
- Wang, S.; Lo, D. Version history, similar report, and structure: Putting them together for improved bug localization. In Proceedings of the 22nd International Conference on Program Comprehension, Hyderabad, India, 2–3 June 2014; pp. 53–63.
Dataset | Description | # of Source Files | # of Bug Reports |
---|---|---|---|
SWT | Java widget toolkit | 484 | 98 |
AspectJ | Java aspect-oriented extension | 6485 | 286 |
Eclipse | Open-source software for Java development | 12,836 | 3075 |

With Segmentation

Software | Recall | Precision |
---|---|---|
SWT | 74.65% | 75.62% |
AspectJ | 79.05% | 72.25% |
Eclipse | 74.25% | 76.05% |

Without Segmentation

Software | Recall | Precision |
---|---|---|
SWT | 55.80% | 55.36% |
AspectJ | 53.77% | 55.33% |
Eclipse | 43.29% | 55.77% |

With POS

Software | Recall | Precision |
---|---|---|
SWT | 65.92% | 68.89% |
AspectJ | 66.14% | 63.11% |
Eclipse | 63.11% | 69.13% |

Without POS

Software | Recall | Precision |
---|---|---|
SWT | 48.80% | 63.20% |
AspectJ | 43.68% | 49.96% |
Eclipse | 55.96% | 63.60% |

With Synonyms and POS

Software | Recall | Precision |
---|---|---|
SWT | 74.65% | 75.62% |
AspectJ | 79.05% | 72.25% |
Eclipse | 74.25% | 76.05% |

With Synonyms Only

Software | Recall | Precision |
---|---|---|
SWT | 59.11% | 66.33% |
AspectJ | 54.35% | 55.20% |
Eclipse | 43.87% | 52.01% |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alawneh, A.; Alazzam, I.M.; Shatnawi, K. Locating Source Code Bugs in Software Information Systems Using Information Retrieval Techniques. Big Data Cogn. Comput. 2022, 6, 156. https://doi.org/10.3390/bdcc6040156