A Flexible Code Review Framework for Combining Defect Detection and Review Comments
Round 1
Reviewer 1 Report
The authors discuss code defect detection, starting with binary classification and then moving to multi-class classification. However, the multi-class classification scenarios are hard to follow; an explanation of the classes and/or how they were constructed would have been helpful. Recent researchers have used algorithms more advanced than CNNs and LSTMs, such as Transformer networks, bi-LSTMs, or hybrids combining multiple algorithms, to detect code defects. It would have been valuable to see a comparison with such advanced algorithms.
The example skip-gram model diagram (Figure 4) was not needed; a more elaborated diagram would be helpful. Showing the model together with its number of parameters would be convenient. A detailed description of Figure 5 would also help, as it is the most central piece of the work.
The overall presentation is fine. However, a few figures and examples are unnecessary, as noted above.
Author Response
Many thanks for the reviewer's professional comments. We have uploaded an attachment with a point-by-point response to your comments. Thank you again for your comments and suggestions; they are very important for improving the quality of the manuscript.
Author Response File: Author Response.docx
Reviewer 2 Report
This paper presents a code defect detection pipeline called deep reviewer. Motivated by NLP research, the proposed pipeline utilizes LSTM for defect detection after code embedding. In addition to the defect classification results, a set of review comments is retrieved for defect reference. Please refer to my specific comments below.
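For concreteness, a minimal sketch of the pipeline as summarized above (embed code tokens, encode them with an LSTM, classify the defect, then retrieve the nearest stored review comment) might look as follows. All weights, the toy vocabulary, the comment bank, and the cosine-similarity matching criterion are illustrative assumptions, not details taken from the manuscript under review:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary of code tokens (illustrative only).
vocab = {"int": 0, "x": 1, "=": 2, "malloc": 3, "free": 4, ";": 5}
embed_dim, hidden_dim, n_classes = 8, 16, 2

E = rng.normal(0, 0.1, (len(vocab), embed_dim))      # token embeddings
W = rng.normal(0, 0.1, (4 * hidden_dim, embed_dim))  # LSTM input weights
U = rng.normal(0, 0.1, (4 * hidden_dim, hidden_dim)) # LSTM recurrent weights
b = np.zeros(4 * hidden_dim)
W_out = rng.normal(0, 0.1, (n_classes, hidden_dim))  # defect-classifier head

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_encode(token_ids):
    """Run a single-layer LSTM over embedded tokens; return final hidden state."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for t in token_ids:
        z = W @ E[t] + U @ h + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def classify(token_ids):
    """Defect class (here binary: defective / clean) from the code vector."""
    return int(np.argmax(W_out @ lstm_encode(token_ids)))

def retrieve_comment(token_ids, comment_bank):
    """Cosine similarity between the code vector and pre-encoded review
    comments -- one plausible 'match' criterion, assumed for illustration."""
    q = lstm_encode(token_ids)
    def cos(a, v):
        return float(a @ v / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-9))
    return max(comment_bank, key=lambda kv: cos(q, kv[1]))[0]

code = [vocab[t] for t in ["int", "x", "=", "malloc", ";"]]
bank = [("possible memory leak: missing free", rng.normal(size=hidden_dim)),
        ("uninitialized variable", rng.normal(size=hidden_dim))]
label = classify(code)
comment = retrieve_comment(code, bank)
```

With trained weights, the same `lstm_encode` vector would serve both the classifier head and the comment-retrieval step, which is one plausible reading of how a shared code representation could drive both outputs.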
(1) Generally, the pipeline works in theory. However, many technical details are missing. For example, there is no information on the review encoder and the matching step. What architecture is used for the review encoder? What numerical criterion is used to quantify the similarity between code samples? What benefit does LSTM-2 bring? Without these details, it is hard to evaluate the feasibility of the proposed pipeline.
(2) The defect detection pipeline is based on supervised learning in a closed-set setting. However, unknown defects commonly occur in software, and in such cases the proposed method may fail.
(3) The presentation could be better. For example, Fig. 1 is a good systematic diagram of the proposed pipeline, but the manuscript fails to connect Fig. 1 to the text. In addition, there are many typos in the manuscript.
Author Response
Many thanks for the reviewer's professional comments. We have uploaded an attachment with a point-by-point response to your comments. Thank you again for your comments and suggestions; they are very important for improving the quality of the manuscript.
Author Response File: Author Response.docx
Reviewer 3 Report
Thank you for the submission of the manuscript. Please review the following comments.
- The abstract is good.
- Before introducing a new framework, I recommend reviewing existing frameworks, or at least the closest one (in the code review context). How did you come up with the proposed framework? Is there any basis or reference point you rely on? I cannot see any discussion or explanation of this, apart from the related work.
- The background work is not comprehensive and remains at a preliminary level. A thorough comparison must be made to establish the need for a new or enhanced approach that leads to the proposed framework. For instance, highlight the limitations of static, dynamic, and hybrid approaches in table form; do the same for the text-based and machine-learning-based works. What can be deduced from your findings in the second-to-last paragraph of Section 2, the related works? Which approach will be your focal point? Justifications are not provided.
- There is no methodology section. Please provide one that explains how your work was conducted and how it maps to the objectives you want to achieve.
- How did you come up with Algorithm 1? Is it an improvement over [27-29]?
- Justify the usage of the evaluation metrics mentioned in section 5.1.
- The images in Figure 6 and Figure 9 are not clear and are difficult to comprehend.
- A discussion is missing from the work. I would suggest adding this section.
- What are the work’s limitations?
Author Response
Many thanks for the reviewer's professional comments. We have uploaded an attachment with a point-by-point response to your comments. Thank you again for your comments and suggestions; they are very important for improving the quality of the manuscript.
Author Response File: Author Response.docx
Reviewer 4 Report
There is an error in reference number one: year 2022
Author Response
Many thanks for the reviewer's professional comments. We have uploaded an attachment with a point-by-point response to your comments. Thank you again for your comments and suggestions; they are very important for improving the quality of the manuscript.
Author Response File: Author Response.docx
Reviewer 5 Report
1- The paper starts by referring to the two infamous accidents the B-737 MAX had during 2018. However, it falls short of addressing aircraft software and how the authors' proposal could help prevent such accidents.
2- The authors' claim that the two crashes of the B-737 MAX were due to a software defect is, in fact, unprofessionally misleading and incorrect.
3- It appears that the authors have no profound knowledge of aircraft autopilot software and its certification processes. Their claim that their work could be extended to aircraft autopilot software is clearly exaggerated.
4- I believe the authors must limit their claims to areas such as computer vision and language processing and present their work to related computer journals. In aerospace applications we have a different story: multiple software programs usually operate simultaneously, and, of course, faulty sensor inputs to the software could cause a mismatch that could pose a threat.
5- In line 46, the contributions of the paper are in fact just the proposed method. The authors must clearly identify their contributions in terms of performance merits. As a specific example, how does the "flexible framework" work? Table 2 only shows that the authors' proposed method has a better chance of succeeding in the same application domain.
Author Response
Many thanks for the reviewer's professional comments. We have uploaded an attachment with a point-by-point response to your comments. Thank you again for your comments and suggestions; they are very important for improving the quality of the manuscript.
Author Response File: Author Response.docx
Round 2
Reviewer 2 Report
The authors have addressed my concerns in the manuscript. Thanks.
Author Response
Thank you for your encouragement.
Reviewer 5 Report
1- The authors still insist on showing the importance of their work by referring to aerospace-related accidents, as such accidents, although rare, are mostly catastrophic. But their approach is quite general; nowhere in the work do they demonstrate that it is suitable for aerospace applications.
2- In fact, in aerospace applications there is no single computer program for critical jobs. Multiple computer programs are used, each written in a different language, and they operate simultaneously. So it would be interesting for the authors to discuss whether a specific programming language might contribute to mistakes and whether their proposed technique is sensitive to the choice of programming language.
Author Response
Reviewer #5 Concern #1:
The authors still insist on showing the importance of their work by referring to aerospace-related accidents, as such accidents, although rare, are mostly catastrophic. But their approach is quite general; nowhere in the work do they demonstrate that it is suitable for aerospace applications.
Author Response:
Many thanks for the reviewer's professional comments. Our purpose in referring to some aviation accidents is to illustrate situations where software defects are common and can cause devastating disasters; a code review process is therefore necessary. The code review process has been implemented for years by many companies, such as Google and Microsoft. The Aviation Airborne Systems and Equipment Software Development and Certification Standard (DO-178C) explicitly requires a rigorous code review process for civil aircraft airborne software. Thus, while our approach is indeed generic to the task of code review, it does have applications in aerospace software review, especially the review of civilian airborne software, with which the authors are familiar. As we mentioned in our first-round response:
“Due to the high security requirements of airborne software, its review process is very rigorous and is currently done primarily manually. Studies have shown that the average speed of code review is about 150 lines per hour, and even slower for safety-critical software, including airborne software. As software becomes larger and more complex, this experience-intensive work significantly delays the overall aircraft airworthiness inspection process.
Therefore, our manuscript seeks to provide reviewers and programmers with a tool that allows code defects, which are clearly defined and paired with targeted modification methods, to be detected quickly and automatically, thus improving the efficiency of airworthiness reviews.”
Reviewer #5 Concern #2:
In fact, in aerospace applications there is no single computer program for critical jobs. Multiple computer programs are used, each written in a different language, and they operate simultaneously. So it would be interesting for the authors to discuss whether a specific programming language might contribute to mistakes and whether their proposed technique is sensitive to the choice of programming language.
Author Response:
Thanks a lot for the reviewer's constructive suggestions. As far as we know, in order to prevent errors in software functionality, civilian airborne software for a particular function is often designed so that different programming teams work independently (to achieve logical independence of design), rather than by using different programming languages. In the design of airborne software for civil aircraft, the C/C++ languages occupy a dominant position. Taking the C-919 as an example, more than 95% of its programs are written in C/C++ (we are unable to provide a specific reference due to commercial confidentiality).
The C/C++ programming languages have been around for decades. To date, there have been no reports claiming that certain programming languages, including C/C++, are inherently flawed; defects are instead introduced during programming and compilation. The majority of the programs for the famous F-35 fighter jet were also written in C/C++. Ada was, of course, one of the common programming languages used for airborne software (especially in the Boeing family). The reason we did not test it is that no open database is available.
In the revised manuscript, we have added a discussion of the programming language chosen for the experiment and its possible impact on the experimental results.
In the 2nd paragraph, Section 5:
In theory, the proposed approach is applicable to any programming code in text form, owing to the learning ability of the deep learning model. We chose the C/C++ programming languages for our experiments because of their widespread use in the field of airborne software.
In the last paragraph, Section 5:
Q5: What are the limitations of this framework?
……
Although the proposed approach is theoretically applicable to other programming languages, we have unfortunately not been able to conduct experiments to verify this due to the lack of training data. This is the next step in our research: adapting the preprocessor (in Fig. 1) to support different programming languages.
Author Response File: Author Response.docx