Audio-Based Aircraft Detection System for Safe RPAS BVLOS Operations
Round 1
Reviewer 1 Report
This paper applies and compares many deep learning-based methods for sound event detection to the problem of aerial surveillance. Using the sound emitted by the object, the authors propose a number of effective methods based on existing CNNs and RNNs, more specifically YAMNet and VGG.
The paper is well written, and the reviewer has no difficulty understanding the detailed methods, databases, experimental setups, and results. The description is detailed enough that other researchers in the relevant fields can replicate the method very easily.
There are a few recommendations to improve the quality of the paper:
- Comparing Tables 3 and 4, YAMNet performs much better in both classification performance and computation time. However, there is little to no explanation of the architectures of YAMNet and VGG. The cited references concern different databases, so the authors should analyze this difference.
- The authors claim that BVLOS operation requires standalone inference on a very light GPU board. On page 4, line 117, the authors state that the embedded computer is an NVIDIA Jetson TX2. However, the reviewer cannot find any details on the implementation of the proposed method on that embedded board. The configuration is quite different from a GPGPU, so the modifications required for the embedded implementation are key to the proposed method.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors propose a novel audio-based aircraft detection system. Advantages of the article: 1) The authors did an intensive literature review to select the potentially best deep learning model for audio detection on a mobile platform. 2) The authors collected the data needed to train a classifier and used augmentation techniques to balance the types of samples. 3) The authors trained the classifier using the collected data. 4) The authors tested the performance of the system. Disadvantages of the article: 1) The system is not compared to any other from the literature. 2) The description of the "Hybrid" dataset is very unclear; I do not know how it was created.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Detect and Avoid systems for UAVs and other types of autonomous vehicles are interesting and relevant topics.
The obtained results seem promising. The clarity of presentation and the paper structure are good, although there are some comments and suggestions.
As stated by the authors as well, the main problem of the paper is the small test dataset. It seems that only one type of aircraft was used for recording samples. If that is the case (clarification needed), in my opinion it is too little for drawing final conclusions about the potential of the proposed approach.
Also, the authors should consider a more thorough introduction of additional noise into the samples. This should at least be commented on in the paper, because environmental noise, as well as internal noise from the UAV and its sensors, could have a significant impact on system performance.
A suggestion is to modify Table 3. Typing "/" between two variables implies division of the two values, while here we have separate indicators. Therefore, writing the TPR and FPR values in two separate columns may be more appropriate.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
At the beginning, congratulations on choosing such an interesting research topic.
The fundamental literature analysis of audio systems is sufficient; however, there are no references to the class of vision systems mentioned in lines 68-70. I suggest referring to works in which the relationship between the viewing angle and the resolution of vision sensors in "Detect and Avoid" tasks was explained in detail:
1. Petridis, S., Geyer, C., & Singh, S. (2008). Learning to detect aircraft at low resolutions. In International Conference on Computer Vision Systems (pp. 474-483). Springer, Berlin, Heidelberg.
2. Rzucidło, P., Rogalski, T., Jaromi, G., Kordos, D., Szczerba, P. and Paw, A. (2020), "Simulation studies of a vision intruder detection system", Aircraft Engineering and Aerospace Technology, Vol. 92 No. 4, pp. 621-631.
There are also no references to works in the field of laser detection.
The description of Figure 1 is too general. There is no information about the sensor resolution or the viewing angle.
Please provide the STFT transform parameters explicitly rather than only referring to the source code (lines 183-184). In [38] I found, e.g., "(...) window size of 25 ms, a window hop of 10 ms (...)". Why were exactly this window width and this overlap used?
Do the authors plan to use the wavelet transform instead of the STFT for acoustic detection tasks in future research?
The results obtained in the study refer to single-engine airplane sounds recorded from the ground. Sounds occurring in the airspace and recorded from on board another aircraft have a different form; the authors are aware of this, but they took it into account in only one short sentence (lines 289-290). This requires additional explanation. What are the expected effects and disturbances related to the sounds coming from the propulsion of the aircraft carrying the detection system? Does the airflow around the acoustic sensors preclude their practical use? The results obtained are interesting and promising, but the factors I mentioned may significantly hinder practical applications. Therefore, I propose expanding the discussion of this in the summary. Additionally, it would be reasonable to change the title to a more adequate one, for example "Preliminary Tests of Audio-Based ... (...)".
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
I have no further complaints about the paper. The authors have answered my comments in a proper manner.
Author Response
We would like to thank the reviewer once again for taking the time to review our work and for providing encouraging and constructive comments.