Multiscale Object Detection from Drone Imagery Using Ensemble Transfer Learning
Round 1
Reviewer 1 Report
The advantages of this article are its disadvantages. The article is too long. I understand the desire of the authors to show all the results. I believe that some of the images in the article can be removed.
Figures 1 and 2 do not carry any particular semantic load; there is little value in reproducing example images from well-known datasets.
Figs. 6-9 are uninformative at a scale of 100%.
The authors need to reduce the number of figures and increase their scale.
Perhaps only the tables should be kept.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
The manuscript presents an ensemble transfer learning and data augmentation approach to enhance the performance of three base models. The manuscript hints at its significance for remote sensing data users but in reality it analyses general drone imagery. This is beneficial as there is a gap when it comes to drone imagery, including limited numbers of data and challenges such as acquisition at multiple heights and lighting conditions. However, the manuscript could be more significant to the remote sensing audience if it applies or even discusses some of the main findings and their implications to drone-derived orthomosaics, a product more common to remote sensing scientists.
In addition to this major issue, minor issues relate mostly to the manuscript writing style. For example, authors could avoid the use of so many acronyms that are not used frequently and check the consistent use of hyphenated words. Moreover, suggest defining UAV as uncrewed aerial vehicles (as per recent publications).
The methods section could be clearer with the following changes:
- Include the size/resolution of the images from the datasets.
- Consider including the model architectures.
- Rework Figure 3 to better show where the imagery is coming from (include the names of the datasets and the number of images used for training/testing) and how results are being tested.
- Include a section detailing model accuracy assessment. Suggest moving first paragraph of Section 3 3.1 and 3.2 to methods.
In terms of structure, the results section could be improved if data augmentation metrics and per class results are described. Suggest moving sections 4.1 and 4.2 from discussion to results.
Discussion could improve by including how results relate to the existing body of knowledge and future scope (suggest moving this from conclusion). The conclusion section should only present a summary of main findings. Furthermore, discussing how these findings might apply to the analysis of orthomosaics derived from drone imagery would expand the significance of this study and draw more interest from the remote sensing audience.
Author Response
Please see the attachment
Author Response File: Author Response.docx
Reviewer 3 Report
This paper describes a technique for object detection based on both data augmentation and ensemble techniques. In addition, the authors describe a set of customized procedures for processing UAV digital images. Such a context is still a challenge in the computer vision area because of the scarcity of UAV image datasets.
The authors claim a set of original results from their research, namely: (1) experimentation with several object detection algorithms to identify their suitability for detecting objects at various scales; (2) a solution to compensate for the lack of UAV datasets, applying test-time augmentation to the UAV image datasets to boost and verify the accuracy of object detection and of the ensemble models (ensemble method and voting strategies); (3) a framework combining multiple object detection algorithms (both single-stage and two-stage), leading to an innovative multi-technique ensemble algorithm for effective detection of objects over a wide range of scales.
The subject of the paper is interesting and well suitable to the Journal.
However, the authors should improve the review of the literature, not only in relation to the state of the art, but also in relation to possible patents or available commercial software.
In addition, they should improve the introduction by also taking into account concepts from embedded hardware and software for image processing and integration, considering the UAV's characteristics.
Likewise, the sentences below should be better contextualized:
“In this paper, we have specifically considered the OD from drone images. [10,11,12,13].”
“A variety of improvements to DPM are reported in [18,21,24,25,26].”
“This does not consider the features of the deep layers of the CNN, which may be useful for object classification. Also, these features are not typically useful for localizing objects. To that end, FPN was developed building semantics at all scales of resolution.”
“Although the single-stage methods have improved speed and simplicity, they typically have lesser accuracy as compared to the two-stage detectors. Lin et al. have identified the reasons for this and proposed RetinaNet [44].”
“Various improvements are proposed in [48,52,53,54,55,56,57,58,59,60,61,62].”
“Variations are also proposed in [71,72,73,74,75].”
“Deep learning-based detection methods such as RCNNis employed [95,96,97,98,99,100,101,102].”
“All the existing approaches, such as DPNet-ensemble, RRNet, and ACM-OD [10] for the Visdrone dataset, proposed a custom model which was trained and tested on the dataset.”
The context of the passage below should be better explained, since such a generalization may not hold exactly as stated:
“Aerial images captured by drones and other autonomous vehicles have challenged computer vision models for many reasons. The size of the objects in the captured images is often too small for accurate detection. The images contain a wide variety of terrains, backgrounds, and irregular objects that are not found in everyday life. Furthermore, the resolution of UAV images is often low. Focusing on the detection classes, which are fairly imbalanced with certain types of objects poorly represented, the size of the objects varies greatly from small to large objects depending on the angle, elevation, and actual size.”
How did the authors select the UAV image datasets used in the presented study? The authors should give more detail about the criteria used for this choice. Why only two datasets?
In the Materials and Methods section, the authors should make the context of their method clear, i.e., rewrite the sentences quoted below:
“Our implementation is different in the execution part. We apply a two-level voting strategy ensemble, both on the single-model and meta-model levels, compared to the single level ensemble in the original paper, shown in Fig. 3. We applied various voting strategies to obtain the results:…”
Figure 3 (Complete pipeline for the proposed approach) should be better explained in the text. Also, the caption should refer to the "developed approach" rather than the "proposed approach".
The Experiments and Results section should be carefully reorganized so that readers can understand what the authors' contribution actually is. The expected baseline is not clear, and the values (%) presented in the tables are in fact low.
How can one be sure that the final prediction (Affirmative, Consensus, or Unanimous) is really adequate for object identification? The authors should verify the accuracy of object detection in this context and include additional comments about the scale and the solutions they have found.
Furthermore, the final framework combining multiple object detection algorithms (both single-stage and two-stage) should also be presented, to make clear the authors' contribution in terms of an innovative multi-technique ensemble algorithm for effective detection of objects over a wide range of scales.
The Conclusion section should be revisited, and parts of its content should be either removed or rewritten; for example, it repeats definitions that were already presented earlier in the text. There is no need to be repetitive, and conclusions should be drawn from the presented results. Only conclusions and possible future work related to the subject of the paper should be presented in this section.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 4 Report
Expand UAV in the Abstract.
The first section, "Introduction," must be labeled as 1.
I suggest merging the first and second sections.
Overall I think there is a lack of a common thread in the first two sections. The large number of references included is appreciated. Still, the reader must perceive certain information without consulting all the references (only those in which he is particularly interested or which are necessary for understanding the article). Please, consider rewriting these two sections.
Text spacing is wrong from page 18.
The X-axis in Figures 4, 5, and 10 is difficult to read. Please, consider modifying the labels.
Overall, the feeling one has after reading the article is that it is somewhat chaotic. I think the presented material is quite interesting, and if it were structured and presented correctly, it would be a good article. I think it would be appropriate for the article to be reviewed by an experienced researcher, as it does not seem to be well managed in the aspects mentioned above.
I encourage the authors to do this work and resubmit the article, as the content is good, but it still needs some work before it can be published.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 3 Report
The Authors did a good job.
They provided a better analysis of their results, and parts of the manuscript were well reorganized.
However, they should still avoid wording that is not appropriate for a scientific paper. For instance, they should indicate the proportion (in %) of false positives and false negatives instead of using the expressions “few false positives” and “lots of false negatives”. They should therefore revise the sentence below:
“Namely, the affirmative strategy works better when the detections of the base models are mostly correct (that is, there are few false positives) but several objects are left undetected (that is, there are lots of false negatives); the unanimous strategy obtains better results in the opposite case, and the consensus strategy provides a better trade-off when there is not a significant difference between false negatives and false positives in the base models.”
Finally, they should also remove the following sentence from the manuscript: “We hope that this study will aid the development of computer vision systems for guiding UAV vehicles and remote sensing applications with robust OD in the future”.
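As a rough illustration of the three voting strategies discussed in this report (affirmative, consensus, unanimous), the sketch below groups overlapping same-class boxes from several base models and keeps a group depending on how many models voted for it. It is not the authors' implementation: the function names `iou` and `vote`, the IoU threshold of 0.5, the reading of "consensus" as a strict majority of models, and the choice to keep the first box of each group as its representative are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, str]              # (box, class label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def vote(per_model: List[List[Detection]], strategy: str,
         iou_thr: float = 0.5) -> List[Detection]:
    """Keep detections according to a hypothetical voting rule.

    affirmative: at least one model detected the object
    consensus:   a strict majority of models detected it (assumed reading)
    unanimous:   every model detected it
    """
    n_models = len(per_model)
    groups: List[Dict] = []  # each group: representative box + voting models
    for model_idx, detections in enumerate(per_model):
        for box, label in detections:
            for group in groups:
                rep_box, rep_label = group["rep"]
                if rep_label == label and iou(rep_box, box) >= iou_thr:
                    group["models"].add(model_idx)
                    break
            else:
                # No matching group: this detection starts a new one.
                groups.append({"rep": (box, label), "models": {model_idx}})

    required = {
        "affirmative": 1,
        "consensus": n_models // 2 + 1,
        "unanimous": n_models,
    }[strategy]
    return [g["rep"] for g in groups if len(g["models"]) >= required]
```

Under these assumptions, the affirmative rule keeps the most boxes (fewer missed objects, more false positives) and the unanimous rule keeps the fewest, which is the trade-off the quoted sentence describes. Reporting the false-positive and false-negative percentages per strategy, as requested above, would require matching the kept boxes against ground-truth annotations.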
Author Response
Please see the attachment.
Author Response File: Author Response.docx