Robust Data Augmentation Generative Adversarial Network for Object Detection
Round 1
Reviewer 1 Report
A GAN is proposed to augment small datasets for CNN-based object detection and object recognition tasks, with its application demonstrated in the area of fire detection.
The authors propose enlarging the training set by training a new type of GAN. This GAN takes as input an image that does not contain the object(s) of interest and a bounding-box mask indicating where to place a candidate instance of an object to detect; it outputs a new instance of the object inserted into the image within the masked region. Existing work does not explicitly include mechanisms to locally insert object instances, and this work proposes a method to do so.
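For concreteness, the following minimal PyTorch sketch reflects my reading of the described interface; the function name insert_object, the channel-concatenation conditioning, and the mask convention are my assumptions for illustration, not the authors' actual code:

import torch

def insert_object(generator: torch.nn.Module,
                  background: torch.Tensor,  # (N, 3, H, W) object-free image
                  bbox_mask: torch.Tensor    # (N, 1, H, W) ones inside the box
                  ) -> torch.Tensor:
    """Return the background with a synthesized object inside the masked box."""
    # Condition the generator on both the background and the box location.
    generated = generator(torch.cat([background, bbox_mask], dim=1))
    # Composite: keep original pixels outside the box, generated pixels inside.
    return background * (1 - bbox_mask) + generated * bbox_mask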
A training loss that combines image fidelity outside the inserted region (termed background loss) with the quality of the generated object (based on mutual information) is used to perform the object insertion. The quality of reproduction is measured via the L2 difference between the discriminator's latent representation of the object and the generator's latent representation.
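In equation form, a plausible reading of these two terms is the following sketch (my assumption, not necessarily the paper's exact formulation), with background image $x$, box mask $M$, generated image $\hat{x} = G(x, M)$, and $\phi_D$, $\phi_G$ the discriminator's and generator's latent representations of the object region:

\mathcal{L}_{\mathrm{bg}} = \left\| (1 - M) \odot (\hat{x} - x) \right\|_{1},
\qquad
\mathcal{L}_{\mathrm{info}} = \left\| \phi_D(M \odot \hat{x}) - \phi_G(M \odot \hat{x}) \right\|_{2}^{2}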
Experiments are shown for object detection: YOLOv5 is trained on the FiSmo dataset and, separately, on the dataset augmented with RDAGAN. The authors also compare their generated images with those of baseline methods such as CycleGAN and CUT. Their results show that the proposed augmentation improves the flame detection performance of YOLOv5; specifically, localization gains of ~0.04 and ~0.03 are reported for specific subsets of the output, relative to the same network trained without the proposed GAN augmentation.
Questions:
- How are bounding boxes specified for the placement of the objects? Line 188 says "by sampling the coordinates of a masked region". How is this region found?
- Please unambiguously specify the discriminators used in Section 3.2.2 (Discriminator). The text says only that they are "similar to that of the PatchGAN" and "similar to the InfoGAN." Please state each discriminator's architecture concisely and specifically.
- GANs typically have a discriminator that does not take an encoding of the image as a latent input; this keeps the discriminator's internal representation of the object classes unique and independent of the generator. In this work, the information loss depends on a shared representation, since the L2 norm of that representation is the loss. How is the GAN trained when the discriminator must share a common latent space with the generator? This needs to be described specifically (see the sketch after this list).
- The authors should acknowledge the limited scope of their data and results, which may not translate across all object-insertion problems. In particular, it is not clear that such methods would succeed for objects with strong structural information, e.g., a penguin or a car. This should be clearly stated, or results to the contrary should be published.
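As a reference point for the shared-latent-space question above, here is a minimal InfoGAN-style discriminator in which the real/fake head and the latent-recovery head share one feature trunk; this PyTorch sketch is my illustration of the concern, not the paper's actual architecture:

import torch
import torch.nn as nn

class SharedDiscriminator(nn.Module):
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(  # shared feature extractor
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.adv_head = nn.Linear(128, 1)         # real/fake score
        self.q_head = nn.Linear(128, latent_dim)  # recovers the latent code

    def forward(self, x: torch.Tensor):
        features = self.trunk(x)  # the shared representation in question
        return self.adv_head(features), self.q_head(features)

With such a design, the information loss can be an L2 distance between the q_head output and the generator's latent code; it is exactly this coupling to a common latent space that raises the question of how stable joint training is.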
Comments:
The literature review does not sufficiently cover GAN-based augmentation across different dataset domains. The authors mention only a few papers that applied data augmentation to fire-scene datasets, yet many works have applied GAN-based augmentation to medical or remote-sensing images. Additional consideration of works such as the following is recommended:
Frid-Adar, M.; Diamant, I.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 2018, 321, 321-331. https://doi.org/10.1016/j.neucom.2018.09.013.
Lv, N.; et al. Remote Sensing Data Augmentation Through Adversarial Training. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2021, 14, 9318-9333. https://doi.org/10.1109/JSTARS.2021.3110842.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
I have carefully read the paper titled "Robust Data Augmentation Generative Adversarial Network for Object Detection", which uses a dataset containing a small number of fire images with strong occlusions to convert clean, fire-free images into fire images. The method can generate labeled fire-image data, and the generated images improve the YOLOv5 model's fire detection performance. However, there are still some problems in the current version, as follows:
1. The summary of the current state of research on fire-image generation is insufficient.
2. Please introduce the experimental data.
3. Are the adversarial loss and the information loss in Section 3.1 (Object Generation Network) the same as those in Section 3.2 (Image Translation Network)?
Author Response
Please see the attachment.
Author Response File: Author Response.pdf