BASN—Learning Steganography with a Binary Attention Mechanism
Round 1
Reviewer 1 Report
The article ‘BASN - Learning Steganography with Binary Attention Mechanism’ proposes two attention mechanisms, image texture complexity (ITC) and minimizing feature distortion (MFD), to improve the security of information transmitted by image steganography and to increase the information payload. Overall, the article is well written and introduces the topic well.
However, I have some major comments. The authors do not provide a detailed study of the existing state of the art. During the last few years, there have been works on the use of ITC for improving steganography; a detailed discussion of these is missing in the introductory section. It is also not clear why MFD is used along with ITC (lines 43-51).
I am not sure whether Equation 1 is correct: note the use of squares on both sides of the minus sign. How is f_itc calculated (Equation 4)? It is also not clear which of the equations concerning ITC and MFD are the authors' contributions. I would suggest the authors improve Section 2: it is difficult to follow, and not all the terms are completely defined. Also, is there any direct relation between the hidden information and the embedded image (referring to Figures 2 and 12)?
The results obtained in Section 4 need to be explained further, especially Figure 15 (a, b, c, d). Furthermore, there is no discussion of the limitations of the approach.
The conclusion section also lacks a discussion of perspectives for future work.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Dear Authors, the paper is sufficiently well written, with an accurate bibliography.
The paper proposes a binary attention mechanism to improve the security of image steganography while increasing the embedding payload capacity.
I would like to suggest several specific improvements, listed below, as well as a general improvement of the work: a better explanation of all the described concepts, in order to help and guide the reader throughout the whole paper.
Below are the specific improvements:
- Page 4: Explain in more detail the architecture shown in Figure 4 (for example, why the ELU function is used, spelling out the meaning of its acronym, and why a sigmoid is used), along with the meaning of the colours used.
- Page 4: Explain in more detail why a kernel size of 7 is used. While the choice of the median smoothing technique is clarified by Figure 5, the choice of the kernel size is presented to the reader without an explanation (for example, whether the authors tried other kernel sizes without success); see the sketch after this list.
- Page 5: Explain in more detail the role of λ in Eq. 8, and help the reader understand how the percentage values of the attention area average and the texture reduction gain average are obtained (for example, by adding a table with values).
- Page 6 (line 118): Why is Figure 9 shown before Figure 8 in the text? Explain Figure 8 in more detail. Define L1 and L2.
- Page 8: State explicitly what the Cartesian axes indicate.
- Page 9: Explain Figure 12 in more detail, including the meaning of the colours used.
- Line 141: Probably the ITC loss is Equation 8. Add a figure that shows the second phase of the fine-tuning process.
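For reference, a minimal sketch of the smoothing step in question, assuming the attention map is a 2-D array in [0, 1] and that a standard median filter (here via scipy) corresponds to the paper's operation; this is an illustration, not the authors' code:

import numpy as np
from scipy.ndimage import median_filter

# Hypothetical attention map; the kernel size of 7 below mirrors the
# value questioned in the comment above.
attention_map = np.random.rand(64, 64)
smoothed = median_filter(attention_map, size=7)  # 7x7 median window

# Comparing window sizes is one way the choice of 7 could be justified
# empirically: larger windows suppress isolated texture peaks more
# aggressively.
for k in (3, 5, 7, 9):
    delta = np.abs(median_filter(attention_map, size=k) - attention_map)
    print(k, float(delta.mean()))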
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Firstly, I would like to thank the authors for taking my review comments into consideration. The authors have added clarity to their figures with the updated captions. The equations have been corrected. The conclusion section presents the limitations of the work and also suggests future courses of action.
The authors have added a number of references in Section 1. However, it would be interesting to compare their proposed approach to these existing approaches.
Minor remarks:
- Lines 56-57 are not clear: "which therefore recreating or inferring from the embedded image what attention or how much capacity was available in the original cover image is possible."
Author Response
Response to Reviewer 1 Comments
Point 1: The authors have added a number of references in Section 1. However, it would be interesting to compare their proposed approach to these existing approaches.
Response 1: The added references do not relate closely to our models, since we focus on hiding the steganography from neural networks, to make sure the networks can execute their original tasks without being affected.
Point 2: Lines 56-57 are not clear: "which therefore recreating or inferring from the embedded image what attention or how much capacity was available in the original cover image is possible."
Response 2: We have restated the sentence from
With the help of MFD model, we align the latent space of the cover image and the embedded image, which therefore recreating or inferring from the embedded image what attention or how much capacity was available in the original cover image is possible.
into
With the help of the MFD model, we align the latent space of the cover image and the embedded image so that we can infer the original cover attention map using solely the embedded image. Afterwards, the hidden information is extracted at the locations and capacity indicated by the inferred attention map.
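To make the restated mechanism concrete, here is a minimal sketch of the alignment-then-inference idea; the module names, shapes, and the L2 form of the alignment loss are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in feature extractor shared by cover and embedded images."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ELU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ELU(),
        )
    def forward(self, x):
        return self.net(x)

class TinyAttention(nn.Module):
    """Stand-in attention head producing a per-pixel map in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(16, 1, 1), nn.Sigmoid())
    def forward(self, feats):
        return self.head(feats)

encoder, attention = TinyEncoder(), TinyAttention()
cover = torch.rand(1, 3, 64, 64)      # original cover image
embedded = torch.rand(1, 3, 64, 64)   # image after payload embedding

# Training-time alignment: an L2 penalty pulls the two latent
# representations together (one plausible feature-distortion loss).
mfd_loss = torch.mean((encoder(cover) - encoder(embedded)) ** 2)

# Extraction time: only the embedded image is available. Because the
# latents were aligned, the attention map inferred from the embedded
# image approximates the cover's map, telling the extractor where
# (and with how much capacity) to read the hidden payload.
with torch.no_grad():
    inferred_map = attention(encoder(embedded))
    read_mask = inferred_map > 0.5    # binarized read locations
print(float(mfd_loss), float(read_mask.float().mean()))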
Author Response File: Author Response.pdf