A New Multi-Scale Convolutional Model Based on Multiple Attention for Image Classification
Abstract
1. Introduction
2. Related Work
2.1. Different Convolution Methods
2.2. Multi-Scale Features
3. Approach
3.1. Multi-Scale Convolutional Model Based on Single Attention
3.2. Multi-Scale Convolutional Model Based on Multiple Attention
4. Experiments
4.1. Datasets and Experimental Details
4.2. Experiments on CIFAR-10 and CIFAR-100
4.3. Experiments on Fine-Grained Image Datasets
4.4. Comparisons with Prior Methods
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
Res-block | Residual block
ResNet | Network formed by stacking multiple Res-blocks
SE-block | Squeeze-and-Excitation block
SE-CNN | Network formed by stacking multiple SE-blocks
Res2-block | Residual block with hierarchical residual-like connections
Res2-CNN | Network formed by stacking multiple Res2-blocks
AMS-block | Multi-scale convolutional block based on single attention
AMS-CNN | Network formed by stacking multiple AMS-blocks
DAMS-block | Multi-scale convolutional block based on dual attention
DAMS-CNN | Network formed by stacking multiple DAMS-blocks
MAMS-block | Multi-scale convolutional block based on multiple attention
MAMS-CNN | Network formed by stacking multiple MAMS-blocks
Appendix A
Algorithm 1: AMS-block algorithm
Require: Input feature maps (x), number of channels (n)
Define AMS_block(x, n):
  Use the SE-block to obtain the channel-importance dictionary "dict"
  Sort "dict" and record the indices: index ← tf.nn.top_k(dict, n)
  Reorder the channels of x according to index: y ← tf.batch_gather(x, index)
  Divide the sorted channels into k groups, denoted as [y1, y2, …, yk]
  Perform a convolution operation on the first group: z ← Conv(y1); c ← z
  For i in (2, k) do:
    Add the feature maps after convolution to the current group: y ← Add(yi, z)
    Perform a convolution operation on this sum: zi ← Conv(y)
    Concatenate the groups along the channels: c ← Concat(c, zi)
    Pass the new feature maps to the next iteration: z ← zi
  End for
  Add x and the concatenated maps by using the residual-learning idea: z ← Add(c, x)
  Use point-wise convolution to get the output feature maps: block ← Conv(z)
  Return block
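For concreteness, the following is a minimal TensorFlow 2 sketch of Algorithm 1. The 3 × 3 group convolutions, ReLU activations, and SE reduction ratio r = 16 are assumptions not fixed by the pseudocode, and the TF 1.x tf.batch_gather is replaced by tf.gather with batch_dims=1; the paper's exact settings may differ.

```python
# Minimal sketch of the AMS-block (Algorithm 1) in TensorFlow 2.
# Assumptions: 3x3 convolutions per group, ReLU activations, SE reduction
# ratio r = 16, and n divisible by k.
import tensorflow as tf
from tensorflow.keras import layers

def ams_block(x, n, k=4, r=16):
    """x: input feature maps (B, H, W, n); n: channels; k: number of groups."""
    # SE-block: squeeze to a per-channel importance score in [0, 1].
    s = layers.GlobalAveragePooling2D()(x)               # (B, n)
    s = layers.Dense(n // r, activation="relu")(s)
    s = layers.Dense(n, activation="sigmoid")(s)         # the "dict" of scores
    # Sort channels by importance; top_k returns indices in descending order.
    _, index = tf.nn.top_k(s, n)                         # (B, n)
    y = tf.gather(x, index, axis=3, batch_dims=1)        # reorder channels of x
    # Divide the sorted channels into k groups [y1, ..., yk].
    groups = tf.split(y, k, axis=-1)                     # each (B, H, W, n/k)
    w = n // k
    # Hierarchical residual-like convolutions across the groups.
    z = layers.Conv2D(w, 3, padding="same", activation="relu")(groups[0])
    c = z                                                # concat accumulator
    for i in range(1, k):
        y_i = layers.Add()([groups[i], z])               # fuse previous scale
        z = layers.Conv2D(w, 3, padding="same", activation="relu")(y_i)
        c = layers.Concatenate()([c, z])                 # collect every scale
    # Residual connection with the input, then point-wise fusion.
    z = layers.Add()([c, x])
    return layers.Conv2D(n, 1)(z)                        # output feature maps

# Example: build one block on 32 x 32 x 64 inputs.
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = ams_block(inputs, n=64, k=4)
model = tf.keras.Model(inputs, outputs)
```

The channel reordering, the hierarchical loop, and the final residual addition mirror the corresponding lines of Algorithm 1 one-to-one.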
References
- Cao, Y.J.; Jia, L.L.; Chen, Y.X.; Lin, N.; Yang, C.; Zhang, B.; Liu, Z.; Li, X.X.; Dai, H.H. Recent Advances of Generative Adversarial Networks in Computer Vision. IEEE Access 2019, 7, 14985–15006. [Google Scholar] [CrossRef]
- Choi, J.; Kwon, J.; Lee, K.M. Real-Time Visual Tracking by Deep Reinforced Decision Making. Comput. Vis. Image Underst. 2018, 171, 10–19. [Google Scholar] [CrossRef] [Green Version]
- Shen, D.H.; Zhang, Y.Z.; Henao, R.; Su, Q.L.; Carin, L. Deconvolutional Latent-Variable Model for Text Sequence Matching. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 5438–5445. [Google Scholar]
- Liu, J.; Ren, H.L.; Wu, M.L.; Wang, J.; Kim, H.J. Multiple Relations Extraction Among Multiple Entities in Unstructured Text. Soft Comput. 2018, 22, 4295–4305. [Google Scholar] [CrossRef]
- Kim, G.; Lee, H.; Kim, B.K.; Oh, S.H.; Lee, S.Y. Unpaired Speech Enhancement by Acoustic and Adversarial Supervision for Speech Recognition. IEEE Signal Process. Lett. 2019, 26, 159–163. [Google Scholar] [CrossRef]
- Deena, S.; Hasan, M.; Doulaty, M.; Saz, O.; Hain, T. Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 572–582. [Google Scholar] [CrossRef] [Green Version]
- Xie, J.J.; Li, A.Q.; Zhang, J.G.; Cheng, Z.A. An Integrated Wildlife Recognition Model Based on Multi-Branch Aggregation and Squeeze-And-Excitation Network. Appl. Sci. 2019, 9, 2749. [Google Scholar] [CrossRef] [Green Version]
- Yang, Y.D.; Wang, X.F.; Zhao, Q.; Sui, T.T. Two-Level Attentions and Grouping Attention Convolutional Network for Fine-Grained Image Classification. Appl. Sci. 2019, 9, 1939. [Google Scholar] [CrossRef] [Green Version]
- Li, Z.L.; Dong, M.H.; Wen, S.P.; Hu, X.; Zhou, P.; Zeng, Z.G. CLU-CNNs: Object Detection for Medical Images. Neurocomputing 2019, 350, 53–59. [Google Scholar] [CrossRef]
- Jiang, Y.; Peng, T.T.; Tan, N. CP-SSD: Context Information Scene Perception Object Detection Based on SSD. Appl. Sci. 2019, 9, 2785. [Google Scholar] [CrossRef] [Green Version]
- Yang, J.F.; Liang, J.; Shen, H.; Wang, K.; Rosin, P.L.; Yang, M.H. Dynamic Match Kernel with Deep Convolutional Features for Image Retrieval. IEEE Trans. Image Process. 2018, 27, 5288–5301. [Google Scholar] [CrossRef] [Green Version]
- Yang, X.; Wang, N.N.; Song, B.; Gao, X.B. BoSR: A CNN-Based Aurora Image Retrieval Method. Neural Netw. 2019, 116, 188–197. [Google Scholar] [CrossRef] [PubMed]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
- He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Itti, L.; Koch, C.; Niebur, E. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef] [Green Version]
- Itti, L.; Koch, C. Computational Modelling of Visual Attention. Nat. Rev. Neurosci. 2001, 2, 194–203. [Google Scholar] [CrossRef] [Green Version]
- Meur, O.E.; Callet, P.L.; Barba, D.; Thoreau, D. A Coherent Computational Approach to Model Bottom-Up Visual Attention. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 802–817. [Google Scholar] [CrossRef] [Green Version]
- Corbetta, M.; Shulman, G.L. Control of Goal-Directed and Stimulus-Driven Attention in the Brain. Nat. Rev. Neurosci. 2002, 3, 201–215. [Google Scholar] [CrossRef]
- Baluch, F.; Itti, L. Mechanisms of Top-Down Attention. Trends Neurosci. 2011, 34, 210–224. [Google Scholar] [CrossRef]
- Zhang, J.M.; Bargal, S.A.; Lin, Z.; Brandt, J.; Shen, X.H.; Sclaroff, S. Top-Down Neural Attention by Excitation Backprop. Int. J. Comput. Vis. 2018, 126, 1084–1102. [Google Scholar] [CrossRef] [Green Version]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 7132–7141. [Google Scholar]
- Yang, Y.D.; Wang, X.F.; Zhang, H.Z. Local Importance Representation Convolutional Neural Network for Fine-Grained Image Classification. Symmetry 2018, 10, 479. [Google Scholar] [CrossRef] [Green Version]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 3–19. [Google Scholar]
- Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. arXiv 2019, arXiv:1904.01169. [Google Scholar] [CrossRef] [Green Version]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
- Lin, M.; Chen, Q.; Yan, S.C. Network In Network. arXiv 2014, arXiv:1312.4400. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.Q.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 1–9. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2818–2826. [Google Scholar]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar]
- Chollet, F. Xception: Deep Learning With Depthwise Separable Convolutions. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
- Howard, A.G.; Zhu, M.L.; Chen, B.; Kalenichenko, D.; Wang, W.J.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Zhang, X.Y.; Zhou, X.Y.; Lin, M.X.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 6848–6856. [Google Scholar]
- Zhang, T.; Qi, G.J.; Xiao, B.; Wang, J.D. Interleaved Group Convolutions for Deep Neural Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4373–4382. [Google Scholar]
- Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the International Conference on Learning Representations, Caribe Hilton, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–13. [Google Scholar]
- Dai, J.F.; Qi, H.Z.; Xiong, Y.W.; Li, Y.; Zhang, G.D.; Hu, H.; Wei, Y.C. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- Li, X.; Wang, W.H.; Hu, X.L.; Yang, J. Selective Kernel Networks. arXiv 2019, arXiv:1903.06586. [Google Scholar]
- Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway Networks. arXiv 2015, arXiv:1505.00387. [Google Scholar]
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2017, arXiv:1605.07146. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
- Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.M.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.F.; Shi, J.P.; Jia, J.Y. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 8759–8768. [Google Scholar]
- Lin, D.; Shen, D.G.; Shen, S.T.; Ji, Y.F.; Lischinski, D.N.; Cohen-Or, D.; Huang, H. ZigZagNet: Fusing Top-Down and Bottom-Up Context for Object Segmentation. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7490–7499. [Google Scholar]
- Li, W.B.; Wang, Z.C.; Yin, B.Y.; Peng, Q.X.; Du, Y.M.; Xiao, T.Z.; Yu, G.; Lu, H.T.; Wei, Y.C.; Sun, J. Rethinking on Multi-Stage Networks for Human Pose Estimation. arXiv 2019, arXiv:1901.00148. [Google Scholar]
- Zhao, Q.J.; Sheng, T.; Wang, Y.T.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H.B. M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 1–8. [Google Scholar]
- Xiao, T.T.; Liu, Y.C.; Zhou, B.L.; Jiang, Y.N.; Sun, J. Unified Perceptual Parsing for Scene Understanding. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 1–17. [Google Scholar]
- Yang, L.; Song, Q.; Wang, Z.H.; Jiang, M. Parsing R-CNN for Instance-Level Human Analysis. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 364–373. [Google Scholar]
- Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.; Vedaldi, A. Fine-Grained Visual Classification of Aircraft. arXiv 2013, arXiv:1306.5151. [Google Scholar]
- Krause, J.; Stark, M.; Jia, D.; Li, F.F. 3D Object Representations for Fine-Grained Categorization. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 3–6 December 2013; pp. 554–561. [Google Scholar]
- Gosselin, P.H.; Murray, N.; Jégou, H.; Perronnin, F. Revisiting the Fisher Vector for Fine-Grained Classification. Pattern Recogn. Lett. 2014, 49, 92–98. [Google Scholar] [CrossRef] [Green Version]
- Zhao, B.; Wu, X.; Feng, J.S.; Peng, Q.; Yan, S.C. Diversified Visual Attention Networks for Fine-Grained Object Classification. IEEE Trans. Multimed. 2017, 19, 1245–1256. [Google Scholar] [CrossRef] [Green Version]
- Gao, Y.; Beijbom, O.; Zhang, N.; Darrell, T. Compact Bilinear Pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 317–326. [Google Scholar]
- Kong, S.; Fowlkes, C. Low-Rank Bilinear Pooling for Fine-Grained Classification. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 365–374. [Google Scholar]
Model | Conv1 | Conv2 | Conv3 | Conv4 | Dense
---|---|---|---|---|---
ResNet-32 | (n1, 3 × 3, 1) | | | | GAP (8, 8), Dense (10/100)
SE-CNN | | | – | – |
Res2-CNN_A/B | | | Res2-block | Res2-block |
AMS-CNN_A/B | | | AMS-block | AMS-block |
DAMS-CNN_A/B | | | DAMS-block | DAMS-block |
MAMS-CNN_A/B | | | MAMS-block | MAMS-block |
Model | Groups | Accuracy (%) | Parameters | SPE | FLOPs
---|---|---|---|---|---
ResNet-32 | 1 | 92.49 | 0.46M | 22 | 1.87 ×
SE-CNN-32 | 1 | 92.75 | 0.47M | 29 | 1.90 ×
Res2-CNN_A | 2 | 92.64 | 0.49M | 26 | 1.96 ×
 | 4 | 92.77 | 0.49M | 27 | 1.95 ×
 | 8 | 92.25 | 0.48M | 29 | 1.94 ×
Res2-CNN_B | 2 | 92.77 | 0.49M | 55 | 1.97 ×
 | 4 | 92.88 | 0.47M | 65 | 1.88 ×
 | 8 | 92.82 | 0.44M | 90 | 1.78 ×
AMS-CNN_A | 2 | 93.16 | 0.50M | 29 | 2.01 ×
 | 4 | 93.13 | 0.49M | 33 | 1.97 ×
 | 8 | 93.31 | 0.49M | 37 | 1.95 ×
AMS-CNN_B | 2 | 93.18 | 0.50M | 80 | 2.01 ×
 | 4 | 93.14 | 0.50M | 112 | 1.72 ×
 | 8 | 93.35 | 0.46M | 169 | 1.57 ×
DAMS-CNN_A | 2 | 93.58 | 0.50M | 31 | 2.02 ×
 | 4 | 93.39 | 0.49M | 33 | 1.97 ×
 | 8 | 93.41 | 0.49M | 37 | 1.95 ×
DAMS-CNN_B | 2 | 93.33 | 0.50M | 91 | 2.01 ×
 | 4 | 93.39 | 0.50M | 128 | 1.72 ×
 | 8 | 93.60 | 0.46M | 185 | 1.58 ×
Model | Groups | Accuracy (%) | Parameters | SPE | FLOPs
---|---|---|---|---|---
ResNet-32 | 1 | 73.09 | 1.87M | 32 | 7.50 ×
SE-CNN | 1 | 74.19 | 1.90M | 42 | 7.61 ×
Res2-CNN_A | 2 | 73.47 | 1.96M | 36 | 7.85 ×
 | 4 | 73.48 | 1.95M | 37 | 7.80 ×
 | 8 | 74.19 | 1.94M | 39 | 7.75 ×
Res2-CNN_B | 2 | 74.35 | 1.95M | 80 | 7.81 ×
 | 4 | 74.35 | 1.86M | 94 | 7.47 ×
 | 8 | 75.17 | 1.76M | 117 | 7.05 ×
AMS-CNN_A | 2 | 75.32 | 2.01M | 40 | 8.06 ×
 | 4 | 75.25 | 1.97M | 42 | 7.88 ×
 | 8 | 75.36 | 1.95M | 47 | 7.78 ×
AMS-CNN_B | 2 | 75.31 | 1.98M | 111 | 7.96 ×
 | 4 | 75.87 | 1.98M | 154 | 6.80 ×
 | 8 | 75.42 | 1.82M | 201 | 6.22 ×
DAMS-CNN_A | 2 | 75.52 | 1.99M | 38 | 7.99 ×
 | 4 | 75.34 | 1.97M | 43 | 7.80 ×
 | 8 | 75.25 | 1.94M | 48 | 7.71 ×
DAMS-CNN_B | 2 | 75.41 | 1.99M | 125 | 7.97 ×
 | 4 | 75.38 | 1.99M | 167 | 6.81 ×
 | 8 | 75.43 | 1.82M | 216 | 6.23 ×
Model | Conv1 | Conv2 | Conv3 | Conv4 | Conv5 | Conv6 | Dense
---|---|---|---|---|---|---|---
ResNet | (16, 7, 2), (16, 5, 2), (16, 3, 2) | | | | | | GAP (7, 7), Dense (100/196)
SE-CNN | | | – | – | – | – |
Res2-CNN | | | Res2-block | Res2-block | Res2-block | Res2-block |
AMS-CNN | | | AMS-block | AMS-block | AMS-block | AMS-block |
DAMS-CNN | | | DAMS-block | DAMS-block | DAMS-block | DAMS-block |
MAMS-CNN | | | MAMS-block | MAMS-block | MAMS-block | MAMS-block |
Model | Accuracy (%) | Parameters | SPE | FLOPs |
---|---|---|---|---|
ResNet | 83.05 | 10.32M | 87 | 4.13 × |
SE-CNN | 84.22 | 10.39M | 95 | 4.17 × |
Res2-CNN | 82.27 | 10.71M | 94 | 4.35 × |
AMS-CNN | 85.42 | 10.85M | 100 | 4.35 × |
DAMS-CNN | 85.57 | 10.85M | 104 | 4.58 × |
MAMS-CNN | 86.56 | 10.90M | 111 | 4.37 × |
Model | Accuracy (%) | Parameters | SPE | FLOPs |
---|---|---|---|---|
ResNet | 82.84 | 10.33M | 119 | 4.12 × |
SE-CNN | 83.09 | 10.41M | 132 | 4.16 × |
Res2-CNN | 83.80 | 10.73M | 129 | 4.28 × |
AMS-CNN | 88.65 | 10.87M | 142 | 4.34 × |
DAMS-CNN | 89.02 | 10.87M | 146 | 4.34 × |
MAMS-CNN | 89.15 | 10.93M | 156 | 4.36 × |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).