Scene Recognition Based on Recurrent Memorized Attention Network
Abstract
1. Introduction
- (1) We propose a novel framework for the end-to-end scene recognition task that performs object-based scene classification by recurrently locating and memorizing essential objects with an attention mechanism.
- (2) Based on the proposed framework, we propose a multi-task mechanism that successively attends to the different essential objects in a scene image and recurrently fuses the features of the attended objects into a memory to improve scene recognition accuracy (a schematic sketch of this loop follows the list).
- (3) We construct a new scene dataset named Scene 30, which contains 4608 color images of 30 different scene categories, covering both indoor and outdoor scenes.
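To make contribution (2) concrete, the following minimal NumPy sketch mimics the recurrent attend-and-fuse loop at the core of the framework. It is schematic only: `locate_and_pool` stands in for the attention localization module of Section 3.2, a single `tanh` update replaces the LSTM memory of Section 3.4, and all names, weights, and dimensions are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, CLASSES = 512, 3, 30               # feature dim, attention steps, Scene 30 classes

Wh = rng.normal(0.0, 0.01, (D, 2 * D))   # toy fusion weights (stand-in for the LSTM)
Wc = rng.normal(0.0, 0.01, (CLASSES, D)) # toy classifier weights

def locate_and_pool(image_feat, h):
    """Stand-in for the attention localization module (Section 3.2): in the
    real model, h regresses an affine matrix M and crops an object region;
    here we simply return a random D-dim 'object feature'."""
    return rng.normal(size=D)

image_feat = rng.normal(size=(14, 14, D))     # conv feature map from the extractor
h = np.zeros(D)                               # memory starts empty
for _ in range(T):                            # recurrently attend and fuse
    f_k = locate_and_pool(image_feat, h)
    h = np.tanh(Wh @ np.concatenate([h, f_k]))  # fuse the new object into memory
logits = Wc @ h                               # classify from the fused memory
print("predicted scene class:", int(np.argmax(logits)))
```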
2. Related Work
2.1. Low-Level Image Feature Based Scene Recognition
2.2. Object-Based Scene Recognition
3. Scene Recognition Based on RMAN
3.1. Feature Extraction Module
3.2. Attention Localization Module
- (1) Parameter regression: initialize the parameter matrix M required for the transformation, or regress it with the LSTM network (described in Section 3.4).
- (2) Coordinate mapping: compute the coordinate grid on fI that corresponds to the coordinates of fk.
- (3) Bilinear interpolation: generate the feature map fk of the attention area through bilinear interpolation. (A minimal sketch of steps (2) and (3) follows this list.)
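The following minimal NumPy sketch illustrates steps (2) and (3): it builds the coordinate grid of fk, maps it through a given 2 × 3 affine matrix M into the frame of fI, and samples by bilinear interpolation. The normalized [-1, 1] coordinate convention follows spatial transformer networks; the function and variable names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def attend(f_i, M, out_h, out_w):
    """Crop an attention feature map f_k (out_h x out_w x C) from f_i
    (H x W x C) using a 2x3 affine matrix M: grid mapping + bilinear
    interpolation, following the steps listed in Section 3.2."""
    H, W, C = f_i.shape
    # (2) Coordinate mapping: normalized grid of f_k -> coordinates in f_i.
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])  # 3 x N
    src = M @ grid                           # 2 x N, normalized coords in f_i
    x = (src[0] + 1) * (W - 1) / 2           # back to pixel coordinates
    y = (src[1] + 1) * (H - 1) / 2
    # (3) Bilinear interpolation at the (possibly fractional) source points.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2); x1 = x0 + 1
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2); y1 = y0 + 1
    wx, wy = x - x0, y - y0
    f_k = (f_i[y0, x0] * ((1 - wy) * (1 - wx))[:, None]
         + f_i[y0, x1] * ((1 - wy) * wx)[:, None]
         + f_i[y1, x0] * (wy * (1 - wx))[:, None]
         + f_i[y1, x1] * (wy * wx)[:, None])
    return f_k.reshape(out_h, out_w, C)

# Example: an M that zooms into the central half of f_i (scale 0.5, no shift).
M = np.array([[0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
f_k = attend(np.random.rand(14, 14, 512), M, 7, 7)
```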
3.3. Recurrent Memorized Attention Module
3.4. Process of Model Learning
- (1) Positioning redundancy. The constraint that the loss function places on the RMAN model can trap the Spatial Transformer (ST) layer in the same area of the image (a local minimum), so the features extracted for the essential scene objects become redundant. Consequently, even though the number of network iterations increases, the essential object areas found in successive iterations are similar or identical to the previous ones. This contradicts the model design, since the added network complexity brings no improvement to the final scene classification result.
- (2) Ignoring small objects. The Spatial Transformer (ST) layer tends to locate relatively large objects and to ignore small ones. In the scene classification task, however, owing to the complexity and diversity of scene images, small objects often play an important role in scene recognition. For example, small objects such as a “mouse” or a “telephone” are important cues for characterizing the office scene category. To address these two issues, we introduce the following constraints (an illustrative sketch of both follows this list):
- Anchor Constraint
- Scale Constraint
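This excerpt does not reproduce the exact constraint formulas, so the NumPy sketch below illustrates only one plausible reading: the anchor constraint ties the translation parameters of each recurrent step to a distinct anchor point, so successive attention windows cannot collapse onto the same region, while the scale constraint penalizes scale factors above a threshold, keeping windows tight enough to reach small objects. The squared-penalty forms, the anchor placement, and the `max_scale` threshold are all assumptions made for illustration.

```python
import numpy as np

def anchor_constraint(Ms, anchors):
    """Assumed form: pull each step's translation (t_x, t_y), i.e., the last
    column of its 2x3 affine matrix, toward a distinct anchor point so the
    recurrent attention windows stay spatially diverse."""
    return sum(np.sum((M[:, 2] - a) ** 2) for M, a in zip(Ms, anchors))

def scale_constraint(Ms, max_scale=0.5):
    """Assumed form: squared hinge on the diagonal scale factors (s_x, s_y),
    penalizing windows larger than max_scale so small objects can be found."""
    return sum(np.sum(np.maximum(np.diag(M[:, :2]) - max_scale, 0.0) ** 2)
               for M in Ms)

# Two attention steps with anchors in opposite quadrants (normalized coords).
Ms = [np.array([[0.9, 0.0, 0.1], [0.0, 0.9, 0.1]]),
      np.array([[0.4, 0.0, -0.5], [0.0, 0.4, 0.5]])]
anchors = [np.array([-0.5, -0.5]), np.array([0.5, 0.5])]
loss = anchor_constraint(Ms, anchors) + 0.1 * scale_constraint(Ms)
print(f"auxiliary constraint loss: {loss:.4f}")
```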
4. Experimental Results
4.1. Dataset
4.2. Experimental Configuration
4.3. Results and Discussion
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; Oliva, A. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2014; pp. 487–495.
- Xiao, J.; Hays, J.; Ehinger, K.A.; Oliva, A.; Torralba, A. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3485–3492.
- Quattoni, A.; Torralba, A. Recognizing indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 413–420.
- Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175.
- Margolin, R.; Zelnik-Manor, L.; Tal, A. OTC: A novel local descriptor for scene classification. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 377–391.
- Wu, J.; Rehg, J.M. CENTRIST: A visual descriptor for scene categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 1489–1501.
- Xiao, Y.; Wu, J.; Yuan, J. mCENTRIST: A multi-channel feature generation mechanism for scene categorization. IEEE Trans. Image Process. 2013, 23, 823–836.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2012; pp. 1097–1105.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Lafferty, J.; McCallum, A.; Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data; Morgan Kaufmann: San Francisco, CA, USA, 2001.
- Stamp, M. A Revealing Introduction to Hidden Markov Models; Department of Computer Science, San Jose State University: San Jose, CA, USA, 2004; pp. 26–56.
- Geman, S.; Graffigne, C. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, Berkeley, CA, USA, 3–11 August 1986; Volume 1, p. 2.
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
- Othman, K.M.; Rad, A.B. An indoor room classification system for social robots via integration of CNN and ECOC. Appl. Sci. 2019, 9, 470.
- Chen, P.H.; Lin, C.J.; Schölkopf, B. A tutorial on ν-support vector machines. Appl. Stoch. Models Bus. Ind. 2005, 21, 111–136.
- Rafiq, M.; Rafiq, G.; Agyeman, R.; Jin, S.I.; Choi, G.S. Scene classification for sports video summarization using transfer learning. Sensors 2020, 20, 1702.
- Li, L.J.; Socher, R.; Fei-Fei, L. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2036–2043.
- Sudderth, E.B.; Torralba, A.; Freeman, W.T.; Willsky, A.S. Learning hierarchical models of scenes, objects, and parts. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV 05), Beijing, China, 17–21 October 2005; Volume 1, pp. 1331–1338.
- Choi, M.J.; Lim, J.J.; Torralba, A.; Willsky, A.S. Exploiting hierarchical context on a large database of object categories. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 129–136.
- Li, C.; Parikh, D.; Chen, T. Automatic discovery of groups of objects for scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2735–2742.
- Wu, R.; Wang, B.; Wang, W.; Yu, Y. Harvesting discriminative meta objects with deep CNN features for scene classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1287–1295.
- Cheng, X.; Lu, J.; Feng, J.; Yuan, B.; Zhou, J. Scene recognition with objectness. Pattern Recognit. 2018, 74, 474–487.
- Shao, X.; Zhang, J.; Bao, B.K.; Xia, Y. Automatic scene recognition based on constructed knowledge space learning. IEEE Access 2019, 7, 102902–102910.
- Shi, J.; Zhu, H.; Yu, S.; Wu, W.; Shi, H. Scene categorization model using deep visually sensitive features. IEEE Access 2019, 7, 45230–45239.
- Yin, W.; Ebert, S.; Schütze, H. Attention-based convolutional neural network for machine comprehension. arXiv 2016, arXiv:1602.04341.
- Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008.
- Lin, D.; Shen, X.; Lu, C.; Jia, J. Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1666–1674.
- Liu, X.; Xia, T.; Wang, J.; Yang, Y.; Zhou, F.; Lin, Y. Fully convolutional attention networks for fine-grained recognition. arXiv 2016, arXiv:1603.06765.
- Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5209–5217.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2015; pp. 91–99.
- Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2015; pp. 2017–2025.
- Xue, X.; Zhang, W.; Zhang, J.; Wu, B.; Fan, J.; Lu, Y. Correlative multi-label multi-instance image annotation. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 651–658.
- Wang, J.; Yang, Y.; Mao, J.; Huang, Z.; Huang, C.; Xu, W. CNN-RNN: A unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2285–2294.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73.
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
- Chollet, F. Keras. Available online: https://github.com/keras-team/keras (accessed on 20 October 2020).
- Juneja, M.; Vedaldi, A.; Jawahar, C.V.; Zisserman, A. Blocks that shout: Distinctive parts for scene classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 923–930.
- Lin, D.; Lu, C.; Liao, R.; Jia, J. Learning important spatial pooling regions for scene classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3726–3733.
- Gong, Y.; Wang, L.; Guo, R.; Lazebnik, S. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 392–407.
- Sharif Razavian, A.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 806–813.
- Zuo, Z.; Wang, G.; Shuai, B.; Zhao, L.; Yang, Q.; Jiang, X. Learning discriminative and shareable features for scene classification. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 552–568.
| Methods | Accuracy (%) |
| --- | --- |
| VGG-16 [42] | 82.51 |
| PlaceNet 205 (VGG-16) [1] | 86.77 |
| PlaceNet 365 (VGG-16) [10] | 86.67 |
| HybridNet 1365 (VGG-16) [10] | 87.08 |
| Constructed Knowledge Sub-graph Learning [28] | 88.29 |
| RMAN | 89.32 |
| RMAN_fc + SVM | 89.47 |