A Scene Classification Model Based on Global-Local Features and Attention in Lie Group Space
Abstract
1. Introduction
- We propose a multi-scale branching model for RSSC that extracts multi-scale, more discriminative scene features in a fine-grained manner.
- We propose a global-local fusion module to efficiently fuse features at different scales. The module combines spatial and channel attention mechanisms with shortcut connections, which helps the model focus on crucial regions and ignore irrelevant ones.
- Compared with some existing models, our proposed model is more lightweight and achieves a better balance between classification accuracy and computational performance.
2. Related Works
2.1. RSSC Based on Features of Different Levels
2.2. RSSC Based on Attention Mechanism
3. Method
3.1. Overall Framework
3.2. Multi-Scale Sampling of HRRSIs
3.3. Hierarchical Parallel Model
3.4. Global Feature Extraction
3.5. Local Feature Extraction
- The above operation effectively increases the interaction of feature information between the different partition modules.
- This design achieves multi-channel, multi-dimensional feature extraction at a finer granularity, effectively expanding the receptive field while limiting the growth in model parameters and improving the computational efficiency of the model.
- More specifically, traditional multi-scale feature extraction mainly adopts multiple parallel branches, each with a fixed kernel size, without considering the relationship between the feature maps of different branches. In our model, by contrast, the feature map obtained from each branch is fed to the next branch, which enables feature reuse and promotes the interaction of feature information between different channels. This also effectively enlarges the receptive field: in two consecutive convolution operations, for example, the receptive field of the first convolution is 7 × 7 (each output value addresses 49 values in the input feature map), while the effective receptive field of the second convolution is 9 × 9 (each output value addresses 81 values in the input feature map); the kernel size itself is unchanged, but the receptive field is enlarged. A minimal sketch of this hierarchical branching is given after this list.
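A minimal PyTorch sketch of this hierarchical branching is shown below. It is illustrative only, not the exact block used in our model: the class name `HierarchicalLocalBlock`, the number of splits, the 3 × 3 kernel size, and the channel widths are assumptions made for demonstration; what it reproduces from the description above is the reuse of each branch's output by the next branch and the resulting growth of the receptive field.

```python
import torch
import torch.nn as nn

class HierarchicalLocalBlock(nn.Module):
    """Illustrative hierarchical multi-branch block: the input channels are split
    into groups, and each group's 3x3 convolution also receives the previous
    branch's output, so later branches cover progressively larger receptive
    fields while every kernel stays 3x3."""

    def __init__(self, channels: int, splits: int = 4):
        super().__init__()
        assert channels % splits == 0
        self.splits = splits
        width = channels // splits
        # one 3x3 convolution per split except the first (passed through unchanged)
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
             for _ in range(splits - 1)]
        )
        self.bns = nn.ModuleList([nn.BatchNorm2d(width) for _ in range(splits - 1)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(x, self.splits, dim=1)
        outputs = [groups[0]]          # first group: identity branch
        prev = None
        for i, (conv, bn) in enumerate(zip(self.convs, self.bns)):
            # feature reuse: add the previous branch's output before convolving
            inp = groups[i + 1] if prev is None else groups[i + 1] + prev
            prev = self.relu(bn(conv(inp)))
            outputs.append(prev)
        return torch.cat(outputs, dim=1)   # restore the original channel count

if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)
    print(HierarchicalLocalBlock(64)(x).shape)   # torch.Size([2, 64, 56, 56])
```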
3.6. Global-Local Fusion
3.6.1. Channel Attention
3.6.2. Spatial Attention
3.6.3. Fusion Module
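Because the fusion module combines channel attention, spatial attention, and a shortcut connection (see the contributions in Section 1), a minimal CBAM-style sketch is given below. It is an illustration under stated assumptions rather than the exact module of our model: the additive fusion of the two streams, the reduction ratio of 16, and the 7 × 7 spatial-attention kernel are choices made only for demonstration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze global context per channel and re-weight the channels (CBAM-style)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))      # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))       # max-pooled channel descriptor
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights

class SpatialAttention(nn.Module):
    """Highlight crucial spatial regions from pooled channel statistics."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))

class GlobalLocalFusion(nn.Module):
    """Fuse global and local feature maps, apply channel then spatial attention,
    and keep a shortcut connection around the attention path."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_attention = ChannelAttention(channels)
        self.spatial_attention = SpatialAttention()

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        fused = global_feat + local_feat                 # additive fusion (assumption)
        refined = self.spatial_attention(self.channel_attention(fused))
        return refined + fused                           # shortcut connection

if __name__ == "__main__":
    g = torch.randn(2, 128, 28, 28)
    loc = torch.randn(2, 128, 28, 28)
    print(GlobalLocalFusion(128)(g, loc).shape)          # torch.Size([2, 128, 28, 28])
```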
4. Experiments
4.1. Experimental Environment
4.1.1. Datasets
4.1.2. Experimental Parameter Setting and Evaluation Metrics
4.2. Comparison with SOTA Models
4.2.1. Experimental Results of AID
- Our proposed model achieved 95.09% and 97.31% at training ratios of 20% and 50%, respectively, improvements of 2.65%, 2.8%, and 1.98% over ResNet50 [62], ResNet50+CBAM [1], and ResNet50+HFAM [1]. These results indicate that our proposed model achieves better classification performance.
- Models with an attention mechanism achieve higher classification accuracy than traditional models without one. For example, ResNet50+CBAM [1] improved by 0.13% over ResNet50 [62], and VGG16+HFAM [1] improved by 6.25% over VGG-VD-16 [10]. These results confirm the positive role of attention mechanisms in scene classification.
- Our proposed model can selectively focus on crucial feature information from key regions of the scene. According to the experimental results, it improved by 0.67%, 0.4%, and 1.98% over Fine-tune MobileNet V2 [63], DS-SURF-LLC+Mean-Std-LLC+MO-CLBP-LLC [64], and ResNet50+HFAM [1], respectively. We therefore believe that combining channel attention and spatial attention yields more discriminative features and better classification performance.
Models | 20% Training Ratio | 50% Training Ratio
---|---|---
CaffeNet [10] | ||
VGG-VD-16 [10] | ||
GoogLeNet [10] | ||
Fusion by addition [65] | − | |
LGRIN [30] | ||
TEX-Net-LF [66] | ||
DS-SURF-LLC+Mean-Std-LLC+MO-CLBP-LLC [64] | ||
LiG with RBF kernel [55] | ||
APDC-Net [67] | |
VGG19 [62] | ||
ResNet50 [1] | ||
ResNet50+SE [1] | ||
ResNet50+CBAM [1] | ||
ResNet50+HFAM [1] | ||
InceptionV3 [62] | ||
DenseNet121 [68] | ||
DenseNet169 [68] | ||
MobileNet [69] | ||
EfficientNet [70] | ||
Two-stream deep fusion Framework [71] | ||
Fine-tune MobileNet V2 [63] | ||
SE-MDPMNet [63] | ||
Two-stage deep feature Fusion [72] | − | |
Contourlet CNN [73] | − | |
LCPP [74] | ||
RSNet [75] | ||
SPG-GAN [76] | ||
TSAN [77] | ||
LGDL [29] | ||
VGG16+CBAM [1] | ||
VGG16+SE [1] | ||
VGG16+HFAM [1] | ||
Proposed |
4.2.2. Experimental Results of RSICB-256
- ViT-based models achieve better performance because they use a global attention mechanism to model the global context. Our model utilizes both global and local attention mechanisms, and its classification accuracy is improved to some extent compared with the ViT-based models.
4.2.3. Experimental Results of NWPU-RESISC45
- Since this dataset contains more scene categories than the two datasets above and the training ratios are only 10% and 20%, the classification results of all models are lower than on the previous two datasets.
- Compared with other models, our proposed model still achieves higher classification accuracy. Specifically, at a training ratio of 10%, accuracy increases by 0.16%, 2.54%, and 1.11% over ResNet50+EAM [81], ResNet101+HFAM [1], and ViT-B-16 [78], respectively. At a training ratio of 20%, accuracy increases by 1.44%, 1.14%, and 4.03% over PVT-V2-B0 [80], LiG with RBF kernel [55], and ResNet101 [1], respectively. These results show that our proposed model is also effective on datasets with many scene categories.
- Under the same training ratio, ViT-based models (such as ViT-B-16 [78], T2T-ViT-12 [79], and PVT-V2-B0 [80]) achieve higher classification accuracy than classical CNN models (such as GoogLeNet [82]), mainly because they compensate for the classical CNN's weakness in modeling global context. Our proposed model exploits not only global context information but also a local spatial attention mechanism that extracts local detail features, filling the gap in local feature information. Furthermore, because transformer-based methods lack the convolutional inductive bias, they require more training samples; at training ratios of 10% and 20%, for example, the classification accuracy of ViT-B-16 [78] is 90.96% and 93.36%, respectively, which is lower than that of our model, while its computational complexity is higher.
Models | 10% Training Ratio | 20% Training Ratio
---|---|---
GoogLeNet [82] | ||
SCCov [83] | ||
ACNet [61] | ||
ViT-B-16 [78] | ||
T2T-ViT-12 [79] | ||
PVT-V2-B0 [80] | ||
LGRIN [30] | ||
LiG with RBF kernel [55] | ||
ResNet50 [1] | ||
ResNet50+EAM [81] | ||
ResNet50+SE [1] | ||
ResNet50+CBAM [1] | ||
ResNet50+HFAM [1] | ||
ResNet101 [1] | ||
ResNet101+SE [1] | ||
ResNet101+CBAM [1] | ||
ResNet101+HFAM [1] | ||
VGG16 [1] | ||
VGG16+SE [1] | ||
VGG16+CBAM [1] | ||
VGG16+HFAM [1] | ||
Proposed |
4.3. Comparison of the Number of Model Parameters and Computational Performance
- Most models that reach 90% classification accuracy have a large number of parameters. Specifically, the OA of ResNet50+SE [1] is 95.84% with a parameter size of 26.28 M, and the OA of Contourlet CNN [73] is 96.87% with 12.6 M parameters, whereas our model reaches 97.31% with fewer parameters (12.216 M).
- Compared with ResNet50+CBAM [1], VGG-VD-16 [10], and ResNet50+HFAM [1], our model's GMACs are lower by 0.6592, 6.4765, and 0.5825, respectively. These results show that our model performs better on the above metrics, again verifying its validity; a profiling sketch for reproducing such figures follows below.
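The parameter counts and GMACs reported in this subsection can be reproduced for any of the compared backbones with a profiling tool. The snippet below is a small sketch assuming the third-party thop package and a torchvision ResNet-50 stand-in with 30 output classes (the number of AID scene categories); it is not the script used for our measurements.

```python
import torch
from torchvision.models import resnet50
from thop import profile  # pip install thop

model = resnet50(num_classes=30)        # example backbone only; AID has 30 scene classes
dummy = torch.randn(1, 3, 224, 224)     # a single 224x224 RGB scene image

macs, params = profile(model, inputs=(dummy,), verbose=False)
print(f"Parameters: {params / 1e6:.2f} M")   # ResNet-50 is roughly 25.6 M parameters
print(f"GMACs:      {macs / 1e9:.4f} G")     # multiply-accumulate operations per image
```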
4.4. Ablation Experiment
4.5. Visual Comparison of Different Attention Mechanisms
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
AID | Aerial Image Dataset |
BAM | Bottleneck Attention Module |
BN | Batch Normalization |
BoVW | Bag of Visual Words |
CBAM | Convolutional Block Attention Module |
CM | Confusion Matrix |
CNN | Convolutional Neural Network |
CV | Computer Vision |
GELU | Gaussian Error Linear Unit |
GFE | Global Feature Extraction |
GLF | Global-Local Fusion |
GMACs | Giga Multiply-Accumulate Operations |
HRRSI | High-Resolution Remote Sensing Image |
LBP | Local Binary Pattern |
LFE | Local Feature Extraction |
LN | Layer Normalization |
MCNN | Multi-scale Convolutional Neural Network |
MSA | Multi-head Self-Attention |
NLP | Natural Language Processing |
NWPU-RESISC | Northwestern Polytechnical University Remote Sensing Image Scene Classification |
OA | Overall Accuracy |
SE | Squeeze and Excite |
RSSC | Remote Sensing Scene Classification |
RSICB | Remote Sensing Image Classification Benchmark |
ViT | Vision Transformer |
W-MSA | Window-based Multi-head Self-Attention |
References
- Wan, Q.; Qiao, Z.; Yu, Y.; Liu, Z.; Wang, K.; Li, D. A Hyperparameter-Free Attention Module Based on Feature Map Mathematical Calculation for Remote-Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5600318. [Google Scholar] [CrossRef]
- Xu, C.; Shu, J.; Zhu, G. Multi-Feature Dynamic Fusion Cross-Domain Scene Classification Model Based on Lie Group Space. Remote Sens. 2023, 15, 4790. [Google Scholar] [CrossRef]
- Xu, C.; Shu, J.; Zhu, G. Adversarial Remote Sensing Scene Classification Based on Lie Group Feature Learning. Remote Sens. 2023, 15, 914. [Google Scholar] [CrossRef]
- Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756. [Google Scholar] [CrossRef]
- Bai, L.; Liu, Q.; Li, C.; Zhu, C.; Ye, Z.; Xi, M. A lightweight and multiscale network for remote sensing image scene classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8012605. [Google Scholar] [CrossRef]
- Bai, L.; Liu, Q.; Li, C.; Ye, Z.; Hui, M.; Jia, X. Remote sensing image scene classification using multiscale feature fusion covariance network with octave convolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5620214. [Google Scholar] [CrossRef]
- Vetrivel, A.; Gerke, M.; Kerle, N.; Nex, F.; Vosselman, G. Disaster damage detection through synergistic use of deep learning and 3D point cloud features derived from very high resolution oblique aerial images and multiple-kernel-learning. ISPRS J. Photogramm. Remote Sens. 2018, 140, 45–59. [Google Scholar] [CrossRef]
- Zheng, K.; Gao, L.; Hong, D.; Zhang, B.; Chanussot, J. NonRegSRNet: A nonrigid registration hyperspectral super-resolution network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5520216. [Google Scholar] [CrossRef]
- Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Observ. Geoinf. 2022, 112, 102926. [Google Scholar] [CrossRef]
- Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Lu, X. AID: A benchmark dataset for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
- Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624915. [Google Scholar] [CrossRef]
- Wang, Z.; Chen, J.; Hoi, S.C.H. Deep learning for image super-resolution: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
- Wang, X.; Wang, S.; Ning, C.; Zhou, H. Enhanced feature pyramid network with deep semantic embedding for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7918–7932. [Google Scholar] [CrossRef]
- Su, Y.; Gao, L.; Jiang, M.; Plaza, A.; Sun, X.; Zhang, B. NSCKL: Normalized spectral clustering with kernel-based learning for semisupervised hyperspectral image classification. IEEE Trans. Cybern. 2022, 53, 6649–6662. [Google Scholar] [CrossRef]
- Qin, A.; Chen, F.; Li, Q.; Tang, L.; Yang, F.; Zhao, Y.; Gao, C. Deep Updated Subspace Networks for Few-Shot Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606714. [Google Scholar] [CrossRef]
- Ma, A.; Wan, Y.; Zhong, Y.; Wang, J.; Zhang, L. SceneNet: Remote sensing scene classification deep learning network using multi-objective neural evolution architecture search. ISPRS J. Photogramm. Remote Sens. 2021, 172, 171–188. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Proc. Conf. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar] [CrossRef]
- Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
- Lv, P.; Wu, W.; Zhong, Y.; Du, F.; Zhang, L. SCViT: A spatial-channel feature preserving vision transformer for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4409512. [Google Scholar] [CrossRef]
- Xu, K.; Deng, P.; Huang, H. Vision transformer: An excellent teacher for guiding small networks in remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4409512. [Google Scholar] [CrossRef]
- Huo, X.; Sun, G.; Tian, S.; Wang, Y.; Yu, L.; Long, J.; Zhang, W.; Li, A. HiFuse: Hierarchical multi-scale feature fusion network for medical image classification. Biomed Signal Process. 2024, 87, 105534. [Google Scholar] [CrossRef]
- Xu, Y.; Zhang, Q.; Zhang, J.; Tao, D. ViTAE: Vision transformer advanced by exploring intrinsic inductive bias. Adv. Neural Inf. Process. Syst. 2021, 34, 28522–28535. [Google Scholar] [CrossRef]
- Fu, B.; Zhang, M.; He, J.; Cao, Y.; Guo, Y.; Wang, R. StoHisNet: A hybrid multi-classification model with CNN and transformer for gastric pathology images. Biomed Signal Process. 2021, 34, 28522–28535. [Google Scholar] [CrossRef]
- Zhang, Y.; Liu, H.; Hu, Q. TransFuse: Fusing transformers and CNNs for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2021; pp. 14–24. [Google Scholar] [CrossRef]
- Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12175–12185. [Google Scholar] [CrossRef]
- Peng, Z.; Huang, W.; Gu, S.; Xie, L.; Wang, Y.; Jiao, J.; Ye, Q. Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 367–376. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
- Xu, C.; Shu, J.; Zhu, G. Scene Classification Based on Heterogeneous Features of Multi-Source Data. Remote Sens. 2023, 15, 325. [Google Scholar] [CrossRef]
- Xu, C.; Zhu, G.; Shu, J. A Combination of Lie Group Machine Learning and Deep Learning for Remote Sensing Scene Classification Using Multi-Layer Heterogeneous Feature Extraction and Fusion. Remote Sens. 2022, 14, 1445. [Google Scholar] [CrossRef]
- Xu, C.; Zhu, G.; Shu, J. A Lightweight and Robust Lie Group-Convolutional Neural Networks Joint Representation for Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5501415. [Google Scholar] [CrossRef]
- Xu, C.; Zhu, G.; Shu, J. Lie Group spatial attention mechanism model for remote sensing scene classification. Int. J. Remote Sens. 2022, 43, 2461–2474. [Google Scholar] [CrossRef]
- Huang, B.; Zhao, B.; Song, Y. Urban land-use mapping using a deep convolutional neural network with high spatial resolution multispectral remote sensing imagery. Remote Sens. Environ. 2018, 214, 73–86. [Google Scholar] [CrossRef]
- Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
- dos Santos, J.A.; Penatti, O.A.; Torres, R.D.S. Evaluating the potential of texture and color descriptors for remote sensing image retrieval and classification. ICCV 2010, 2, 203–208. [Google Scholar] [CrossRef]
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, New York, NY, USA, 2–5 November 2010; pp. 270–279. [Google Scholar] [CrossRef]
- Zhu, Q.; Zhong, Y.; Zhao, B.; Xia, G.-S.; Zhang, L. Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 2016, 6, 747–751. [Google Scholar] [CrossRef]
- Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
- Li, M.; Stein, A.; Bijker, W.; Zhan, Q. Urban land use extraction from very high resolution remote sensing imagery using a Bayesian network. ISPRS J. Photogramm. Remote Sens. 2016, 122, 192–205. [Google Scholar] [CrossRef]
- Zhang, F.; Du, B.; Zhang, L. Scene classification via a gradient boosting random convolutional network framework. IEEE Trans. Geosci. Remote Sens. 2016, 54, 1793–1802. [Google Scholar] [CrossRef]
- Lu, X.; Sun, H.; Zheng, X. A feature aggregation convolutional neural network for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7894–7906. [Google Scholar] [CrossRef]
- Liu, Y.; Zhong, Y.; Qin, Q. Scene classification based on multiscale convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7109–7121. [Google Scholar] [CrossRef]
- Tang, X.; Li, M.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. EMTCAL: Efficient multiscale transformer and cross-level attention learning for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5626915. [Google Scholar] [CrossRef]
- Chen, S.-B.; Wei, Q.-S.; Wang, W.-Z.; Tang, J.; Luo, B.; Wang, Z.-Y. Remote sensing scene classification via multi-branch local attention network. IEEE Trans. Image Process. 2022, 31, 99–109. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 2018, 31, 7132–7141. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. Proc. Adv. Neural Inf. Process. Syst. 2018, 31, 9423–9433. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Computer Vision—ECCV; Springer: Munich, Germany, 2018; pp. 3–19. [Google Scholar] [CrossRef]
- Song, H.; Deng, B.; Pound, M.; Özcan, E.; Triguero, I. A fusion spatial attention approach for few-shot learning. Inf. Fusion. 2022, 81, 187–202. [Google Scholar] [CrossRef]
- Qin, Z.; Wang, H.; Mawuli, C.B.; Han, W.; Zhang, R.; Yang, Q.; Shao, J. Multi-instance attention network for few-shot learning. Inf. Sci. 2022, 611, 464–475. [Google Scholar] [CrossRef]
- Park, J.; Woo, S.; Lee, J.-Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar] [CrossRef]
- Zhang, Q.-L.; Yang, Y.-B. SA-Net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239. [Google Scholar] [CrossRef]
- Li, H.; Deng, W.; Zhu, Q.; Guan, Q.; Luo, J.C. Local-Global Context-Aware Generative Dual-Region Adversarial Networks for Remote Sensing Scene Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5402114. [Google Scholar] [CrossRef]
- Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1155–1167. [Google Scholar] [CrossRef]
- Yu, D.; Guo, H.; Xu, Q.; Lu, J.; Zhao, C.; Lin, Y. Hierarchical attention and bilinear fusion for remote sensing image scene classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 13, 6372–6383. [Google Scholar] [CrossRef]
- Xu, C.; Zhu, G.; Shu, J. Robust Joint Representation of Intrinsic Mean and Kernel Function of Lie Group for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2020, 118, 796–800. [Google Scholar] [CrossRef]
- Xu, C.; Zhu, G.; Shu, J. A Lightweight Intrinsic Mean for Remote Sensing Classification With Lie Group Kernel Function. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1741–1745. [Google Scholar] [CrossRef]
- Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV) 2018, 14, 839–847. [Google Scholar] [CrossRef]
- van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. Available online: http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf (accessed on 5 May 2024).
- Zhao, Y.; Chen, Y.; Rong, Y.; Xiong, S.; Lu, X. Global-Group Attention Network With Focal Attention Loss for Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
- Li, H.; Dou, X.; Tao, C.; Hou, Z.; Chen, J.; Peng, J.; Deng, M.; Zhao, L. RSI-CB: A large-scale remote sensing image classification benchmark using crowdsourced data. Sensors 2020, 20, 1594. [Google Scholar] [CrossRef] [PubMed]
- Xu, K.; Huang, H.; Deng, P.; Li, Y. Deep feature aggregation framework driven by graph convolutional network for scene classification in remote sensing. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 5751–5765. [Google Scholar] [CrossRef]
- Tang, X.; Ma, Q.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. Attention consistent network for remote sensing scene classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2021, 14, 2030–2045. [Google Scholar] [CrossRef]
- Li, W.; Wang, Z.; Wang, Y.; Wu, J.; Wang, J.; Jia, Y.; Gui, G. Classification of high spatial resolution remote sensing scenes method using transfer learning and deep convolutional neural network. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 13, 1986–1995. [Google Scholar] [CrossRef]
- Zhang, B.; Zhang, Y.; Wang, S. A Lightweight and Discriminative Model for Remote Sensing Scene Classification With Multidilation Pooling Module. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2019, 12, 2636–2653. [Google Scholar] [CrossRef]
- Wang, X.; Xu, M.; Xiong, X.; Ning, C. Remote Sensing Scene Classification Using Heterogeneous Feature Extraction and Multi-Level Fusion. IEEE Access 2020, 8, 217628–217641. [Google Scholar] [CrossRef]
- Chaib, S.; Liu, H.; Gu, Y.; Yao, H. Deep feature fusion for VHR remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4775–4784. [Google Scholar] [CrossRef]
- Anwer, R.M.; Khan, F.S.; van de Weijer, J.; Molinier, M.; Laaksonen, J. Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification. ISPRS J. Photogramm. Remote Sens. 2018, 138, 74–85. [Google Scholar] [CrossRef]
- Bi, Q.; Qin, K.; Zhang, H.; Xie, J.; Li, Z.; Xu, K. APDC-Net: Attention pooling-based convolutional network for aerial scene classification. Remote Sens. Lett. 2019, 9, 1603–1607. [Google Scholar] [CrossRef]
- Aral, R.A.; Keskin, Ş.R.; Kaya, M.; Hacıömeroğlu, M. Classification of trashnet dataset based on deep learning models. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 1986–1995. [Google Scholar] [CrossRef]
- Pan, H.; Pang, Z.; Wang, Y.; Wang, Y.; Chen, L. A New Image Recognition and Classification Method Combining Transfer Learning Algorithm and MobileNet Model for Welding Defects. IEEE Access 2020, 8, 119951–119960. [Google Scholar] [CrossRef]
- Pour, A.M.; Seyedarabi, H.; Jahromi, S.H.A.; Javadzadeh, A. Automatic Detection and Monitoring of Diabetic Retinopathy using Efficient Convolutional Neural Networks and Contrast Limited Adaptive Histogram Equalization. IEEE Access 2020, 8, 136668–136673. [Google Scholar] [CrossRef]
- Yu, Y.; Liu, F. A two-stream deep fusion framework for high-resolution aerial scene classification. Comput. Intell. Neurosci. 2018, 2018, 1986–1995. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Liu, Y.; Ding, L. Scene classification based on two-stage deep feature fusion. IEEE Geosci. Remote Sens. Lett. 2018, 15, 183–186. [Google Scholar] [CrossRef]
- Liu, M.; Jiao, L.; Liu, X.; Li, L.; Liu, F.; Yang, S. C-CNN: Contourlet convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 2636–2649. [Google Scholar] [CrossRef]
- Sun, X.; Zhu, Q.; Qin, Q. A Multi-Level Convolution Pyramid Semantic Fusion Framework for High-Resolution Remote Sensing Image Scene Classification and Annotation. IEEE Access 2021, 9, 18195–18208. [Google Scholar] [CrossRef]
- Wang, J.; Zhong, Y.; Zheng, Z.; Ma, A.; Zhang, L. RSNet: The search for remote sensing deep neural networks in recognition tasks. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2520–2534. [Google Scholar] [CrossRef]
- Ma, A.; Yu, N.; Zheng, Z.; Zhong, Y.; Zhang, L. A Supervised Progressive Growing Generative Adversarial Network for Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5618818. [Google Scholar] [CrossRef]
- Zheng, J.; Wu, W.; Yuan, S.; Zhao, Y.; Li, W.; Zhang, L.; Dong, R.; Fu, H. A Two-Stage Adaptation Network (TSAN) for Remote Sensing Scene Classification in Single-Source-Mixed-Multiple-Target Domain Adaptation (S2M2T DA) Scenarios. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5609213. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3–7 May 2021; pp. 1–22. [Google Scholar] [CrossRef]
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 538–547. [Google Scholar] [CrossRef]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
- Zhao, Z.; Li, J.; Luo, Z.; Li, J.; Chen, C. Remote sensing image scene classification based on an enhanced attention module. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1926–1930. [Google Scholar] [CrossRef]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
- He, N.; Fang, L.; Li, S.; Plaza, J.; Plaza, A. Skip-connected covariance network for remote sensing scene classification. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 1461–1474. [Google Scholar] [CrossRef]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
Item | Content |
---|---|
CPU | Intel Core i7-4700 CPU @ 2.70 GHz |
Memory | 32 GB |
Operating system | CentOS 7.8 64 bit |
Hard disk | 1TB |
GPU | Nvidia Titan-X |
Python | 3.7.2 |
PyTorch | 1.4.0 |
CUDA | 10.0 |
Learning rate | |
Momentum | 0.73 |
Weight decay | |
Batch | 16 |
Saturation | 1.7 |
Subdivisions | 64 |
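As a rough illustration of the optimizer settings listed in the table above, the snippet below configures SGD with the stated momentum (0.73) and batch size (16). The learning rate and weight decay values are placeholders because the corresponding cells above are empty, and the model and data are random stand-ins; none of these choices should be read as our actual training configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholders: the learning rate and weight decay are not shown in the table above.
LR = 1e-3            # assumption, for illustration only
WEIGHT_DECAY = 5e-4  # assumption, for illustration only

model = nn.Linear(512, 30)                     # stand-in for the classification head
optimizer = torch.optim.SGD(model.parameters(),
                            lr=LR,
                            momentum=0.73,     # momentum as listed in the table
                            weight_decay=WEIGHT_DECAY)

# batch size 16 as listed in the table; the dataset here is random stand-in data
dataset = TensorDataset(torch.randn(64, 512), torch.randint(0, 30, (64,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

criterion = nn.CrossEntropyLoss()
for features, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```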
Models | 50% |
---|---|
CaffeNet [10] | |
VGG-VD-16 [10] | |
GoogLeNet [10] | |
Fusion by addition [65] | |
LGRIN [30] | |
TEX-Net-LF [66] | |
DS-SURF-LLC+Mean-Std-LLC+MO-CLBP-LLC [64] |
LiG with RBF kernel [55] | |
APDC-Net [67] |
VGG19 [62] | |
ResNet50 [1] | |
ResNet50+SE [1] | |
ResNet50+CBAM [1] | |
ResNet50+HFAM [1] | |
InceptionV3 [62] | |
DenseNet121 [68] | |
DenseNet169 [68] | |
MobileNet [69] | |
EfficientNet [70] | |
Two-stream deep fusion Framework [71] | |
Fine-tune MobileNet V2 [63] | |
SE-MDPMNet [63] | |
Two-stage deep feature Fusion [72] | |
Contourlet CNN [73] | |
LCPP [74] | |
RSNet [75] | |
SPG-GAN [76] | |
TSAN [77] | |
LGDL [29] | |
ViT-B-16 [78] | |
T2T-ViT-12 [79] | |
PVT-V2-B0 [80] | |
VGG16+CBAM [1] | |
VGG16+SE [1] | |
VGG16+HFAM [1] | |
Proposed |
Models | Acc (%) | Parameters (M) | GMACs (G) | Velocity (Samples/s) |
---|---|---|---|---|
CaffeNet [10] | 88.91 | 60.97 | 3.6532 | 32 |
GoogLeNet [10] | 85.67 | 7 | 0.7500 | 37 |
VGG-VD-16 [10] | 89.36 | 138.36 | 7.7500 | 35 |
LiG with RBF kernel [55] | 96.22 | 2.07 | 0.2351 | 43 |
ResNet50 [1] | 95.51 | 25.58 | 1.8555 | 38 |
ResNet50+SE [1] | 95.84 | 26.28 | 1.9325 | 38 |
ResNet50+CBAM [1] | 95.38 | 26.29 | 1.9327 | 38 |
ResNet50+HFAM [1] | 95.86 | 25.58 | 1.8556 | 38 |
Inception V3 [62] | 94.97 | 45.37 | 2.4356 | 21 |
Contourlet CNN [73] | 96.87 | 12.6 | 1.0583 | 35 |
SPG-GAN [76] | 87.36 | − | 2.1322 | 29 |
TSAN [77] | − | 381.67 | 3.2531 | 32 |
Proposed | 97.31 | 12.216 | 1.2735 | 45 |
Component | OA |
---|---|
Global Feature Extraction | |
+Local Feature Extraction | |
+Channel Attention | |
+Spatial Attention | |
+Fusion Module |
Model | Stage 1 | Stage 2 | Stage 3 | Stage 4 | OA |
---|---|---|---|---|---|
Ours | √ | | | | 90.32%
| √ | √ | | | 91.15%
| √ | √ | √ | | 93.27%
| √ | √ | √ | √ | 95.09%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).