Optimizing Multimodal Scene Recognition through Mutual Information-Based Feature Selection in Deep Learning Models
Abstract
1. Introduction
- Our approach differs from previous models that primarily focused on single-modal processing by introducing a multimodal fusion technique that operates on the same input visual data. Multiple types of information derived from a single image are processed simultaneously, which yields a more thorough understanding of the scene than prior single-modal techniques.
- To enhance feature quality and robustness, we employ MI-based feature selection. Our approach incorporates MI not only for feature selection but also within the context of multimodal fusion. We utilize MI to assess the interdependence between different data modalities and scene labels, enabling effective fusion strategies that are tailored to each modality’s relationship with scene categories.
- The model incorporates an additional layer for integrating optimized features extracted from the input image data, representing an innovative architectural element. This feature integration step significantly enhances the quality of feature representations, further contributing to the model’s superior performance compared to techniques that do not incorporate such integration.
- Our method is an end-to-end model, eliminating the need for the complex multistage pipelines typically employed in traditional scene recognition approaches.
2. Literature Review
3. Methodology and Materials
3.1. Our Multimodal Deep Learning Structure
3.2. MI for Feature Selection
- (1) Feature Selection: MI helps identify and select the most informative features or modalities for scene recognition. By calculating MI between each feature (F) and the target scene label (L), less relevant or redundant features can be pruned, enhancing computational efficiency and reducing the risk of overfitting. The formula for this feature selection process is
$$MI(F, L) = \sum_{f \in F} \sum_{l \in L} p(f, l)\,\log \frac{p(f, l)}{p(f)\,p(l)}$$
where:
- MI(F,L) represents the MI between the feature F and the scene label L.
- p(f,l) is the joint probability mass function of the feature F and the scene label L.
- p(f) and p(l) are the marginal probability mass functions of the feature F and the scene label L, respectively.
- (2) Multimodal Fusion: In multimodal scene recognition, MI quantifies the interdependence between different data modalities (e.g., text, audio, depth) and the scene labels (L). This assists in determining how to combine modalities effectively, such as weighting them based on their MI with the scene label. The weighted fusion formula can be represented as
$$\text{Weighted Fusion} = \sum_{i} w_i \, MI(M_i, L)$$
where:
- Weighted Fusion represents the combined information from multiple modalities.
- MI(M_i, L) is the mutual information between each modality (M_i) and the scene label L.
- w_1, w_2, etc., are weights assigned to each modality, which can be determined based on their MI values.
- (3) Optimization: MI can be employed in optimization processes to improve feature representations. Genetic algorithms, for example, can utilize MI as an objective function, enhancing the discriminative power and robustness of features extracted from the same input image data. The objective function for optimization can be defined as
$$\text{Objective} = \sum_{i=1}^{n} MI(F_i, L)$$
where:
- MI(F_i, L) represents the mutual information between each feature (F_i) and the scene label L.
- n is the number of features under consideration.
A minimal code sketch illustrating these three uses of MI is given directly after this list.
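To make the three roles of MI concrete, the following is a minimal NumPy sketch, not the authors' released implementation: the histogram-based MI estimator and the helper names (mutual_information, modality_weights, subset_objective) are illustrative assumptions that follow Equations (1)–(3).

```python
import numpy as np

def mutual_information(feature, labels, n_bins=16):
    """Estimate MI(F, L) for a single continuous feature by discretizing it
    into histogram bins and applying Equation (1) to the empirical
    joint/marginal probability mass functions."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    f = np.digitize(feature, edges[1:-1])          # bin index per sample
    mi = 0.0
    for b in np.unique(f):
        p_f = np.mean(f == b)
        for c in np.unique(labels):
            p_l = np.mean(labels == c)
            p_fl = np.mean((f == b) & (labels == c))
            if p_fl > 0.0:
                mi += p_fl * np.log(p_fl / (p_f * p_l))
    return mi

def modality_weights(modalities, labels):
    """Equation (2): derive fusion weights w_i from each modality's average
    MI with the scene labels (normalized so that the weights sum to 1)."""
    scores = np.array([
        np.mean([mutual_information(m[:, j], labels) for j in range(m.shape[1])])
        for m in modalities
    ])
    return scores / scores.sum()

def subset_objective(features, labels, mask):
    """Equation (3): objective for a candidate feature subset (binary mask),
    i.e., the sum of MI(F_i, L) over the selected features; a genetic
    algorithm would search for the mask that maximizes this value."""
    return sum(mutual_information(features[:, i], labels)
               for i in np.flatnonzero(mask))
```

For continuous deep features, a library estimator such as scikit-learn's mutual_info_classif can be substituted for the histogram-based estimate above.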
- As an initial step, we gather a dataset comprising feature vectors from the first fully connected layer of both models in our multimodal architecture. This dataset also includes the corresponding scene labels.
- For each feature in the dataset, MI is calculated with respect to the scene labels, following the MI formula. Importantly, this MI calculation takes place after the fusion of the two models.
- Based on the computed MI scores, we rank the features in descending order. Higher MI scores signify a stronger statistical dependence between the corresponding features and the scene labels.
- The top N features with the highest MI scores are chosen, considering factors such as model complexity and computational efficiency. These features form a feature subset.
- An additional layer for MI feature selection is incorporated after the fusion of the two models. This layer selects suitable features from both models and combines them into a unified feature vector.
- The combined feature vector is integrated into the deep model, enhancing its feature representation for scene recognition tasks. A sketch of this selection procedure is given below.
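As a concrete illustration of the steps above, here is a minimal sketch, assuming the fused 1024-dimensional activations are available as a NumPy array and using scikit-learn's mutual_info_classif as the MI estimator (the paper's own estimator may differ); the value N = 600 mirrors the size of the feature selection layer in the architecture table, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_top_features(fused_features, labels, n_select=600):
    """Score each fused (post-addition) feature by its MI with the scene
    labels, rank in descending order, and return the top-N feature indices."""
    mi_scores = mutual_info_classif(fused_features, labels)  # one score per column
    ranking = np.argsort(mi_scores)[::-1]                    # descending MI
    return ranking[:n_select], mi_scores

# Hypothetical usage with fused Fc_1/Fc_2 activations of shape (num_images, 1024):
# selected_idx, scores = select_top_features(fused, scene_labels, n_select=600)
# reduced = fused[:, selected_idx]   # unified feature vector passed to the next layer
```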
3.3. Dataset Used
3.3.1. Scene Classification Dataset
3.3.2. The AID Dataset
3.4. Scene Recognition
4. Results and Analysis
4.1. Experimental Setup
4.2. Results on Scene Classification Dataset
4.3. Results on AID Dataset
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Xie, L.; Lee, F.; Liu, L.; Kotani, K.; Chen, Q. Scene recognition: A comprehensive survey. Pattern Recognit. 2020, 102, 107205.
- Moeslund, T.B.; Hilton, A.; Krüger, V. A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 2006, 104, 90–126.
- Gupta, A.; Anpalagan, A.; Guan, L.; Khwaja, A.S. Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues. Array 2021, 10, 100057.
- Zhang, J.; Zhu, C.; Zheng, L.; Xu, K. ROSEFusion: Random optimization for online dense reconstruction under fast camera motion. ACM Trans. Graph. 2021, 40, 1–17.
- Saber, S.; Amin, K.; Pławiak, P.; Tadeusiewicz, R.; Hammad, M. Graph convolutional network with triplet attention learning for person re-identification. Inf. Sci. 2022, 617, 331–345.
- Saber, S.; Meshoul, S.; Amin, K.; Pławiak, P.; Hammad, M. A Multi-Attention Approach for Person Re-Identification Using Deep Learning. Sensors 2023, 23, 3678.
- Guan, T.; Wang, C.H. Registration based on scene recognition and natural features tracking techniques for wide-area augmented reality systems. IEEE Trans. Multimed. 2009, 11, 1393–1406.
- Pawar, P.G.; Devendran, V. Scene understanding: A survey to see the world at a single glance. In Proceedings of the 2019 2nd International Conference on Intelligent Communication and Computational Techniques (ICCT), Jaipur, India, 28–29 September 2019; IEEE: New York, NY, USA; pp. 182–186.
- Huang, N.; Liu, Y.; Zhang, Q.; Han, J. Joint cross-modal and unimodal features for RGB-D salient object detection. IEEE Trans. Multimed. 2020, 23, 2428–2441.
- Hua, Y.; Mou, L.; Lin, J.; Heidler, K.; Zhu, X.X. Aerial scene understanding in the wild: Multi-scene recognition via prototype-based memory networks. ISPRS J. Photogramm. Remote Sens. 2021, 177, 89–102.
- Petrovska, B.; Atanasova-Pacemska, T.; Corizzo, R.; Mignone, P.; Lameski, P.; Zdravevski, E. Aerial scene classification through fine-tuning with adaptive learning rates and label smoothing. Appl. Sci. 2020, 10, 5792.
- Wang, X.; Yuan, L.; Xu, H.; Wen, X. CSDS: End-to-end aerial scenes classification with depthwise separable convolution and an attention mechanism. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10484–10499.
- Zhao, X.; Zhang, J.; Tian, J.; Zhuo, L.; Zhang, J. Residual dense network based on channel-spatial attention for the scene classification of a high-resolution remote sensing image. Remote Sens. 2020, 12, 1887.
- Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516.
- Wang, H.; Yu, Y. Deep feature fusion for high-resolution aerial scene classification. Neural Process. Lett. 2020, 51, 853–865.
- Wu, H.; Xu, C.; Liu, H. S-MAT: Semantic-Driven Masked Attention Transformer for Multi-Label Aerial Image Classification. Sensors 2022, 22, 5433.
- Pires de Lima, R.; Marfurt, K. Convolutional neural network for remote-sensing scene classification: Transfer learning analysis. Remote Sens. 2019, 12, 86.
- Sharma, T.; Debaque, B.; Duclos, N.; Chehri, A.; Kinder, B.; Fortier, P. Deep learning-based object detection and scene perception under bad weather conditions. Electronics 2022, 11, 563.
- Wang, S.; Yao, S.; Niu, K.; Dong, C.; Qin, C.; Zhuang, H. Intelligent scene recognition based on deep learning. IEEE Access 2021, 9, 24984–24993.
- Afif, M.; Ayachi, R.; Said, Y.; Atri, M. Deep learning based application for indoor scene recognition. Neural Process. Lett. 2020, 51, 2827–2837.
- Dhanaraj, M.; Sharma, M.; Sarkar, T.; Karnam, S.; Chachlakis, D.; Ptucha, R.; Markopoulos, P.P.; Saber, E. Vehicle detection from multi-modal aerial imagery using YOLOv3 with mid-level fusion. In Proceedings of the Big Data II: Learning, Analytics, and Applications, Online, 15 May 2020; SPIE: New York, NY, USA; Volume 11395, pp. 22–32.
- Shahzad, H.M.; Bhatti, S.M.; Jaffar, A.; Rashid, M.; Akram, S. Multi-Modal CNN Features Fusion for Emotion Recognition: A Modified Xception Model. IEEE Access 2023, 11, 94281–94289.
- Xu, H.; Huang, C.; Huang, X.; Huang, M. Multi-modal multi-concept-based deep neural network for automatic image annotation. Multimed. Tools Appl. 2019, 78, 30651–30675.
- Doquire, G.; Verleysen, M. Mutual information-based feature selection for multilabel classification. Neurocomputing 2013, 122, 148–155.
- Hu, Q.; Zhang, L.; Zhang, D.; Pan, W.; An, S.; Pedrycz, W. Measuring relevance between discrete and continuous features based on neighborhood mutual information. Expert Syst. Appl. 2011, 38, 10737–10750.
- Liu, X.; Wang, S.; Lu, S.; Yin, Z.; Li, X.; Yin, L.; Tian, J.; Zheng, W. Adapting Feature Selection Algorithms for the Classification of Chinese Texts. Systems 2023, 11, 483.
- Lu, S.; Ding, Y.; Liu, M.; Yin, Z.; Yin, L.; Zheng, W. Multiscale feature extraction and fusion of image and text in VQA. Int. J. Comput. Intell. Syst. 2023, 16, 54.
- Nitisha. Scene Classification. 2018. Available online: https://www.kaggle.com/datasets/nitishabharathi/scene-classification (accessed on 17 September 2023).
- JayChen. AID: A Scene Classification Dataset. 2022. Available online: https://www.kaggle.com/datasets/jiayuanchengala/aid-scene-classification-datasets (accessed on 17 September 2023).
- Manning, C.D. An Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2009.
No. | Name | Type | Activations (Output Size) | Learnables |
---|---|---|---|---|
1 | Scene Images 200 × 200 × 3 | Image Input | 200 × 200 × 3 | - |
2 | Conv_1 128 3 × 3 × 3 convolutions with stride [1 1] and padding ‘same’ | Convolution | 200 × 200 × 128 | Weights 3 × 3 × 3 × 128 Bias 1 × 1 × 128 |
3 | Maxpool_1 2 × 2 with padding ‘same’ | Max Pooling | 200 × 200 × 128 | - |
4 | Relu_1 ReLU | ReLU | 200 × 200 × 128 | - |
5 | Conv_3 64 3 × 3 × 128 convolutions with stride [1 1] and padding ‘same’ | Convolution | 200 × 200 × 64 | Weights 3 × 3 × 128 × 64 Bias 1 × 1 × 64 |
6 | Maxpool_3 2 × 2 with padding ‘same’ | Max Pooling | 200 × 200 × 64 | - |
7 | Relu_3 ReLU | ReLU | 200 × 200 × 64 | - |
8 | Conv_5 32 3 × 3 × 64 convolutions with stride [1 1] and padding ‘same’ | Convolution | 200 × 200 × 32 | Weights 3 × 3 × 64 × 32 Bias 1 × 1 × 32 |
9 | Maxpool_5 2 × 2 with padding ‘same’ | Max Pooling | 200 × 200 × 32 | - |
10 | Relu_5 ReLU | ReLU | 200 × 200 × 32 | - |
11 | Fc_1 1024 | Fully Connected | 1 × 1 × 1024 | Weights 1024 × 1,280,000 Bias 1024 × 1 |
12 | Conv_2 128 3 × 3 × 3 convolutions with stride [1 1] and padding ‘same’ | Convolution | 200 × 200 × 128 | Weights 3 × 3 × 3 × 128 Bias 1 × 1 × 128 |
13 | Maxpool_2 2 × 2 with padding ‘same’ | Max Pooling | 200 × 200 × 128 | - |
14 | Relu_2 ReLU | ReLU | 200 × 200 × 128 | - |
15 | Conv_4 64 3 × 3 × 128 convolutions with stride [1 1] and padding ‘same’ | Convolution | 200 × 200 × 64 | Weights 3 × 3 × 128 × 64 Bias 1 × 1 × 64 |
16 | Maxpool_4 2 × 2 with padding ‘same’ | Max Pooling | 200 × 200 × 64 | - |
17 | Relu_4 ReLU | ReLU | 200 × 200 × 64 | - |
18 | Conv_6 32 3 × 3 × 64 convolutions with stride [1 1] and padding ‘same’ | Convolution | 200 × 200 × 32 | Weights 3 × 3 × 64 × 32 Bias 1 × 1 × 32 |
19 | Maxpool_6 2 × 2 with padding ‘same’ | Max Pooling | 200 × 200 × 32 | - |
20 | Relu_6 ReLU | ReLU | 200 × 200 × 32 | - |
21 | Fc_2 1024 | Fully Connected | 1 × 1 × 1024 | Weights 1024 × 1,280,000 Bias 1024 × 1 |
22 | Addition Element-wise addition of 2 inputs | Addition | 1 × 1 × 1024 | - |
23 | Relu_7 ReLU | ReLU | 1 × 1 × 1024 | - |
24 | Feature Selection Layer 600 fully connected | Fully Connected | 1 × 1 × 600 | Weights 600 × 1024 Bias 600 × 1 |
25 | Relu_8 ReLU | ReLU | 1 × 1 × 600 | - |
26 | Dropout 50% dropout | Dropout | 1 × 1 × 600 | - |
27 | Fc_3 6 | Fully Connected | 1 × 1 × 6 | Weights 6 × 600 Bias 6 × 1 |
28 | Softmax | Softmax | 1 × 1 × 6 | - |
29 | Classoutput crossentropyex with ‘Buildings’ and 5 other classes | Classification output | - | - |
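To visualize how these layers connect, the following is a minimal Keras sketch of the two-branch layout in the table; it is not the authors' released code. The layer names, the stride-1 pooling (inferred because the spatial size is unchanged after each pooling row), and the use of a 600-unit dense layer to stand in for the MI feature selection layer are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_branch(x, prefix):
    """One branch of the table: three Conv -> MaxPool -> ReLU stages, then a 1024-D FC."""
    for i, filters in enumerate((128, 64, 32), start=1):
        x = layers.Conv2D(filters, 3, strides=1, padding="same",
                          name=f"{prefix}_conv_{i}")(x)
        # The table keeps the 200 x 200 spatial size after pooling, so stride 1 is assumed.
        x = layers.MaxPooling2D(pool_size=2, strides=1, padding="same",
                                name=f"{prefix}_maxpool_{i}")(x)
        x = layers.ReLU(name=f"{prefix}_relu_{i}")(x)
    x = layers.Flatten(name=f"{prefix}_flatten")(x)
    # 200*200*32 = 1,280,000 inputs -> 1024 units, matching the large FC in the table.
    return layers.Dense(1024, name=f"{prefix}_fc")(x)

inputs = layers.Input(shape=(200, 200, 3), name="scene_image")
fused = layers.Add(name="addition")([conv_branch(inputs, "a"), conv_branch(inputs, "b")])
x = layers.ReLU(name="relu_7")(fused)
x = layers.Dense(600, name="feature_selection_layer")(x)  # stands in for MI selection
x = layers.ReLU(name="relu_8")(x)
x = layers.Dropout(0.5, name="dropout")(x)
outputs = layers.Dense(6, activation="softmax", name="fc_3")(x)

model = Model(inputs, outputs, name="two_branch_mi_fusion")
# model.summary()  # prints a layer listing comparable to the table above
```

Note that the flattened 1,280,000-dimensional activations make each branch's 1024-unit fully connected layer by far the largest learnable block in the model, consistent with the weight shapes reported in the table.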
Hyperparameter | Value |
---|---|
Optimizer | Adam |
Mini-Batch Size | 4 |
Kernel Size | 3 × 3 for Convolution 2 × 2 for Max pooling |
Number of kernels | Conv_1 and Conv_2 = 128 Conv_3 and Conv_4 = 64 Conv_5 and Conv_6 = 32 |
Number of Nodes in FC layers | Fc_1 and Fc_2 = 1024 Fc_3 = 6 |
Maximum Epochs | 30 |
Initial Learning Rate | 3 × 10^−4
Learning Rate Schedule | Piecewise |
Learning Rate Drop Factor | 0.5 |
Learning Rate Drop Period | 5 |
Data Shuffling | Every Epoch |
Validation Frequency | Every 87 iterations
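To show how these hyperparameters could be wired together, here is a hedged Keras sketch that reuses the model built in the architecture sketch above; the placeholder data arrays, the sparse categorical cross-entropy loss, and the step-decay callback are assumptions rather than the authors' training script.

```python
import numpy as np
import tensorflow as tf

def piecewise_lr(epoch, lr):
    """Piecewise schedule from the table: start at 3e-4 and halve the rate
    every 5 epochs (drop factor 0.5, drop period 5)."""
    return 3e-4 * (0.5 ** (epoch // 5))

# Placeholder arrays only so the sketch runs end to end; replace with real data.
train_images = np.random.rand(8, 200, 200, 3).astype("float32")
train_labels = np.random.randint(0, 6, size=8)
val_images, val_labels = train_images[:4], train_labels[:4]

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
              loss="sparse_categorical_crossentropy",   # 6-class cross-entropy output
              metrics=["accuracy"])

model.fit(train_images, train_labels,
          batch_size=4,                                  # mini-batch size
          epochs=30,                                     # maximum epochs
          shuffle=True,                                  # shuffle every epoch
          validation_data=(val_images, val_labels),
          # the table's validation frequency (87) is counted in iterations (MATLAB-style)
          callbacks=[tf.keras.callbacks.LearningRateScheduler(piecewise_lr)])
```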
Database | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
---|---|---|---|---|
Scene Classification [28] | 100 | 100 | 100 | 100 |
AID [29] | 98.83 | 98.83 | 98.83 | 98.83 |
Ref./Author | Methods | Dataset | Performance |
---|---|---|---|
Hua et al. [10] | Multi-scene recognition network | AID | Precision = 64.03% Recall = 52.30% F1 Score = 52.39% |
Petrovska et al. [11] | Pre-trained deep models + SVM | AID | Overall accuracy = 93.58% |
Wang et al. [12] | CSDS model | AID | Overall accuracy = 94.29% |
Zhao et al. [13] | Residual dense network | AID | Accuracy = 99% |
Bazi et al. [14] | Vision Transformers | AID | Best Accuracy = 95.51% |
Wang and Yu [15] | Deep learning | AID | Mean Accuracy = 93.70% |
Wu et al. [16] | S-MAT (Semantic-Driven Masked Attention Transformer) | AID | Accuracy = 90.90%
Pires de Lima and Marfurt [17] | CNN | AID | Accuracy = 94.10%
Ours (proposed) | CNN + MI | AID | Accuracy = 98.83% Precision = 98.83% Recall = 98.83% F1 Score = 98.83%