Article

Multimodal Natural Disaster Scene Recognition with Integrated Large Model and Mamba

College of Science, North China University of Technology, Beijing 100144, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1149; https://doi.org/10.3390/app15031149
Submission received: 4 December 2024 / Revised: 16 January 2025 / Accepted: 22 January 2025 / Published: 23 January 2025
(This article belongs to the Special Issue Deep Learning for Image Processing and Computer Vision)

Abstract

The accurate identification of natural disasters is crucial in ensuring effective post-disaster relief efforts. However, the existing models for disaster classification often incur high costs. To address this, we propose leveraging the most advanced pre-trained large language models, which offer superior generative and multimodal understanding capabilities. Using a question-answering approach, we extract textual descriptions and category prediction probabilities for disaster scenarios, which are then used as input to our proposed Mamba Multimodal Disaster Recognition Network (Mamba-MDRNet). This model integrates a large pre-trained model with the Mamba mechanism, enabling the selection of the most reliable modality information as a robust basis for scene classification. Extensive experiments demonstrate consistent performance improvements across various visual models with heterogeneous architectures. Notably, integrating EfficientNet within Mamba-MDRNet yielded 97.82% accuracy for natural disaster scene classification, surpassing Mamba-MDRNet variants built on a CNN (91.75%), ViT (94.50%), and ResNet18 (97.25%). These results highlight the potential of multimodal models combining large models and the Mamba mechanism for disaster type prediction.

1. Introduction

The sudden and uncontrollable nature of natural disasters makes them a significant obstacle to human development [1]. According to the global disaster database EM-DAT, maintained by KU Leuven University in Belgium, 366 major natural disasters occurred worldwide in 2023, resulting in 82,151 fatalities, approximately 79.39 million injuries, and economic losses totaling USD 159.8 billion. These events not only pose immediate threats to human life and cause significant economic damage but also trigger secondary disasters that can devastate ecosystems and create long-term public health risks. Therefore, the timely and accurate identification of disaster scenarios is crucial in preventing secondary disasters and aiding post-disaster relief efforts. However, obtaining such information during emergencies is labor-intensive and costly, often requiring manual data processing and expert evaluation [2]. To alleviate this, computer vision techniques have been applied to satellite images, synthetic aperture radar, and other remote sensing data [3,4,5]. Unfortunately, these approaches remain costly to deploy and lack robustness in time-critical situations.
Large-scale pre-trained models offer an effective solution. Through pre-training and fine-tuning paradigms, large language models (LLMs) [6] excel in a variety of tasks, such as sentiment analysis [7], image recognition [8], text completion [9], and cross-lingual translation [10]. In 2024, Google DeepMind introduced Gemini 1.5, a high-performance multimodal model series that incorporates state-of-the-art innovations in sparse and dense scaling, training infrastructure, and distillation, achieving breakthroughs in efficiency, reasoning, planning, multilingual processing, function calling, and long-context performance [11]. Gemini 1.5 also surpasses contemporary LLMs in recall and reasoning capabilities. However, the high computational costs associated with training GPT-4-level models make it nearly impossible for small enterprises to train such models directly for their business needs. In contrast, leveraging pre-trained large models can significantly reduce the operational costs. Additionally, due to the inherent limitations in the generative capabilities of chatbot models, they face challenges in rapidly recognizing images from disaster scenarios. Traditional visual models, such as ResNet and EfficientNet, remain superior for tasks like image classification and visual perception [12]. Therefore, utilizing pre-trained LLMs to enhance disaster prediction performance appears to be a more practical and cost-effective approach.
In traditional supervised training paradigms, vision models learn visual representations solely from pixel data and labels, neglecting information from other modalities. In this paper, we propose a novel learning framework, the Mamba Multimodal Disaster Recognition Network (Mamba-MDRNet), inspired by the GPT4Image framework of Ding et al. [13]. Our approach leverages knowledge from pre-trained large models to enhance the multimodal learning capabilities of conventional vision models, thereby improving the performance in perception tasks such as image classification. Specifically, we first utilize the multimodal understanding of pre-trained LLMs through dialog to generate detailed, high-quality descriptions and predicted probabilities for each image in the dataset. These probabilities sum to one across all classes. We then process the descriptive text using a convolutional neural network (CNN) [14] to extract text feature vectors, and the predicted probabilities are transformed into probabilistic feature vectors using a multi-layer perceptron (MLP) [15].
During training, in addition to supervision from ground-truth labels, we employ a unified attention mechanism to fuse multimodal data—images, text, and probabilities—enhancing the model’s ability to capture relationships across modalities. This enables the transfer of the image understanding knowledge from LLMs to traditional visual perception models. Furthermore, as Mamba outperforms Transformer in handling long sequences, sparse data, and computational efficiency, we use the Mamba selective state-space model to process the stacked vectors of images, text, and probabilities, further improving the model accuracy.
We validate the effectiveness of the proposed Mamba-MDRNet algorithm on a natural disaster scene dataset. Extensive experiments demonstrate that our method outperforms traditional vision models. The natural disaster dataset and code for Mamba-MDRNet are available at https://github.com/22Shao/Mamba-MDRNet (accessed on 3 December 2024).
In this study, we present our primary contributions as follows.
1. We enhance multimodal models by incorporating the predicted probabilities from large models as an additional modality.
2. We integrate the Mamba selective state-space model into the multimodal framework to enable the model to select the most relevant modality information.
3. We propose the Mamba-MDRNet framework and demonstrate its superior performance in detecting natural disaster data compared to traditional visual models.

2. Related Work

2.1. Large-Scale Pre-Trained Models

Prior to the rise of large-scale pre-trained models, natural language processing (NLP) and computer vision tasks primarily relied on feature engineering and traditional machine learning methods. These approaches required the manual design of features by experts, such as Bag of Words (BoW) [16] and TF-IDF [17] for text processing and SIFT [18] and HOG [19] for image processing. While these techniques achieved some early success, they were heavily dependent on hand-crafted features, limiting their scalability to more complex tasks. The advent of recurrent neural networks (RNNs) and their enhanced variants, long short-term memory (LSTM) networks, enabled the automatic learning of temporal relationships and contextual information in sequences, advancing the capability to handle complex tasks. However, these models still faced limitations related to long-term dependencies and training efficiency [20].
In 2017, Google introduced the Transformer architecture, which leveraged self-attention mechanisms to overcome the sequential processing constraints of RNNs [21]. This breakthrough enabled the efficient parallel processing of input data, laying the foundation for the development of large-scale pre-trained models. Today, pre-trained models can be broadly classified into large language models (LLMs) [6] and vision language models (VLMs) [22].
LLMs are designed specifically for NLP tasks and are pre-trained on vast amounts of text data to generate high-quality language representations, which can then be fine-tuned for a wide range of downstream tasks. Common examples of LLMs include the GPT series [23] and the Text-To-Text Transfer Transformer (T5) [24]. VLMs, on the other hand, are capable of processing both visual and linguistic information, making them ideal for cross-modal tasks such as image captioning and visual question answering (VQA). Notable VLMs include DALL·E [25], BLIP (Bootstrapping Language-Image Pretraining) [26], and Florence [27].
Both LLMs and VLMs have demonstrated remarkable success across pre-training and downstream tasks and are widely applied in text generation, image synthesis, and VQA. In this paper, we leverage the widely used large model Gemini 1.5 in a question-answering framework to generate detailed descriptive text and category probabilities for natural disaster images, thereby obtaining bimodal features.

2.2. Multimodal Tasks

Artificial intelligence (AI) was originally inspired by human perception, such as seeing, hearing, touching, and smelling. The human sensory system integrates information from these different senses to make more accurate judgments [28]. For instance, during conversations, we use our eyes to observe facial expressions and our ears to listen to speech, allowing us to better understand the dialog. Similarly, if an intelligent system can ingest, interpret, and reason about information from multiple modalities, it could achieve human-level perceptual abilities. Multimodal learning (MML) [29] is a general approach to building AI models that extract and correlate information from diverse modalities.
Early multimodal fusion techniques primarily relied on simple feature concatenation or manually designed rules, which limited the complexity of the cross-modal relationships that models could capture. With the advent of deep learning, multimodal techniques have developed rapidly. Neural networks, especially convolutional neural networks (CNNs) and recurrent neural networks (RNNs), facilitated the deeper fusion of multimodal information. At this stage, most multimodal fusion models concatenated feature representations from different modalities at specific layers, followed by fully connected layers for information fusion. Although this shallow fusion method achieved some success, it exhibited limitations in handling complex dependencies.
With the widespread application of Transformer models in NLP and computer vision, cross-modal learning has become a key development in the multimodal field. The attention mechanism within Transformer models naturally captures cross-modal dependencies, allowing for more effective information fusion. In this paper, we not only incorporate common image and text modalities but also introduce abstract modality information derived from class probabilities produced by large models. This abstract modality of class probabilities enhances the cross-modal consistency, simplifies the feature space, and improves the model’s generalization abilities. Furthermore, class probabilities serve as a unified semantic representation, helping the model to better align features from different modalities and facilitating knowledge sharing in attention mechanisms and multi-task learning.

2.3. Mamba Selective State-Space Model

The state-space model (SSM) [30] is a statistical model used to represent dynamic systems, describing the evolution of a system’s internal hidden states over time. The objective of SSMs is to infer these hidden dynamic states based on observable data. SSMs typically consist of two fundamental equations.
State update equation:
$h_t = f(h_{t-1}, u_t, \theta) + \epsilon_t$
where $h_t$ is the hidden state, $u_t$ represents the input, $\theta$ denotes the model parameters, and $\epsilon_t$ is a noise term.
Observation equation:
$y_t = g(h_t, \theta) + \eta_t$
where $y_t$ is the observable output, $h_t$ is the hidden state, and $\eta_t$ is a noise term.
The key innovation of the Mamba model [31] lies in the introduction of a selective mechanism into the state-space model, enabling the system to dynamically focus on important states. This selective mechanism can be formalized as
$s_t = \mathrm{Select}(h_t, h_{t-1}, h_{t-2}, \ldots, h_0)$
Unlike standard SSMs that sequentially update hidden states, Mamba’s selective mechanism allows the model to update information based on multiple past states, rather than relying solely on the immediate previous state. The updated state can be expressed as
$h_t = \mathrm{Update}(h_{t-1}, u_t, \theta, \mathrm{Select}(h_t))$
This selective update mechanism enables Mamba, compared to Transformer models, to efficiently process data over longer time spans while maintaining temporal continuity and dependencies [32]. It also enhances the information extraction efficiency. In this work, we incorporate the Mamba model into multimodal tasks, leveraging Mamba’s selective mechanism to dynamically focus on relevant multimodal features. This improves the quality of information fusion, avoiding redundant updates and irrelevant information accumulation. Given the complexity of the scenarios addressed, Mamba’s capability to handle long-term dependencies across modalities optimizes the computational resources, significantly reducing the complexity and enabling the rapid identification of natural disaster scenarios, aiding in predictive disaster relief planning.
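To make the selective recurrence concrete, the toy PyTorch sketch below instantiates a linear state-space update in which the input and readout parameters are generated from the current input $u_t$, which is the selection idea behind Mamba; the module name, dimensions, and parameterization are illustrative assumptions, not the Mamba implementation.

```python
import torch
import torch.nn as nn


class ToySelectiveSSM(nn.Module):
    """Toy linear state-space recurrence with input-dependent (selective) parameters.

    Illustrative only: the input matrix B(u_t) and readout C(u_t) are produced
    from the current input, while the state transition A stays fixed.
    """

    def __init__(self, d_input: int, d_state: int):
        super().__init__()
        self.A = nn.Parameter(0.01 * torch.randn(d_state, d_state))  # fixed state transition
        self.in_proj = nn.Linear(d_input, d_state)  # projects u_t into the state space
        self.to_B = nn.Linear(d_input, d_state)     # selective input gate B(u_t)
        self.to_C = nn.Linear(d_input, d_state)     # selective readout C(u_t)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, seq_len, d_input)
        batch, seq_len, _ = u.shape
        h = torch.zeros(batch, self.A.shape[0], device=u.device)
        outputs = []
        for t in range(seq_len):
            u_t = u[:, t]
            h = h @ self.A.T + self.to_B(u_t) * self.in_proj(u_t)  # selective state update
            y_t = (self.to_C(u_t) * h).sum(dim=-1, keepdim=True)   # selective observation
            outputs.append(y_t)
        return torch.cat(outputs, dim=-1)  # (batch, seq_len)


# Example: two sequences of length 3 with 11-dimensional inputs (one entry per class).
y = ToySelectiveSSM(d_input=11, d_state=16)(torch.randn(2, 3, 11))
print(y.shape)  # torch.Size([2, 3])
```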

3. Methods

In this section, we first outline the overall design of the proposed training framework, Mamba-MDRNet. Then, we describe the process of generating image descriptions and class probability vectors using a pre-trained large model. Finally, we explain how these textual descriptions and probability vectors are used as supervisory signals during the training phase to help the visual model to learn more effective representations, enabling the faster and more accurate identification of natural disaster scenes in images.

3.1. Overall Framework

Conventional visual models typically establish a direct connection between ground-truth labels and raw image features. However, previous research has demonstrated that learning from multimodal information can significantly enhance models’ performance, as seen in CLIP (A. Radford, et al., 2021) [33] and UNITER (Y.-C. Chen et al., 2020) [34], both of which employ unified multimodal representations to tackle vision and language tasks. To leverage diverse modalities for image understanding, we utilize a popular pre-trained large model capable of comprehending and generating multimodal information, thereby incorporating new modalities beyond raw pixel data.
Firstly, images and associated generation prompts are fed into Gemini 1.5, which employs a question-answering approach to generate detailed textual descriptions and probability vectors for class predictions for each image in the dataset. Image features $O_I$ are extracted using the image processing module (EfficientNet), text features $O_T$ are derived via the text processing module (TextCNN), and probability features $O_V$ are obtained through the vector processing module (VecBlock). These three modality-specific features are concatenated along dimension 1 to form a multimodal feature tensor $O$, which is subsequently aligned using the Mamba mechanism. The aligned multimodal tensor is flattened into a one-dimensional vector and concatenated with the original image features $O_I$ to preserve image detail information. Finally, the fused features pass through a two-layer fully connected network, undergoing dimensionality reduction, activation, and classification to output predictions across 11 categories. Figure 1 presents the overall architecture of the proposed Mamba-MDRNet learning algorithm.

3.2. Image Descriptions and Class Probabilities Generated by LLMs

In this section, we present the process of generating detailed image descriptions and class probabilities using pre-trained language models, with a focus on Google’s Gemini 1.5. These multimodal LLMs are capable of processing both images and prompts as input, producing corresponding outputs accordingly.
First, we input images of natural disasters into the model along with the prompt “Please describe the content of this picture in detail”. This allows the model to generate a detailed textual description of the natural disaster scene, accurately identifying the event and providing information on whether a human presence is detected. Next, we use the prompt, “Use pictures to judge natural phenomena. There are eleven picture categories: hail, snow, earthquake, rain, flood, wildfire, hurricane, lightning, sandstorm, frost, and haze. The output result is in probability distribution format”. The model then produces a probability distribution representing the likelihood of the image corresponding to each of the common natural disaster categories.
By utilizing these two types of prompts, we can generate two forms of modal information for the training images: one in the form of a detailed textual description and the other as a probability distribution of common natural disasters. Figure 2 provides a specific example of this process.
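The following sketch shows how these two prompts could be issued programmatically. It assumes the google-generativeai Python SDK, a placeholder API key, and the "gemini-1.5-pro" model identifier; only the two prompts quoted above are taken from the paper.

```python
# Assumes the google-generativeai SDK (pip install google-generativeai);
# the API key, file name, and "gemini-1.5-pro" identifier are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")
image = Image.open("disaster_example.jpg")

# Prompt 1: detailed textual description (text modality).
description = model.generate_content(
    ["Please describe the content of this picture in detail", image]
).text

# Prompt 2: probability distribution over the 11 disaster categories (probability modality).
probability_prompt = (
    "Use pictures to judge natural phenomena. There are eleven picture categories: "
    "hail, snow, earthquake, rain, flood, wildfire, hurricane, lightning, sandstorm, "
    "frost, and haze. The output result is in probability distribution format"
)
probabilities = model.generate_content([probability_prompt, image]).text

print(description)
print(probabilities)  # parsed downstream into an 11-dimensional probability vector
```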

3.3. Mamba Multimodal Disaster Recognition Network (Mamba-MDRNet)

After utilizing the LLM to generate the required textual and conceptual information, we now explain how this information is used to enhance the model’s performance and provide detailed information about the Mamba-MDRNet network framework.

3.3.1. Feature Extraction Module

1. Image Processing Module (EfficientNet)
This module is primarily designed to extract feature information from images. In our experiments, we explored several visual models for feature extraction, using EfficientNet as a representative example to demonstrate how the module captures features from the images. Specifically, we leveraged the pre-trained EfficientNet-B0 network, utilizing the weights trained on the ImageNet dataset. The input to EfficientNet is image data $I \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the height and width of the image, and $C$ denotes the number of channels. In the experiments, we set the default image dimensions as $H = 224$, $W = 224$, and $C = 3$. EfficientNet-B0 generates feature maps through convolution operations, which can be defined as follows:
$X_I = \mathrm{Conv}(I) \in \mathbb{R}^{N \times D}$
Here, $N$ represents the number of feature maps, and $D$ is the embedding dimension. EfficientNet's network architecture uses a compound scaling method to expand the model's depth, width, and resolution in tandem, enhancing its representational capacity while maintaining computational efficiency. During this process, convolutions extract local features from the image and embed them as high-dimensional feature vectors.
After the convolution operation, the feature map $X_I$ is passed through a global average pooling (GAP) layer for feature aggregation, resulting in a global embedding vector $H_I$:
$H_I = \mathrm{GAP}(X_I) \in \mathbb{R}^{D_E}$
where $D_E = 1280$ denotes the final embedding dimension of EfficientNet-B0. This embedding vector represents the global semantic information of the input image. Subsequently, $H_I$ is further mapped to the output space through the network's fully connected layer, with the class scores defined as
$O_I = W_{\mathrm{cls}} \cdot H_I^{\top} + b_{\mathrm{cls}}, \quad O_I \in \mathbb{R}^{n_{\mathrm{class}}}$
where the weight matrix $W_{\mathrm{cls}} \in \mathbb{R}^{11 \times 1280}$, the bias vector $b_{\mathrm{cls}} \in \mathbb{R}^{11}$, and $n_{\mathrm{class}} = 11$. Here, $O_I$ represents the output results for the classification task.
In the designed image feature extraction module, we employ EfficientNet-B0, modifying only its final layer to output predictions for 11 classes.
2. Text Processing Module (TextCNN)
The text processing module utilizes a CNN for feature extraction. The input text is first represented through an embedding layer. Let $V$ denote the vocabulary size of the input text and $d_{\mathrm{emb}}$ represent the embedding dimension. The embedded text can be expressed as
$X_T = \mathrm{Embedding}(T) \in \mathbb{R}^{L \times d_{\mathrm{emb}}}$
where $L$ is the maximum length of the input text. During the experiments, the embedding dimension $d_{\mathrm{emb}}$ was set to 100.
Next, the embedded text features $X_T$ are processed through multiple 1D convolutional kernels of varying sizes to extract local features. The output for each convolutional layer is as follows:
$C_i = \mathrm{ReLU}(\mathrm{Conv1D}(X_T, k_i)) \in \mathbb{R}^{L \times n_i}$
Global features for each kernel are obtained through max-pooling:
$P_i = \mathrm{MaxPool}(C_i) \in \mathbb{R}^{n_i}$
In the designed model, the kernel sizes of the convolutional layers are set to $k_i \in \{3, 4, 5\}$, and the number of convolutional kernels is defined as $n_i = 150$.
The outputs from all convolutional layers are concatenated to form the final text feature representation:
$H_T = \mathrm{Concat}(P_1, P_2, P_3) \in \mathbb{R}^{3 n_i}$
To prevent overfitting, $H_T$ undergoes dropout regularization before being fed into the fully connected layer:
$H_T = \mathrm{Dropout}(H_T, p = 0.1)$
where the dropout rate $p$ is set to 0.1 in our experiments.
The concatenated text features are passed through a fully connected layer, mapping them to the output space of the classification task. The classification output is defined as
$O_T = W_{\mathrm{out}} \cdot H_T^{\top} + b_{\mathrm{out}}, \quad O_T \in \mathbb{R}^{n_{\mathrm{class}}}$
In our experiments, the number of classes for the classification task is set to 11. Consequently, the weight matrix of the fully connected layer is configured as $W_{\mathrm{out}} \in \mathbb{R}^{11 \times 450}$, the bias vector is set as $b_{\mathrm{out}} \in \mathbb{R}^{11}$, and $n_{\mathrm{class}} = 11$.
3. Vector Processing Module (VecBlock)
The vector processing module is designed to extract feature information from probability vectors and acts as a multilayer perceptron model. It employs a simple two-layer fully connected neural network to transform and process the input probability vector. The input vector dimension is defined as $d_V = 11$ in our experiments.
The input vector first passes through a fully connected layer, where the ReLU activation function is applied for nonlinear transformation, mapping the vector to a higher-dimensional feature space. This process is defined as
$H_V = \mathrm{ReLU}(W_1 \cdot V + b_{V1}) \in \mathbb{R}^{n_1}$
where $W_1 \in \mathbb{R}^{n_1 \times d_V}$ is the weight matrix of the first fully connected layer, $b_{V1} \in \mathbb{R}^{n_1}$ represents the bias vector, and the output dimension is set to $n_1 = 120$ in our experiments.
The output of the first layer is then passed through a second fully connected layer, where another ReLU activation function is applied. This process maps the features to the final output space and is defined as
$O_V = \mathrm{ReLU}(W_2 \cdot H_V + b_{V2}) \in \mathbb{R}^{n_2}$
where $W_2 \in \mathbb{R}^{n_2 \times n_1}$ is the weight matrix of the second fully connected layer, and $b_{V2} \in \mathbb{R}^{n_2}$ is the corresponding bias vector. In our experiments, the final output dimension is set to $n_2 = 11$ to match the required feature dimension for classification tasks.
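Under the dimensions stated above (11 output classes, a 100-dimensional embedding, kernel sizes {3, 4, 5} with 150 filters each, dropout of 0.1, and an 11→120→11 MLP), the three feature extraction modules can be sketched in PyTorch as follows. This is a schematic reconstruction from the text, not the released implementation; the class names and the torchvision weights argument are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 11


class ImageBlock(nn.Module):
    """EfficientNet-B0 backbone with its final layer replaced to output 11 classes."""

    def __init__(self):
        super().__init__()
        self.net = models.efficientnet_b0(
            weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1
        )
        self.net.classifier[1] = nn.Linear(1280, NUM_CLASSES)  # D_E = 1280 -> 11

    def forward(self, images):                    # images: (B, 3, 224, 224)
        return self.net(images)                   # O_I: (B, 11)


class TextCNN(nn.Module):
    """1D CNN over token embeddings with kernel sizes {3, 4, 5} and 150 filters each."""

    def __init__(self, vocab_size, emb_dim=100, n_filters=150, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(p=0.1)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), NUM_CLASSES)  # 450 -> 11

    def forward(self, tokens):                    # tokens: (B, L) integer ids
        x = self.embedding(tokens).transpose(1, 2)                        # (B, emb_dim, L)
        pooled = [torch.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=-1))                       # H_T: (B, 450)
        return self.fc(h)                                                 # O_T: (B, 11)


class VecBlock(nn.Module):
    """Two-layer MLP over the 11-dimensional probability vector from the LLM."""

    def __init__(self, d_in=11, d_hidden=120, d_out=11):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out), nn.ReLU(),
        )

    def forward(self, probs):                     # probs: (B, 11)
        return self.mlp(probs)                    # O_V: (B, 11)
```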

3.3.2. Feature Fusion and the Mamba Mechanism

The features extracted from the three modalities, $O_I$, $O_T$, and $O_V$, are concatenated sequentially along the feature dimension to form the fused features:
$O = \mathrm{Stack}(O_I, O_T, O_V) \in \mathbb{R}^{3n}$
The Mamba mechanism performs computations using the state-space model (SSM). Given the input fused features $O$, this results in the following output:
$O_{\mathrm{Mamba}} = \mathrm{Mamba}(O) \in \mathbb{R}^{3n}$
Subsequently, the output features from the Mamba mechanism are flattened into a one-dimensional vector:
$O_{\mathrm{flat}} = \mathrm{Flatten}(O_{\mathrm{Mamba}}) \in \mathbb{R}^{3n}$
The flattened features are then concatenated with the image features $O_I$, forming the final fused features:
$O_{\mathrm{concat}} = \mathrm{Concat}([O_{\mathrm{flat}}, O_I], \mathrm{dim}=1) \in \mathbb{R}^{4n}$
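A minimal sketch of this fusion step is given below, assuming the mamba-ssm package (which requires a CUDA GPU); the Mamba hyperparameters d_state, d_conv, and expand are illustrative, since the paper does not report them.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency: pip install mamba-ssm (requires CUDA)


class MambaFusion(nn.Module):
    """Stack O_I, O_T, O_V as a length-3 sequence, pass it through Mamba,
    then flatten and re-attach the raw image features O_I."""

    def __init__(self, n: int = 11):
        super().__init__()
        # d_state / d_conv / expand are illustrative; the paper does not report them.
        self.mamba = Mamba(d_model=n, d_state=16, d_conv=4, expand=2)

    def forward(self, o_i, o_t, o_v):               # each: (B, n)
        o = torch.stack([o_i, o_t, o_v], dim=1)     # (B, 3, n): the fused tensor O
        o_mamba = self.mamba(o)                     # (B, 3, n): aligned by Mamba
        o_flat = o_mamba.flatten(start_dim=1)       # (B, 3n)
        return torch.cat([o_flat, o_i], dim=1)      # (B, 4n) = (B, 44) for n = 11
```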

3.3.3. Fully Connected Layer Classification

The concatenated features $O_{\mathrm{concat}} \in \mathbb{R}^{4n}$ (formed by the concatenation of the features described above) are passed through two fully connected layers for classification. The goal of the fully connected layers is to further map the fused features into the output space of the classification task.
In the first layer, the fully connected layer maps the concatenated features from the dimension $4n$ to a lower feature dimension $d_H$, defined as
$O_1 = \mathrm{ReLU}(W_{\mathrm{lin1}} \cdot H_{\mathrm{concat}} + b_{\mathrm{lin1}}) \in \mathbb{R}^{B \times d_H}$
where $H_{\mathrm{concat}} \in \mathbb{R}^{4n}$ represents the input concatenated features, $W_{\mathrm{lin1}} \in \mathbb{R}^{d_H \times 4n}$ is the weight matrix of the first fully connected layer, $b_{\mathrm{lin1}} \in \mathbb{R}^{d_H}$ is the bias vector of the first layer, and $d_H$ is the output feature dimension after this layer. $H \in \mathbb{R}^{d_H}$ represents the output features of the first layer.
In our experiments, the output dimension $d_H$ is set to 6, ensuring that essential information is preserved while reducing the parameter size to improve the computational efficiency.
The second layer maps the output features $H$ from the first layer to the class space of the classification task, producing the final unnormalized class scores. This is defined as
$O_{\mathrm{final}} = W_{\mathrm{lin2}} \cdot H + b_{\mathrm{lin2}} \in \mathbb{R}^{B \times n_{\mathrm{class}}}$
where $H \in \mathbb{R}^{B \times 6}$ represents the output of the first layer, $W_{\mathrm{lin2}} \in \mathbb{R}^{n_{\mathrm{class}} \times d_H}$ is the weight matrix of the second fully connected layer, $b_{\mathrm{lin2}} \in \mathbb{R}^{n_{\mathrm{class}}}$ is the bias vector, and $n_{\mathrm{class}}$ is the number of classes in the classification task (set to $n_{\mathrm{class}} = 11$ in this experiment). The unnormalized class scores $O_{\mathrm{final}} \in \mathbb{R}^{B \times n_{\mathrm{class}}}$ are transformed into class probability distributions using the softmax function:
$P(y_i \mid X) = \dfrac{\exp(o_i)}{\sum_{j=1}^{n_{\mathrm{class}}} \exp(o_j)}, \quad i \in \{1, 2, \ldots, n_{\mathrm{class}}\}$
where $P(y_i \mid X)$ denotes the predicted probability for the $i$-th class, and $X$ represents the input features.
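A corresponding sketch of the classification head under the stated dimensions ($4n = 44$, $d_H = 6$, $n_{\mathrm{class}} = 11$) follows; class and variable names are assumptions. During training, CrossEntropyLoss consumes the unnormalized scores directly, so the softmax is applied only when probabilities are reported.

```python
import torch
import torch.nn as nn


class ClassifierHead(nn.Module):
    """Two fully connected layers: 4n = 44 -> d_H = 6 -> n_class = 11."""

    def __init__(self, d_in: int = 44, d_hidden: int = 6, n_class: int = 11):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, n_class)

    def forward(self, o_concat: torch.Tensor) -> torch.Tensor:  # (B, 44)
        h = torch.relu(self.fc1(o_concat))                      # (B, 6)
        return self.fc2(h)                                      # O_final: (B, 11) logits


logits = ClassifierHead()(torch.randn(8, 44))
probs = torch.softmax(logits, dim=-1)   # P(y_i | X); each row sums to 1
```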

4. Experiments

4.1. Datasets

To evaluate the effectiveness of the proposed Mamba-MDRNet algorithm in natural disaster scene classification, we collected a dataset of 5277 images representing 11 types of natural disasters from the internet. Specifically, it includes 590 images of "hail", 608 images of "snowstorms", 437 images of "earthquakes", 497 images of "rainstorms", 466 images of "floods", 452 images of "fires", 590 images of "hurricanes", 355 images of "lightning", 441 images of "sandstorms", 399 images of "frost", and 442 images of "smog". The dataset is divided into training and testing subsets with a 7:3 ratio. Using an LLM, we generated detailed scene descriptions for each image, incorporating elements such as the environmental background, weather conditions, individuals or buildings in the scene, and the intensity and scale of the disaster, along with a class probability for each disaster type.
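A minimal sketch of loading such a dataset and performing the 7:3 split is shown below; the folder layout (one subdirectory per disaster class) and the fixed random seed are assumptions, and pairing each image with its LLM-generated description and probability vector would be handled separately.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical layout: disaster_images/<class_name>/*.jpg for the 11 categories.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
full_set = datasets.ImageFolder("disaster_images", transform=transform)

# 7:3 train/test split over the 5277 images, with a fixed seed for reproducibility.
n_train = int(0.7 * len(full_set))
train_set, test_set = torch.utils.data.random_split(
    full_set,
    [n_train, len(full_set) - n_train],
    generator=torch.Generator().manual_seed(0),
)
print(len(train_set), len(test_set))
```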

4.2. Implementation Details

In this study, all experiments were conducted on Google Colab Pro, utilizing an NVIDIA Tesla T4 GPU with 16 GB of VRAM. The computational environment was based on Ubuntu 18.04, equipped with 2 virtual CPU cores and 16 GB of system memory. The deep learning framework used was PyTorch 1.10.0, running on CUDA 11.1 with cuDNN 8.0 for GPU acceleration. All images were preprocessed by converting them to the JPG format and cropping them to a resolution of 224 × 224 .
During modal information extraction, the output of the final layer of the EfficientNet network for visual modality feature extraction was set to 11. For text modality feature extraction, convolutional kernels of sizes { 3 , 4 , 5 } were used, with each kernel having 150 filters. The fully connected layer’s output dimension for text modality features was set to 11. In the probability modality, the first fully connected layer had an input dimension of 11 and an output dimension of 120, while the second layer had an input dimension of 120 and an output dimension of 11. For the Mamba-MDRNet model, the fully connected classification layers were configured with an input dimension of 44 and an output dimension of 6 for the first layer, followed by an input dimension of 6 and an output dimension of 11 for the second layer.
The training process used a batch size of 8, CrossEntropyLoss as the loss function, and Adam optimization with a weight decay of $2 \times 10^{-5}$. The model was trained from scratch for 50 epochs. The classification accuracy of the model was calculated using the following formula:
$\mathrm{Accuracy} = \dfrac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}} \times 100\%$
We conducted experiments on the selected hyperparameters, and the detailed results are presented in the Experimental Results section.
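The training setup above can be sketched as follows; the model and dataset objects are the illustrative modules from Section 3 (their exact interfaces are assumptions), and the learning rate of $2 \times 10^{-5}$ is taken from the sensitivity study reported later in Table 6.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_and_evaluate(model, train_set, test_set, device="cuda"):
    """Train for 50 epochs with the settings from Section 4.2 and report accuracy.

    Assumes the datasets yield (image, token_ids, llm_probs, label) tuples and that
    model(images, tokens, probs) returns logits over the 11 classes.
    """
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    # Learning rate per the sensitivity study (Table 6); weight decay 2e-5 as in Section 4.2.
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=2e-5)
    train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=8)

    for epoch in range(50):
        model.train()
        for images, tokens, probs, labels in train_loader:
            images, tokens = images.to(device), tokens.to(device)
            probs, labels = probs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images, tokens, probs), labels)
            loss.backward()
            optimizer.step()

    # Accuracy = correct predictions / total samples * 100%
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, tokens, probs, labels in test_loader:
            logits = model(images.to(device), tokens.to(device), probs.to(device))
            correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total
```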

4.3. Experimental Results

We compared the performance of four commonly used computer vision models and architectures for image classification: a CNN [35], ViT [36], ResNet18 [37], and EfficientNet [38]. Specifically, we evaluated their individual performance, as well as their performance when integrated as the image processing module within the Mamba-MDRNet framework, on a natural disaster image dataset. The experimental results are presented in Table 1. Here, ↑ indicates an improvement in accuracy.
The experimental results demonstrate that the accuracy of the CNN, ViT, ResNet18, and EfficientNet improves when they are integrated as the image processing module within the Mamba-MDRNet framework compared to their standalone performance. The accuracy improvements are 18.40%, 25.77%, 2.24%, and 2.65%, respectively. Among these, the ViT model shows the largest improvement when integrated with Mamba-MDRNet, while ResNet18 exhibits the smallest improvement. Further comparison of the classification accuracy of the four models within the Mamba-MDRNet framework reveals that EfficientNet achieves the best classification performance, with an accuracy of 97.82%. To further evaluate the impact of multimodal architectures on the model performance, we conducted a series of experiments. These experiments involved both simpler constructions with only three modalities and more complex models combining three modalities using either Transformer or Mamba. We compared their accuracy and runtimes across 10, 20, 30, 40, and 50 training epochs to select the most suitable model. The detailed results are presented in Table 2 and Table 3.
Overall, the accuracy of the 12 tested models stabilized after 50 epochs. Compared to models handling simple modal information, the inclusion of the Transformer and Mamba modules significantly enhanced their performance. Notably, EfficientNet + TextCNN + MLP + Mamba achieved the highest accuracy of 97.82%. Both ResNet18 and EfficientNet exhibited similar performance after integrating the Transformer, with accuracy of 97.16%. However, after incorporating the Mamba module, EfficientNet showed a greater improvement than ResNet18, with 0.47% higher accuracy. This may be attributed to EfficientNet’s architecture, which is designed via neural architecture search (NAS) and better leverages its efficient computational structure when combined with the Mamba module, particularly in long-sequence tasks, resulting in higher accuracy and better resource utilization. While ResNet18 mitigates the vanishing gradient problem in deep networks through residual connections, its simpler design limits its ability to scale and optimize the model parameters dynamically, like EfficientNet. Consequently, ResNet18’s fixed structure is less capable of fully utilizing the Mamba module’s selective mechanism in complex multimodal tasks.
In terms of model efficiency, models incorporating the Transformer and Mamba modules generally took longer to train than simpler models, with those using the Mamba module requiring 1.17% more time than those using the Transformer. This increased training time for the Mamba module is likely due to its selective state-space mechanism, which, while improving the inference efficiency, introduces additional computational complexity during training due to its recursive and dynamic selection mechanisms. Although ResNet18 + TextCNN + MLP + Mamba is 6.64% faster than EfficientNet + TextCNN + MLP + Mamba, in real-world rescue operations, the accuracy of scene recognition is more critical. Incorrect decisions can further waste valuable time and resources, making EfficientNet + TextCNN + MLP + Mamba the preferred choice for rescue scenario analysis.

4.4. Experimental Analysis

4.4.1. Ablation Study

We examined the effects of image information, text information, probabilistic information, and the Mamba module within Mamba-MDRNet. Table 4 presents the results of different Mamba-MDRNet variants tested on a natural disaster dataset. A comparison of Variables 1, 2, and 3 highlights the performance differences when combining two types of modal information: probabilistic information had the greatest impact on the model's accuracy, while text information had the least. The results for Variable 4 demonstrate that the Mamba module effectively enhances the performance of multimodal models.
Comparing Variables 1 through 6 shows that the four modules complement and reinforce each other, collectively improving the model's performance. Variable 5 shows that, although probabilistic information has the largest effect on the model accuracy, using only probabilistic information for prediction results in lower accuracy than in models with any two modalities from Variables 1 to 3. This indicates that while LLMs can provide highly accurate probabilistic features for images, these features are compressed representations of the original data, neglecting crucial details. Relying solely on these compressed features causes the model to lose key spatial and textural information from the image, as well as rich contextual information from the text, resulting in inferior classification performance compared to Mamba-MDRNet.

4.4.2. Visual Interpretation

To intuitively validate the differences in image recognition between Mamba-MDRNet and the baseline visual model, we employed the Grad-CAM [39] visualization technique. Attention maps were generated for the traditional visual model baseline (EfficientNet) and our proposed Mamba-MDRNet method (EfficientNet + TextCNN + MLP + Mamba). The results are shown in Figure 3.
As illustrated, the heatmaps produced by the baseline model are generally smoother, with heat concentrated in localized regions and exhibiting softer color transitions. In contrast, the heatmaps generated by Mamba-MDRNet display finer-grained feature representations and more complex heat distributions, revealing detailed information in various disaster scenarios. The baseline model’s simpler color distributions may lead to misinterpretations in complex disaster contexts. However, the richer feature representations of Mamba-MDRNet potentially enhance its discriminative and generalization capabilities, particularly in disaster categories involving complex weather conditions, such as lightning and sandstorms.
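A sketch of how such Grad-CAM heatmaps can be produced for the EfficientNet baseline is shown below; it assumes the pytorch-grad-cam package, and the choice of the last convolutional block as the target layer, as well as the placeholder input image, are illustrative assumptions.

```python
# Assumes the pytorch-grad-cam package (pip install grad-cam).
import numpy as np
import torch
from torchvision import models
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model = models.efficientnet_b0(weights="IMAGENET1K_V1").eval()
target_layers = [model.features[-1]]            # last convolutional block (assumed target)

input_tensor = torch.randn(1, 3, 224, 224)      # replace with a normalized disaster image
rgb_img = np.random.rand(224, 224, 3).astype(np.float32)  # same image as float RGB in [0, 1]

cam = GradCAM(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(0)])   # class index of interest
heatmap = show_cam_on_image(rgb_img, grayscale_cam[0], use_rgb=True)
```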

4.4.3. Classification Report

To thoroughly analyze the issues encountered during the model’s classification process, we randomly selected a single run for evaluation. The performance was assessed using four metrics: precision, recall, F1-score, and support. The results are presented in Table 5.
The analysis reveals that the model performs exceptionally well across all 11 categories in terms of precision, with values consistently above 0.93. Notably, the precision for hail, lightning, and hurricanes approaches 1, while the performance in the snow, frost, and earthquake categories is slightly lower, at 0.937, 0.938, and 0.944, respectively. In terms of recall, hail, wildfire, lightning, and hurricanes achieve near-perfect scores, while the recall for the smog category is only 0.876, with the remaining categories all exceeding 0.95.
This indicates that the model tends to classify hard-to-distinguish natural disaster scenarios as negative classes. In such cases, the model is more cautious when predicting the positive class (correct disaster categories), only making positive predictions when the sample closely aligns with the positive class features. As a result, the positive class predictions are highly reliable (high precision), but many true positive samples are missed, leading to lower recall. Additionally, due to the diverse forms within certain disaster categories, some forms may be underrepresented or lack distinct features, prompting the model to adopt a conservative approach to avoid false positives, which further reduces the recall for the positive class.

4.4.4. Confusion Matrix

To further analyze the recall, we introduced a confusion matrix, as shown in Figure 4. Upon examining the matrix, we observed that the likelihood of misclassification increased when two natural disaster scenarios exhibited highly similar characteristics. In the selected results, snow was frequently misclassified as haze, with nine instances recorded. Additionally, floods and rain were confused with each other twice, haze and sandstorms were mistaken for frost twice, and floods were misidentified as earthquakes on two occasions.
These misclassifications can be attributed to shared features among these natural phenomena. For example, both snow and haze reduce visibility due to the presence of small particles in the air—whether snowflakes or pollutants—resulting in a blurred effect. Snowflakes in the air during snowfall can obscure vision, while suspended particles in haze create a similar hazy appearance. Moreover, both scenes tend to exhibit a generally dim environment. Likewise, floods and rain both result in water covering surfaces such as roads. If the model fails to detect the critical feature of raindrops, it may confuse these two scenarios.

4.4.5. Parameter Sensitivity

Table 6 illustrates the sensitivity of the Mamba-MDRNet network to the learning rate and batch size. Specifically, the batch size varies within {8, 16, 32, 64}, while the learning rate (LR) ranges within {$2 \times 10^{-3}$, $2 \times 10^{-4}$, $2 \times 10^{-5}$, $2 \times 10^{-6}$}. The results indicate the strong dependency of the model performance on both hyperparameters.
At a higher learning rate of $2 \times 10^{-3}$, the accuracy improves steadily with an increasing batch size, from 94.1% at a batch size of 8 to 96.3% at a batch size of 64. However, at $2 \times 10^{-4}$, the performance stabilizes, with marginal variations, peaking at 97.5% for a batch size of 32. When the learning rate is further reduced to $2 \times 10^{-5}$, the model achieves its best overall accuracy of 97.8% with a batch size of 8, although the performance fluctuates slightly at larger batch sizes. Conversely, a very low learning rate of $2 \times 10^{-6}$ results in a noticeable performance drop, with accuracies ranging from 93.6% to 95.7%, showing reduced robustness across batch sizes.
These findings underscore the importance of carefully tuning both the learning rate and batch size to optimize the model performance. Notably, the best trade-off between the two is observed at $2 \times 10^{-5}$ with a batch size of 8, achieving peak accuracy.

5. Conclusions

In summary, this study utilizes large models and the Mamba module to assess natural disaster scenarios. We propose a framework, Mamba-MDRNet, in which traditional visual models are enhanced with learned representations. By integrating text and probabilistic modality information from large models with the original image modality, we construct a multimodal model. The selective mechanism of the Mamba module dynamically adjusts the weights of each modality, selectively focusing on the most important features based on the quality, relevance, or confidence of the input modalities. Extensive experiments were conducted on our designed natural disaster dataset, and we validated the generalizability and effectiveness of the proposed algorithm across various heterogeneous visual perception architectures. In our comparisons, Mamba-MDRNet with EfficientNet achieved the best performance in classifying natural disaster scenarios. Our approach leverages large pre-trained models to enhance traditional visual models without requiring substantial computational resources to train and deploy LLMs.
This method facilitates more efficient natural disaster scenario classification, even in resource-constrained environments, helping to inform appropriate rescue plans. As large pre-trained models continue to evolve, they will capture more modality information in ambiguous scenarios, leading to more precise disaster scene classification.

Author Contributions

Conceptualization, Y.S. and L.X.; methodology, Y.S.; software, Y.S.; validation, Y.S. and L.X.; formal analysis, Y.S.; investigation, Y.S.; resources, L.X.; data curation, Y.S.; writing—original draft preparation, Y.S.; writing—review and editing, Y.S.; visualization, Y.S.; supervision, Y.S. and L.X.; project administration, Y.S. and L.X.; funding acquisition, L.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by the Yuxiu Innovation Project of NCUT (Project No. 2024NCUTYXCX104).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available at https://ieee-dataport.org/documents/natural-disaster-scene (accessed on 3 December 2024).

Acknowledgments

The authors would like to express their sincere gratitude and respect to their supervisor, Liwen Xu. They are grateful for his continuous support and assistance, and his guidance has enabled them to constantly grow and progress on their academic path. They will always cherish the precious learning experience gained under his guidance.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large Language Model
MDRNet: Mamba Disaster Recognition Network
CNN: Convolutional Neural Network
ViT: Vision Transformer
EM-DAT: Emergency Events Database
GPT: Generative Pretrained Transformer
MLP: Multilayer Perceptron
NLP: Natural Language Processing
BoW: Bag of Words
TF-IDF: Term Frequency-Inverse Document Frequency
SIFT: Scale-Invariant Feature Transform
HOG: Histogram of Oriented Gradients
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
VLM: Vision Language Model
T5: Text-To-Text Transfer Transformer
VQA: Visual Question Answering
DALL·E: Deep Learning-Based Image Generation Model by OpenAI
BLIP: Bootstrapping Language-Image Pretraining
Gemini 1.5: Google DeepMind's Multimodal Model Series
AI: Artificial Intelligence
MML: Multimodal Learning
SSM: State-Space Model
CLIP: Contrastive Language-Image Pretraining
UNITER: UNiversal Image-TExt Representation
JPG: Joint Photographic Experts Group
NAS: Neural Architecture Search

References

  1. González, F.A.I.; Santos, M.E.; London, S. Persistent effects of natural disasters on human development: Quasi-experimental evidence for Argentina. Environ. Dev. Sustain. 2021, 23, 10432–10454. [Google Scholar] [CrossRef]
  2. Weber, E.; Marzo, N.; Papadopoulos, D.P.; Biswas, A.; Lapedriza, A.; Ofli, F.; Imran, M.; Torralba, A. Detecting natural disasters, damage, and incidents in the wild. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIX 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 331–350. [Google Scholar]
  3. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
  4. Kalinicheva, E.; Ienco, D.; Sublime, J.; Trocan, M. Unsupervised change detection analysis in satellite image time series using deep learning combined with graph-based approaches. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1450–1466. [Google Scholar] [CrossRef]
  5. Oveis, A.H.; Giusti, E.; Ghio, S.; Martorella, M. A survey on the applications of convolutional neural networks for synthetic aperture radar: Recent advances. IEEE Aerosp. Electron. Syst. Mag. 2021, 37, 18–42. [Google Scholar] [CrossRef]
  6. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  7. Yang, H.; Zhao, Y.; Wu, Y.; Wang, S.; Zheng, T.; Zhang, H.; Che, W.; Qin, B. Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey. arXiv 2024, arXiv:2406.08068. [Google Scholar]
  8. Nadeem, M.; Sohail, S.S.; Javed, L.; Anwer, F.; Saudagar, A.K.J.; Muhammad, K. Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition. Cogn. Comput. 2024, 16, 1–14. [Google Scholar] [CrossRef]
  9. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A comprehensive overview of large language models. arXiv 2023, arXiv:2307.06435. [Google Scholar]
  10. Yang, W.; Li, C.; Zhang, J.; Zong, C. Bigtranslate: Augmenting large language models with multilingual translation capability over 100 languages. arXiv 2023, arXiv:2305.18098. [Google Scholar]
  11. Reid, M.; Savinov, N.; Teplyashin, D.; Lepikhin, D.; Lillicrap, T.; Alayrac, J.-B.; Soricut, R.; Lazaridou, A.; Firat, O.; Schrittwieser, J.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar]
  12. Goyal, V.; Sharma, S.; Garg, B. Texture Classification Using ResNet and EfficientNet. In Machine Intelligence Techniques for Data Analysis and Signal Processing, Proceedings of the 4th International Conference MISP 2022, Raipur, India, 12–14 March 2022; Springer: Berlin/Heidelberg, Germany, 2023; Volume 1, pp. 173–185. [Google Scholar]
  13. Ding, N.; Tang, Y.; Fu, Z.; Xu, C.; Han, K.; Wang, Y. GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? arXiv 2023, arXiv:2306.00693. [Google Scholar]
  14. Johnson, R.; Zhang, T. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. arXiv 2014, arXiv:1412.1058. [Google Scholar]
  15. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Jin, R.; Zhou, Z.-H. Understanding bag-of-words model: A statistical framework. Int. J. Mach. Learn. Cybern. 2010, 1, 43–52. [Google Scholar] [CrossRef]
  17. Kim, S.-W.; Gil, J.-M. Research paper classification systems based on TF-IDF and LDA schemes. Hum. Centric Comput. Inf. Sci. 2019, 9, 1–21. [Google Scholar] [CrossRef]
  18. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  19. Carcagnì, P.; Del Coco, M.; Leo, M.; Distante, C. Facial expression recognition and histograms of oriented gradients: A comprehensive study. SpringerPlus 2015, 4, 645. [Google Scholar] [CrossRef]
  20. Ghojogh, B.; Ghodsi, A. Recurrent neural networks and long short-term memory networks: Tutorial and survey. arXiv 2023, arXiv:2304.11461. [Google Scholar]
  21. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
  22. Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef]
  23. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  24. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 140, 1–67. [Google Scholar]
  25. Zbinden, R. Implementing and experimenting with diffusion models for text-to-image generation. arXiv 2022, arXiv:2209.10948. [Google Scholar]
  26. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR: Cambridge, MA, USA, 2022; pp. 12888–12900. [Google Scholar]
  27. Yuan, L.; Chen, D.; Chen, Y.-L.; Codella, N.; Dai, X.; Gao, J.; Hu, H.; Huang, X.; Li, B.; Li, C.; et al. Florence: A new foundation model for computer vision. arXiv 2021, arXiv:2111.11432. [Google Scholar]
  28. Zhu, X.; Huang, Y.; Wang, X.; Wang, R. Emotion recognition based on brain-like multimodal hierarchical perception. Multimed. Tools Appl. 2024, 83, 56039–56057. [Google Scholar] [CrossRef]
  29. Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108. [Google Scholar] [CrossRef]
  30. Wang, X.; Wang, S.; Ding, Y.; Li, Y.; Wu, W.; Rong, Y.; Kong, W.; Huang, J.; Li, S.; Yang, H.; et al. State space model for new-generation network alternative to transformers: A survey. arXiv 2024, arXiv:2404.09516. [Google Scholar]
  31. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  32. Jelassi, S.; Brandfonbrener, D.; Kakade, S.M.; Malach, E. Repeat after me: Transformers are better than state space models at copying. arXiv 2024, arXiv:2402.01032. [Google Scholar]
  33. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
  34. Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the Computer Vision—ECCV 2020, 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 104–120. [Google Scholar]
  35. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  36. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  38. Tan, M. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  39. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. The overall framework diagram of the Mamba-MDRNet learning algorithm. Only the model within the dashed box will be deployed and used for inference after training.
Figure 2. Generated by Gemini 1.5 using prompts.
Figure 3. Grad-CAM heatmaps for 11 natural disaster categories.
Figure 4. Comparison of confusion matrices for natural disaster classification.
Table 1. Comparison of the performance of different visual models integrated with Mamba-MDRNet.
Model | Top-1 Accuracy (%) | Improvement (%)
CNN | 73.35 | –
Mamba-MDRNet (Image Processing Module is CNN) | 91.75 | ↑18.40
ViT | 68.73 | –
Mamba-MDRNet (Image Processing Module is ViT) | 94.50 | ↑25.77
ResNet18 | 95.01 | –
Mamba-MDRNet (Image Processing Module is ResNet18) | 97.25 | ↑2.24
EfficientNet | 95.17 | –
Mamba-MDRNet (Image Processing Module is EfficientNet) | 97.82 | ↑2.65
Table 2. Accuracy comparison across 10, 20, 30, 40, and 50 epochs for different models integrated with TextCNN, MLP, Transformer, and Mamba.
Model | 10-Epoch Accuracy (%) | 20-Epoch Accuracy (%) | 30-Epoch Accuracy (%) | 40-Epoch Accuracy (%) | 50-Epoch Accuracy (%)
CNN + TextCNN + MLP | 92.63 | 94.12 | 94.69 | 92.81 | 93.84
ViT + TextCNN + MLP | 93.55 | 94.22 | 94.31 | 94.41 | 94.50
ResNet18 + TextCNN + MLP | 93.55 | 94.12 | 94.50 | 94.50 | 94.79
EfficientNet + TextCNN + MLP | 92.70 | 92.99 | 93.46 | 93.46 | 94.22
CNN + TextCNN + MLP + Transformer | 89.10 | 89.76 | 89.76 | 91.75 | 92.89
ViT + TextCNN + MLP + Transformer | 94.12 | 94.31 | 94.69 | 95.17 | 95.17
ResNet18 + TextCNN + MLP + Transformer | 96.68 | 96.78 | 96.97 | 96.97 | 97.16
EfficientNet + TextCNN + MLP + Transformer | 96.97 | 97.16 | 97.06 | 97.16 | 97.16
CNN + TextCNN + MLP + Mamba | 90.71 | 91.28 | 91.56 | 91.56 | 91.75
ViT + TextCNN + MLP + Mamba | 92.32 | 92.89 | 93.18 | 93.36 | 94.50
ResNet18 + TextCNN + MLP + Mamba | 96.68 | 96.97 | 97.16 | 97.16 | 97.25
EfficientNet + TextCNN + MLP + Mamba | 97.44 | 97.54 | 97.82 | 97.82 | 97.82
Table 3. Comparison of training time (in seconds) for different models with TextCNN, MLP, Transformer, and Mamba, evaluated across 10, 20, 30, 40, and 50 epochs.
Model | 10 Epochs (s) | 20 Epochs (s) | 30 Epochs (s) | 40 Epochs (s) | 50 Epochs (s)
CNN + TextCNN + MLP | 1330.80 | 2633.33 | 3939.43 | 5154.41 | 6532.31
ViT + TextCNN + MLP | 2426.06 | 4861.36 | 7314.91 | 9726.87 | 12138.95
ResNet18 + TextCNN + MLP | 1349.23 | 2697.17 | 4024.09 | 5328.33 | 6737.37
EfficientNet + TextCNN + MLP | 1537.36 | 3029.55 | 4604.01 | 6142.67 | 7478.36
CNN + TextCNN + MLP + Transformer | 1337.27 | 2619.55 | 3892.58 | 5177.68 | 6540.05
ViT + TextCNN + MLP + Transformer | 2495.81 | 5002.86 | 7481.98 | 9790.50 | 12479.33
ResNet18 + TextCNN + MLP + Transformer | 1392.34 | 2776.66 | 4164.51 | 5590.57 | 6958.78
EfficientNet + TextCNN + MLP + Transformer | 1500.34 | 2989.56 | 4464.44 | 5936.96 | 7380.73
CNN + TextCNN + MLP + Mamba | 1314.78 | 2624.52 | 3983.90 | 5313.03 | 6677.78
ViT + TextCNN + MLP + Mamba | 2494.70 | 5002.63 | 7519.97 | 10022.49 | 12683.56
ResNet18 + TextCNN + MLP + Mamba | 1489.99 | 2964.33 | 4279.99 | 5657.92 | 6972.15
EfficientNet + TextCNN + MLP + Mamba | 1624.76 | 3203.93 | 4686.35 | 5935.88 | 7434.97
Table 4. Ablation study results for different Mamba-MDRNet variants on a natural disaster dataset.
Variable | EfficientNet | TextCNN | MLP | Mamba | Accuracy (%)
Variable 1 | | | | | 92.32
Variable 2 | | | | | 97.16
Variable 3 | | | | | 95.07
Variable 4 | | | | | 94.22
Variable 5 | | | | | 91.20
Variable 6 | | | | | 97.91
Table 5. Classification performance metrics by category.
Class | Precision | Recall | F1-Score | Support
Hail | 1.000 | 1.000 | 1.000 | 127
Snow | 0.937 | 0.978 | 0.957 | 137
Earthquake | 0.944 | 0.986 | 0.965 | 69
Rain | 0.982 | 0.956 | 0.969 | 113
Flood | 0.976 | 0.976 | 0.976 | 84
Wildfire | 0.989 | 1.000 | 0.994 | 88
Hurricane | 0.990 | 0.990 | 0.990 | 98
Lightning | 1.000 | 1.000 | 1.000 | 76
Sandstorm | 0.988 | 0.964 | 0.976 | 83
Haze | 0.987 | 0.876 | 0.929 | 89
Frost | 0.938 | 0.989 | 0.963 | 91
Table 6. Sensitivity of Mamba-MDRNet to learning rate and batch size.
Learning Rate \ Batch Size | 8 | 16 | 32 | 64
$2 \times 10^{-3}$ | 94.1 | 94.6 | 95.2 | 96.3
$2 \times 10^{-4}$ | 97.1 | 97.2 | 97.5 | 97.3
$2 \times 10^{-5}$ | 97.8 | 97.0 | 97.3 | 97.5
$2 \times 10^{-6}$ | 95.7 | 93.6 | 93.6 | 94.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
