1. Introduction
Landslides are the downward or outward movement of slope-forming materials under the influence of gravity. The moving material may be rock, soil, artificial fill, or a combination of these. Landslides cause severe damage to natural environments, property, and personal safety all over the world. Creating a landslide inventory map is crucial to recording the location, distribution, number, and extent of landslides in a study area [1,2]. Landslides occur frequently in China, causing significant loss of life, property damage, and economic loss. From 2018 to 2021, the number of landslides and the losses they caused consistently ranked first among all geological disasters in China. Landslides can be triggered by various factors such as rainfall, earthquakes, permafrost degradation, reservoir filling, and urbanization, resulting in the downhill movement of rock and soil [1,3,4,5]. Knowing their occurrence, distribution, and trends is vital to creating disaster prevention strategies and improving disaster reduction efforts. Recognizing landslides promptly is crucial for quickly determining their location, quantity, and distribution after they occur [6].
Conventional methods for recognizing potential landslide areas and updating landslide inventories rely on field surveys, which are time-consuming, costly, and inefficient. With the rapid development of remote sensing technology and machine learning algorithms, automatic landslide recognition from multi-source remote sensing data has become feasible and promising in geoscience research, and interest in recognizing landslides from optical images has grown. Digital elevation model (DEM) data, which provide topographic information, play a crucial role in recognizing landslides [7,8,9]. Since the recognition of landslides from remote sensing images can be framed as a pixel-level image classification problem, a variety of statistical and deep learning methods have been applied. Convolutional neural networks (CNNs) are currently the mainstream deep learning method and have been applied to this task [10]. For instance, Cai et al. [11] proposed a deep learning model with dense connections based on patch blocks. Supervised machine learning and statistical methods depend heavily on the availability of high-quality labeled landslide data for training and evaluation. Therefore, constructing labeled landslide recognition datasets is crucial for accurate recognition and analysis of landslide regions.
In recent years, landslide recognition has benefited from advances in deep learning, particularly Convolutional Neural Networks (CNNs) and Transformer models with attention mechanisms. CNNs have emerged as the dominant approach in landslide recognition due to their ability to learn representations through convolutions. This enables the network to automatically recognize semantic features associated with landslide bodies without manual calculation of complex landslide features. CNNs have been applied to tasks including image classification [12,13], object detection [14,15], and semantic segmentation [16,17].
However, CNNs have inherent limitations. Backpropagation often leads to slow parameter updates and convergence to local optima, pooling layers lose information, and the extracted features are difficult to interpret. To overcome these challenges, researchers have introduced Transformer models with attention mechanisms, which offer notable advantages. The Transformer model, initially proposed by a Google team in 2017 [18], uses self-attention modules in place of the convolutional components of a CNN. It employs multiple attention heads that can specialize in distinct tasks and capture diverse features of the input data, which can enhance landslide recognition performance in remote sensing images (RSIs).
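As a rough illustration of the multi-head self-attention idea described above, the following NumPy sketch computes scaled dot-product attention over a small set of tokens. It is a minimal sketch under stated assumptions, not the implementation used in this study: the function name `multi_head_self_attention`, the weight matrices, and the token sizes are all hypothetical, and details such as layer normalization, bias terms, and masking are omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, n_heads):
    """Hypothetical helper. x: (n_tokens, d_model); all weights: (d_model, d_model)."""
    n, d = x.shape
    dh = d // n_heads  # width of each attention head

    def split(t):
        # (n_tokens, d_model) -> (n_heads, n_tokens, dh)
        return t.reshape(n, n_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)    # (n_heads, n_tokens, n_tokens)
    attn = softmax(scores)                             # each row sums to 1
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)  # concatenate the heads
    return out @ wo, attn

# Illustrative call with random (assumed) inputs: 6 tokens, 8 channels, 2 heads.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
weights = [rng.normal(size=(8, 8)) for _ in range(4)]
output, attention = multi_head_self_attention(tokens, *weights, n_heads=2)
```

Each head produces its own attention map, which is what allows different heads to respond to different relationships between tokens.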
The Transformer model excels at learning intricate relationships between various data features, including geological characteristics, climate data, and satellite imagery, enabling it to effectively recognize landslide areas. Additionally, its ability to simultaneously analyze data from multiple sources and formats makes it more versatile compared to traditional landslide recognition models. By leveraging attention mechanisms, the Transformer model provides insights into the attention distribution within the model, further enhancing interpretability and understanding of the landslide recognition process.
Hinton et al. first introduced the concept of knowledge distillation in their article “Distilling the knowledge in a neural network” [19]. The core idea is that once a complex network model is trained, a smaller model can be extracted from it using another training method. A knowledge distillation framework therefore usually contains a large model (the teacher model) and a small model (the student model).
However, few studies have applied Transformer models to landslide disasters [20]. Lv et al. proposed a shape-enhanced visual transformer (ShapeFormer) model for better extraction and retention of multiscale shape information of landslide bodies [21]. Deep learning models are becoming increasingly popular in geoscience applications, but they still face challenges when dealing with large-scale natural disasters such as landslides. The main difficulty is that data processing time increases significantly for models with a large number of parameters, which reduces efficiency and places high demands on computer hardware. To address these issues, this study makes two main contributions to improving the efficiency of deep learning models for landslide recognition: (1) We designed a novel deep learning network using the Swin-Transformer as the backbone structure, with the aim of reducing the number of model parameters while maintaining accuracy in landslide recognition. Our goal was to improve processing efficiency and reduce the model’s running time. The experimental results showed that our model performed well in recognizing landslides. (2) We constructed a multi-source landslide recognition dataset in the study area with three variants: RSIs only, landslide influencing factors (LIFs) only, and RSIs + LIFs. The combination of spectral and environmental factors is introduced to improve the performance of deep learning in landslide recognition.
3. Methods
3.1. Residual Network (ResNet)
CNNs have grown deeper over time to extract higher-level features: LeNet had only 5 layers, AlexNet had 8, VGGNet had 19, and GoogLeNet had 22. However, increasing the number of layers can lead to the vanishing gradient problem, which can reduce accuracy. Traditional methods use data initialization and regularization to address this problem, whereas ResNet uses residual units to transfer inputs directly across layers, which improves feature expression and performance. The key to ResNet is the residual unit. It contains a cross-layer connection that passes the input directly across layers as an identity mapping and adds it to the result of the convolution operations. If the input is x, the output of the convolutional layers is a nonlinear function F(x), and the output of the unit is H(x), then H(x) = F(x) + x. This output can still be nonlinearly transformed. The residual refers to the “difference” F(x) = H(x) − x, so the network learns the residual function F(x) instead of fitting the full mapping H(x) directly. The residual function is easier to optimize than the direct mapping.
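The relation H(x) = F(x) + x can be sketched in a few lines of NumPy. This is a simplified illustration that assumes dense (fully connected) layers in place of convolutions; `residual_unit`, `w1`, and `w2` are hypothetical names, and batch normalization is left out.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, w1, w2):
    """Return H(x) = F(x) + x, where F is a small two-layer branch.

    Assumption: dense layers stand in for the convolutions of a real ResNet unit.
    """
    f = relu(x @ w1) @ w2  # residual branch F(x)
    return f + x           # shortcut connection adds the unchanged input

x = np.array([[1.0, -2.0, 3.0]])
w_zero = np.zeros((3, 3))
# With all-zero weights F(x) = 0, so the unit reduces to the identity mapping.
h = residual_unit(x, w_zero, w_zero)
```

The zero-weight case shows why the residual form is convenient: the unit only has to learn a correction F(x) on top of an identity mapping that it gets for free.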
Figure 3 shows the classical transformer structure, which might help readers understand the difference between transformer structures and CNNs.
3.2. Swin-Transformer (Swin)
Swin-Transformer is a variant of the traditional Transformer architecture that uses “local attention windows” to reduce computation and memory consumption while maintaining accuracy. It divides input features into smaller patches to handle larger input images and uses a hierarchical structure to capture different features at different levels. Swin also uses a “Shifted Window” scheme so that each window can interact with neighboring windows. Swin has achieved good performance in computer vision tasks such as image classification, object detection, and semantic segmentation.
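The window mechanism can be illustrated with a short NumPy sketch: a feature map is cut into non-overlapping windows, and a cyclic shift by half a window moves the window grid so that the next layer groups pixels differently. This is a minimal sketch with assumed function names (`window_partition`, `shift_windows`) and sizes; the attention computation inside each window and the masking used for shifted windows are omitted.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows."""
    h, w, c = x.shape
    assert h % ws == 0 and w % ws == 0, "H and W must be multiples of ws"
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    # -> (num_windows, ws, ws, C), windows ordered row by row
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, c)

def shift_windows(x, ws):
    """Cyclically shift the map up and left by half a window."""
    s = ws // 2
    return np.roll(x, shift=(-s, -s), axis=(0, 1))

# Illustrative 8 x 8 single-channel map with a window size of 4 (assumed values).
feature_map = np.arange(64).reshape(8, 8, 1)
regular = window_partition(feature_map, 4)                    # 4 windows
shifted = window_partition(shift_windows(feature_map, 4), 4)  # 4 windows, offset grid
```

Attention is computed independently inside each window, so the cost grows with the number of windows rather than with the square of the number of pixels.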
3.3. Data-Efficient Image Transformers (DeiT)
Data-efficient image Transformers (DeiT) is a deep learning model that uses a variant of the transformer architecture called “vision transformer” (ViT) for image classification. ViT divides an image into patches, treats each patch as a token, and processes them using transformer layers to capture spatial relationships. DeiT is designed to be data-efficient, achieving high performance on image classification tasks with relatively small amounts of training data using techniques such as distillation, data augmentation, and regularization. Its success demonstrates the potential of transformer-based architectures for computer vision applications.
3.4. Distilled Swin-Transformer (DST)
We propose a new deep learning model based on the Swin-Transformer architecture, which aims to address the issues of large model parameter sizes and low efficiency. Our approach incorporates knowledge distillation, which trains a lightweight student model using supervisory information from a larger and better-performing teacher model. This helps reduce computational costs while maintaining accuracy.
Unlike other model compression techniques such as pruning and quantization, knowledge distillation transfers knowledge from the teacher model to the student model during training. This allows the student model to learn from the teacher model’s supervisory information and improve its performance. Knowledge distillation also compresses network parameters, which reduces the overall parameter size of the model.
In our proposed model, the input image is first divided into smaller patch blocks using the Patch Partition layer. These blocks are then linearly embedded to capture relevant information within each patch. Distillation Tokens, serving as additional learnable parameters, are introduced between the Patch Embedding and Position Embedding layers to enable knowledge transfer from a larger teacher model. The patch representations, along with the added Distillation Tokens, are processed through the Patch Embedding and Position Embedding layers, encoding both distillation and positional information. Multiple Transformer layers then capture long-range dependencies and extract high-level features. Finally, the representations are sent to the Classifier Head for classification.
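The token-preparation steps described above can be sketched as follows. This is a simplified illustration, assuming a single distillation token appended after patch embedding and before the position embedding is added; `patchify`, `build_tokens`, and all array shapes are hypothetical, and the actual DST layers may differ.

```python
import numpy as np

def patchify(img, p):
    """Cut an (H, W, C) image into flattened p x p patches: (num_patches, p*p*C)."""
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)

def build_tokens(img, p, w_embed, dist_token, pos_embed):
    """Patch partition -> linear embedding -> add distillation token -> add positions."""
    patches = patchify(img, p) @ w_embed       # (N, D) patch embeddings
    tokens = np.vstack([patches, dist_token])  # (N + 1, D), distillation token last
    return tokens + pos_embed                  # position embedding for every token

# Assumed sizes: 8 x 8 RGB image, 4 x 4 patches, embedding width 16.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))
w_embed = rng.normal(size=(4 * 4 * 3, 16))  # learnable in a real model
dist_token = rng.normal(size=(1, 16))       # learnable in a real model
pos_embed = rng.normal(size=(5, 16))        # 4 patches + 1 distillation token
tokens = build_tokens(image, 4, w_embed, dist_token, pos_embed)
```

In a trained model, the output at the distillation-token position would be supervised by the teacher, while the remaining tokens carry the image content.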
Through distillation, the student model learns from the supervisory information of a larger, high-performing teacher model, which compresses the network parameters and decreases the overall parameter size of the model. Despite the compression, the model maintains comparable accuracy.
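One common way to express this teacher-to-student transfer is the soft-target loss of Hinton et al. [19], sketched below in NumPy. The text does not state the exact loss used for DST, so this is an assumption for illustration only: the temperature `t`, the weight `alpha`, and the function name `distillation_loss` are hypothetical.

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, t=2.0, alpha=0.5):
    """Weighted sum of cross-entropy (true labels) and KL divergence (teacher).

    Assumption: Hinton-style soft distillation; t and alpha are illustrative.
    """
    eps = 1e-12
    p_s = softmax(student_logits)
    ce = -np.mean(np.log(p_s[np.arange(len(labels)), labels] + eps))
    q_t = softmax(teacher_logits, t)  # softened teacher distribution
    q_s = softmax(student_logits, t)  # softened student distribution
    kl = np.mean(np.sum(q_t * (np.log(q_t + eps) - np.log(q_s + eps)), axis=-1))
    return (1.0 - alpha) * ce + alpha * t * t * kl

# Two samples, two classes (non-landslide = 0, landslide = 1); values are made up.
student = np.array([[2.0, 0.5], [0.1, 1.5]])
teacher = np.array([[3.0, 0.2], [0.0, 2.5]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

The KL term vanishes when the student reproduces the teacher's softened outputs, so minimizing the loss pulls the smaller model toward the teacher's behavior while the cross-entropy term keeps it tied to the true labels.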
Figure 4 illustrates the resulting model architecture of DST.
3.5. Flowchart of Landslide Recognition
The experimental workflow consists of seven parts. In the first and second parts, geological maps, RSIs, Google Earth images, and DEM data were acquired; these data and historical landslide records were used to prepare the landslide labels and to generate the nine LIFs. In the third and fourth parts, the nine LIFs were screened and the sample database for the Zigui landslide area was generated. In the fifth part, the samples were divided into training, validation, and test sets in a ratio of 6:2:2, and landslide extraction was performed using DST and the other models. In the sixth and seventh parts, the performance of the landslide recognition models was evaluated and compared, and the effects of some parameters on the experimental results were discussed. The complete experimental flow used in this paper is shown in Figure 5.
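The 6:2:2 division in the fifth part could be implemented along the following lines. This is a minimal sketch that assumes a simple random split over sample indices; the text does not say whether the split was random or stratified, and `split_indices` and the seed are hypothetical.

```python
import numpy as np

def split_indices(n, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split n sample indices into train, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
```

With strongly imbalanced classes, a stratified split that preserves the landslide to non-landslide ratio in each subset may be preferable to a purely random one.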
3.6. Model Evaluation Metrics
In this paper, five statistical metrics, namely Overall Accuracy (OA), Precision, Recall, F1-score (F1), and Kappa coefficient (Kappa), are used to evaluate the model’s performance. OA is the proportion of correctly classified pixels among all pixels; Precision indicates how many of the samples predicted as landslides are real landslide samples; Recall indicates how many of the landslide samples in the sample set are correctly predicted; and F1 is the harmonic mean of Precision and Recall. The values of these four evaluation metrics range from 0 to 1, and the closer the value is to 1, the better the performance of the corresponding model. The Kappa coefficient is a statistical measure of consistency, and we used it to evaluate the accuracy of the classification model. Kappa quantifies the agreement between the classification results and the true labels. When its value is greater than 0.8, the agreement can be considered good.
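The calculation formulas, in their standard confusion-matrix form, are as follows:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{Kappa} = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \mathrm{OA}, \qquad p_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{N^2}$$
where $N = TP + TN + FP + FN$ is the total number of samples.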
True Positive (TP) means that the true value is a landslide and the predicted value is also a landslide, which means that the landslide sample is correctly predicted; False Positive (FP) means that the sample with the true value of non-landslide is recognized as a landslide, which means over-identification; False Negative (FN) means that the sample with the true value of landslide is predicted as non-landslide, which means the omission of identification; and True Negative (TN) means that the non-landslide sample is correctly predicted.
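The confusion-matrix counts defined above can be turned into the five metrics with a few lines of Python. This is a minimal sketch for the binary landslide versus non-landslide case using the standard definitions; the function name `classification_metrics` and the example counts are hypothetical, not values from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute OA, Precision, Recall, F1, and Kappa from binary confusion counts."""
    n = tp + fp + fn + tn
    oa = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Expected chance agreement from the row and column totals.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (oa - p_e) / (1 - p_e)
    return {"OA": oa, "Precision": precision, "Recall": recall,
            "F1": f1, "Kappa": kappa}

# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

For these made-up counts, OA, Precision, Recall, and F1 all equal 0.8, while Kappa is 0.6 because half of the agreement would be expected by chance.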
In addition to these metrics, four other factors are selected to measure the model run efficiency: the number of model parameters (Params), the number of floating-point operations (FLOPs), the average iteration time, and the model run time. For models with similar OA, a shorter running time indicates higher run efficiency.
3.7. Model Hyperparameter Settings
Table 2 displays the hyperparameter settings used for the landslide recognition model. The model was trained for 150 epochs using a sigmoid activation function, an AdamW optimizer, and a learning rate of 0.0000003 (3 × 10⁻⁷). To ensure a fair comparison, the same hyperparameter settings were applied to all models.
3.8. Landslide Influencing Factor Analysis
3.8.1. Landslide Influencing Factor Analysis
After historical research and expert analysis, it appears that there may be statistical correlations and collinearity relationships among the initially selected LIFs. These relationships can potentially lead to an inaccurate analysis of the true relationship between LIFs and landslides in the landslide recognition model. Additionally, there are multiple factors that can affect landslides, and the abundance of information in the evaluation factors may affect the accuracy of the landslide recognition results. To address these concerns, this study has adopted a quantitative approach to evaluate and select LIFs from three different perspectives: correlation analysis, collinearity testing, and importance evaluation.
3.8.2. Correlation Analysis
In this study, the Pearson correlation coefficient (PCC) was used to characterize the correlation between each pair of the selected LIFs [23]. In statistics, PCC measures the linear correlation between two variables, X and Y, with values ranging from −1 to 1. This linear correlation can be intuitively expressed as whether Y increases or decreases as X increases. A positive PCC indicates a positive correlation, and a negative PCC indicates a negative correlation. When the two variables lie exactly on a straight line, the PCC is equal to 1 or −1; if there is no linear relationship between the two variables, the PCC is 0; and the closer the absolute value is to 1, the stronger the correlation.
The heat map of the correlation coefficients of the nine influencing factors in the study area is shown in Figure 6, where darker colors indicate larger values and stronger correlations, positive numbers indicate positive correlations, and negative numbers indicate negative correlations. A few factors are weakly negatively correlated with each other, while most are positively correlated, but not strongly. In general, the correlation between these factors is low, and all correlation coefficients are below the critical value of 0.7, so no factors were removed in this study.
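The screening rule described above (flag factor pairs whose absolute PCC reaches 0.7) can be sketched with NumPy as follows. The factor names and values are made up for illustration, and `correlated_pairs` is a hypothetical helper, not part of the workflow reported here.

```python
import numpy as np

def correlated_pairs(factors, names, threshold=0.7):
    """Return the PCC matrix and the factor pairs with |PCC| >= threshold.

    factors: array of shape (n_samples, n_factors), one column per factor.
    """
    r = np.corrcoef(factors, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return r, pairs

# Made-up data: "b" is an exact linear function of "a"; "c" is nearly unrelated.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
data = np.column_stack([a, 2 * a + 1, [1, -1, 1, -1, 1, -1]])
pcc, flagged = correlated_pairs(data, ["a", "b", "c"])
```

Here only the pair ("a", "b") is flagged, since its PCC is 1, while the other pairs stay well below the threshold.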
3.8.3. Collinearity Testing
To further assess the correlation between LIFs, a multicollinearity analysis was performed on the factors [24,25]. In this study, the Variance Inflation Factor (VIF) and Tolerance (TOL) of each LIF were calculated using SPSS Statistics software [24]. For a set of n independent variables X = {X1, X2, …, Xn}, the VIF and TOL of the ith variable are calculated as:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}, \qquad \mathrm{TOL}_i = \frac{1}{\mathrm{VIF}_i} = 1 - R_i^2$$
where $R_i^2$ represents the coefficient of determination of the ith independent variable when regressed on all other predictors in the model. TOL is the reciprocal of VIF and takes values between 0 and 1; the closer TOL is to 1, the weaker the collinearity. Typically, a factor should be removed if its VIF is greater than 10 (or greater than 5 under a stricter criterion). As shown in Table 3, there was no multicollinearity between any of the LIFs.
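The study computed VIF and TOL in SPSS; the same quantities can be sketched in NumPy by regressing each factor on the others, as below. This is an illustrative sketch with synthetic data, and `vif_and_tol` is a hypothetical helper that assumes no factor is an exact linear combination of the others.

```python
import numpy as np

def vif_and_tol(x):
    """Return (VIF, TOL) arrays for the columns of x, shape (n_samples, n_factors)."""
    n, k = x.shape
    vif = np.empty(k)
    for i in range(k):
        y = x[:, i]
        others = np.delete(x, i, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other factors
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        ss_res = np.sum((y - design @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot                      # R_i^2 of factor i
        vif[i] = 1.0 / (1.0 - r2)
    return vif, 1.0 / vif

# Synthetic factors: the second is almost a copy of the first, the third is independent.
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = f1 + 0.05 * rng.normal(size=200)
f3 = rng.normal(size=200)
vif, tol = vif_and_tol(np.column_stack([f1, f2, f3]))
```

In this synthetic case the first two factors receive very large VIF values and would be candidates for removal, while the independent third factor stays close to 1.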
4. Results Analysis
4.1. Performance Analysis of the Dataset Type
In order to verify that the additional LIFs + RSIs-based dataset can improve the performance of landslide recognition, three different dataset types are fed into our proposed model (DST) for landslide recognition. The results are shown in
Figure 7.
The results of landslide recognition based on the RSIs dataset are shown in
Figure 7a. The results contain both false detections and omissions. In general, most of the recognized landslides correspond to the actual landslide boundaries, but some recognized areas extend beyond them. Small landslides were recognized inadequately, or only a few of their pixels were recognized. Two factors may have contributed to this. First, the study area had a significant class imbalance, with a ratio of non-landslide to landslide pixels of 16:1. Second, some small landslides may have been obscured by vegetation cover and were therefore not fully captured in the optical imagery. The results of the LIFs-based landslide recognition are shown in
Figure 7b. It can be seen that the overall landslide map is not satisfactory, with numerous errors and omissions. Some landslides close to each other are not accurately distinguished. The reason for this is that the LIFs do not carry the spectral, textural, geometric, and spatial features of the landslides, which have a great influence on the landslide recognition results.
Figure 7c shows the results of using the RSIs + LIFs for landslide recognition. The overall recognition results show some improvement compared to using only RSIs, with an OA improvement of 0.8381%. These results indicate that combining RSIs with LIFs benefits landslide recognition and that the new landslide sample library is effective.
To quantitatively evaluate the model performance and landslide recognition results for the three training datasets, the evaluation metrics described in Section 3.6 were calculated from confusion matrices and are shown in Table 4. The RSIs + LIFs-based model has the highest OA, Recall, F1, and Kappa. A higher Recall means that more of the actual landslides were recognized, and a higher Precision means that more of the recognized landslides were correct. The experiments show that combining RSIs with LIFs makes it easier for the model to distinguish landslides from bare rock, soil, and other ground features, which could significantly improve the recognition accuracy and precision of the models in the Zigui study area. However, the results also show that using only LIF data is not sufficient, indicating that optical imagery plays a dominant role in landslide recognition. The overall evaluation shows that the RSIs + LIFs-based dataset has the best recognition effect. Qualitative and quantitative evaluation of the models with the three training datasets shows that the LIFs provide additional landslide feature information that helps improve the accuracy of landslide recognition; in effect, these influencing factors increase the number of landslide features being extracted. Taking the RSIs + LIFs-based dataset as an example, the landslide is controlled by spectral, elevation, slope, and aspect factors.
Our analysis showed that the best landslide recognition results were obtained using the RSIs + LIFs dataset. Therefore, we will only use the RSIs + LIFs dataset as inputs for all subsequent experiments.
4.2. Performance Analysis of Different Model Types
To illustrate the performance of the Transformer network in landslide recognition, we compare it with a traditional CNN, ResNet [13]. All experimental hyperparameters, training data, and other variables are kept consistent. ResNet is a CNN, and a CNN is a locally connected network. The Transformer network, in contrast, introduces an attention mechanism that can associate information at different locations in the input sequence. The results are shown in Figure 8.
Figure 8a shows the landslide recognition results of the ResNet model, indicating that large landslides have been recognized. However, there are omissions, misdetection, and noise phenomena in the landslide map, and small landslides are not recognized adequately. In addition, the recognized landslide area shows significant inconsistencies with the actual extent.
Figure 8d shows the results of landslide recognition using the DST model, which are the closest to the actual landslide range. The boundaries predicted by the model almost perfectly match the actual landslide area, demonstrating its high accuracy in landslide recognition.
Table 5 presents the quantitative evaluation of the two models. DST achieved outstanding performance, with the highest OA, Precision, Recall, F1, and Kappa of 98.1717%, 98.1672%, 98.1667%, 98.1615%, and 0.9766, respectively. As can be seen from
Table 5, the overall accuracy of the DST is 6% higher than that of the ResNet in the study area of Zigui. The attention mechanism is introduced into the transformer network. Self-attention can produce more interpretable models. The individual attention heads can learn to perform different tasks and thus learn more landslide features. The comparison between DST and ResNet shows that the multiple attention heads of the transformer network structure can help the landslide recognition network learn more landslide features to a certain extent and finally help the model get the best landslide recognition results.
To illustrate the performance of DST in landslide recognition, two widely used Transformer networks were compared: Swin [
26] and DeiT [
27]. All experimental hyperparameters, training datasets, and other variables were consistent. Swin uses a new hierarchical approach called “local attention windows” to reduce computation and memory consumption while maintaining accuracy. DeiT uses knowledge distillation to reduce the number of model parameters, which in turn increases execution speed. The results are shown in
Figure 8.
The results of the Swin model are shown in
Figure 8b. Compared with the DeiT recognition result, the landslide recognition ability is significantly improved, the landslide boundary is closer to the actual landslide extent, the landslide map is smooth, and the noise phenomenon is significantly reduced.
Figure 8c shows the recognition results of the DeiT model. It can be seen that, overall, most of the landslides are recognized. The recognized landslide areas largely match the actual extent, and the landslide map has noise phenomena.
Figure 8d shows the results of the DST model. The recognition results are closest to the actual landslide extent, the landslide map is free of noise, and the boundaries are almost perfectly matched.
The quantitative evaluation of the three transformer models is shown in
Table 5. DST achieved the highest OA, Precision, Recall, F1, and Kappa, 98.1717%, 98.1672%, 98.1667%, 98.1615%, and 0.9766, respectively. DST is a Swin-Transformer-based knowledge distillation model that we proposed. As can be seen from
Table 5, the overall accuracy of DST is 0.5% and 7% higher than that of Swin and DeiT, respectively. The network inherits the local attention window and hierarchical structure proposed by the Swin-Transformer network, which enable the model to handle multi-scale inputs and capture different features at different levels. The comparison between DST and DeiT shows that the improvement of the network structure can help the landslide recognition network recognize more landslide features to a certain extent and then recognize more landslides to get the best results. The comparison between DST and Swin shows that the OA, Precision, Recall, F1, and Kappa of the two networks are similar because they have the same feature recognition structure.
4.3. Performance Analysis of Different Model Efficiency
Table 6 provides a quantitative evaluation of the four models based on efficiency metrics: the number of model parameters, FLOPs, average iteration time, and model running time. Our proposed DST model has the largest number of parameters among the four, with 27.6 M, surpassing ResNet, Swin, and DeiT by 4 M, 0.076 M, and 4.7 M, respectively. The larger number of parameters implies a more complex structure that can capture intricate relationships between input features, which contributes to improved accuracy and performance in landslide recognition.
DST has the lowest number of FLOPs, which is crucial for improving computational efficiency, especially in landslide recognition applications after geological disasters. Despite the large number of parameters, our model requires only 2.83 GFLOPs, which is the lowest among the four models and is 1.8242 GFLOPs, 1.741 GFLOPs, and 2.0284 GFLOPs less than ResNet, Swin, and DeiT, respectively.
In terms of the average iteration time, DST has an iteration time of 0.0981 s, which is similar to ResNet50’s iteration time of 0.0959 s. This means that DST can be trained faster than other models with a similar number of parameters. In addition, DST uses distillation optimization techniques that help reduce the computational load during training, which contributes to its relatively fast average iteration time.
Finally, DST has a relatively fast model running time, which makes it well-suited for real-time landslide recognition applications. In summary, our proposed landslide recognition model has several advantages over the other three models, including a high number of model parameters, the lowest number of FLOPs, the second-fastest average iteration time, and the third-fastest model running time. These advantages make it suitable for applications where timely landslide recognition is critical, providing higher accuracy, faster speed, and more efficient learning.
6. Conclusions
Geological disasters are common in China, with landslides being the most common. Unfortunately, the lack of information on the location, magnitude, and distribution of landslides often hampers post-disaster emergency response and rescue efforts. To address the problems of low data processing efficiency and long model processing times in current deep learning research, this paper proposes a knowledge distillation network based on Swin-Transformer. This approach aims to improve model processing efficiency and speed by reducing model parameters while maintaining the accuracy of landslide recognition. The study used nine LIFs and RSIs to construct a landslide inventory dataset, which significantly improved the recognition performance and accuracy of the model, resulting in improved discrimination of landslides from other ground features.
The proposed DST model, which uses the Swin-Transformer as its backbone, outperformed other landslide recognition networks in terms of running speed while maintaining recognition accuracy. The test results based on the RSIs + LIFs dataset showed that the proposed DST model achieved the highest OA, Precision, Recall, F1, and Kappa, reaching 98.1717%, 98.1672%, 98.1667%, 98.1615%, and 0.9766, respectively. These results demonstrate the importance of landslide recognition methods and the promising potential of deep learning and multi-feature remote sensing data in landslide recognition.