1. Introduction
Traditional Chinese settlements (TCSs) are the products of agricultural societies [1,2]. Because productivity was limited, favorable and unfavorable natural conditions played a decisive role in agricultural production and daily life [3,4]. Making rational use of natural resources and topographic advantages while avoiding hazards was therefore the primary criterion for settlement site selection. The ancient Chinese were adept at choosing places of residence based on geography and natural resources, such as mountain shapes, riverbanks, forests, and pastures, to create suitable production and living spaces [5,6]. These patterns, which carry regional environmental characteristics, are regarded as settlement environmental patterns; they combine architecture and environment to realize harmonious coexistence between humans and nature. Recognizing environmental patterns helps modern urban and rural planners discover the traditional habitat wisdom embedded in TCSs and thereby guides new village planning. However, most TCSs are located in remote areas with limited access, which makes manual surveys difficult. It is therefore valuable to study methods that automatically recognize environmental patterns from the environmental data of TCSs.
Traditionally, research on environmental patterns primarily relies on phenomenal summaries and inductive descriptions, focusing on the internal and external conditions, such as geographical distribution, topographic conditions, and material culture, to classify TCSs in a qualitative manner [
7,
8,
9]. To improve quantification, researchers have conducted quantitative comparative studies of TCSs using statistical learning methods [
10,
11,
12,
13]. In recent years, statistical machine learning has achieved great success in many fields, such as agriculture [
14], forestry [
15], and climatology [
16] by combining the experience, prior knowledge, and conceptual understanding of human experts, which inspires experts in traditional settlement studies to explore digital transformation. Consequently, scholars attempt to use the analytic hierarchy process to quantify the dominant elements of TCSs and then apply machine learning methods, such as SVM [
17], clustering [
17,
18], and random forest [
19], to compare and classify them systematically. Although these methods offer useful ideas for environmental pattern recognition in TCSs, they have two weaknesses. First, manually constructed features struggle to describe complex human-land relationships and depend on the choices of the algorithm designer. Second, the recognition rules depend on user-configured parameters, and setting these values to obtain accurate recognition is difficult, especially when multiple parameters are involved.
In recent years, high-resolution remote sensing images have enabled the rapid acquisition of spatial structures, morphological characteristics, and environmental information related to settlements [
20,
21,
22], which allows remote sensing image scene classification methods to provide a new way of automatically recognizing TCSs. Among these methods, deep neural networks can automatically learn global features from the input data and treat scene classification as an end-to-end problem [
23,
24,
25]. The features learned by deep learning methods are often better and simpler than those designed manually. However, deep learning has an obvious drawback: high classification performance presupposes sufficient labeled training samples, which are often difficult to obtain. In most cases, only a small number of TCS samples are available, and the classic classification networks [
26,
27,
28,
29] perform poorly when they encounter few samples. In addition, TCSs are formed spontaneously during the long-term interaction between humans and nature and have distinctive regional characteristics, which makes it difficult for trained models to effectively recognize the patterns of TCSs in unknown areas.
To address these limitations, we investigate the application of deep learning techniques in the environmental pattern recognition task of TCSs with the workflow shown in
Figure 1. First, we download remote sensing images and digital elevation model (DEM) data of TCSs from three areas of China with different topography. A DEM is a gridded representation of the Earth's surface morphology in which each cell stores the mean altitude of the sampled area as a discretized elevation value; it effectively supplements the elevation information that is missing from remote sensing images. After preprocessing, we build a new region-based TCS dataset that includes 648 TCSs with five environmental patterns, such as river valley, foothill, and riverine. Second, three representative convolutional neural networks (CNNs), AlexNet [
26], ResNet [
27], and DenseNet [
28], are used to benchmark the new dataset. The three CNNs achieve near-perfect performance on the training set but perform poorly on the test set. Because samples are sparse and the training and test sets come from different areas, the CNNs overfit severely. To solve this problem, we propose a new deep learning method that introduces pre-segmentation and metric-based meta-learning techniques to CNNs. Specifically, a semantic segmentation model segments the input remote sensing images and DEM data into settlement environment maps composed of seven elements, including mountains, water, forests, and farmland. Subsequently, the environmental pattern recognition of TCSs in unknown areas is treated as a few-shot classification problem [
29,
30], where areas with a large number of samples serve as the base dataset for training the model, and areas with only a small number of samples serve as the novel dataset, enabling few-shot recognition based on the similarity between support and query samples. Finally, we perform model training, evaluation, and comparative experiments to demonstrate the effectiveness of the proposed method. This work provides a new way to study and recognize the environmental patterns of TCSs. The primary contributions are as follows:
The environmental pattern recognition of TCSs is formalized as an image processing task, addressed by a deep learning model trained with remote sensing images and DEM data. More specifically, these two types of data are combined into four-channel inputs to extract environmental features and perform automatic recognition using CNNs.
A semantic segmentation model is used to segment the input data into settlement environment maps consisting of dominant elements, which helps to fuse expert prior knowledge to reduce the influence of noise in remote sensing images and improve interpretability.
A metric-based meta-learning method incorporating pre-segmentation strategies is proposed to achieve the few-shot recognition of environmental patterns under conditions of sample scarcity and geographical differences in TCSs.
The rest of this paper is organized as follows. In
Section 2, we review the related work on TCS classification research and introduce our idea.
Section 3 describes the construction of the new dataset.
Section 4 introduces the proposed method.
Section 5 contains experiments and analysis, and the last section concludes.
3. Dataset Construction
In this section, we construct a labeled TCS environmental pattern dataset for the training and validation of the deep learning models. First, remote sensing images and DEM data of TCSs are collected from three areas with different topographic conditions in China. Next, the collected data are preprocessed to generate usable training and testing data.
3.1. Data Collection
TCSs are formed spontaneously over the course of long-term interactions between humans and nature, and they demonstrate distinct geographical characteristics [
50]. According to the Ministry of Housing and Urban-Rural Development of China, there are currently 6821 recorded TCSs throughout the nation. Information on them has been introduced on the Traditional Chinese Settlements Digital Museum (
http://www.dmctv.cn/directories.aspx) (accessed on 20 February 2023). For the purpose of automatic recognition of environmental patterns using deep learning techniques, data from three different areas with a variety of landforms have been collected. The first area is the Qiandongnan Miao and Dong Autonomous Prefecture (Qiandongnan,
Figure 2b, 107°17′ E–109°35′ E, 25°19′ N–27°31′ N) on the Yunnan-Kweichow Plateau, with an area of 30,282 km². This region is home to many well-preserved TCSs. The second area is Shaanxi Province (Figure 2c, 105°29′ E–111°15′ E, 31°42′ N–39°35′ N), situated in the Chinese hinterland. It encompasses an area of 205,624 km² and is renowned for its diverse range of landforms, including loess tablelands, mountains, plains, and basins. The third area is Anhui Province (Figure 2d, 114°54′ E–119°37′ E, 29°41′ N–34°38′ N), located in the Yangtze River Delta region of East China, which covers an area of 140,100 km² and consists of various plains, hills, rivers, and lakes.
Next, we investigate five environmental patterns of TCSs in the above areas, including river valley, foothill, riverine, hillside, and plain. Among them, the settlements with river valley patterns usually occupy a valley, are surrounded by mountains, and are often located near rivers. The settlements with foothill patterns tend to be located on flat slopes, along ridges, or at the base of hills. Riverine settlements are typically situated near rivers or dispersed along these waterways in a band. Hillside patterns of settlements often exist on the sides of mountains, characterized by a steep drop-off between peaks and deep ravines. The settlement planning boundary is broken and follows the terrain in multiple directions. Finally, plain settlements are usually situated on flat plains in patches, and they often display higher densities and structural forms.
To acquire the settlements and their environmental data, Pleiades satellite images from Airbus Defence and Space (with a spatial resolution of 0.5 m) are selected. The remote sensing images of TCSs with the five environmental patterns in the above areas are cropped to 2560 × 2560 pixels, with the settlement located at the center. The corresponding DEM data, with a 10 m spatial resolution, are also collected based on latitude and longitude. In total, 648 TCS samples are collected, each consisting of a remote sensing image paired with its DEM data, as shown in
Table 1. The number of TCSs of each type varies greatly, and even settlements of the same type differ significantly across regions. An example of this difference can be seen in the hillside pattern settlements displayed in
Figure 3.
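The resolutions quoted above imply a simple geometric relationship between the two data sources: a 2560 × 2560 crop at 0.5 m covers 1280 m per side, which corresponds to 128 × 128 DEM cells at 10 m, so each DEM cell spans 20 × 20 image pixels. The following is a minimal sketch of that arithmetic and of one possible way to bring the DEM onto the image grid. The resampling method is not stated in this paper; nearest-neighbour upsampling and the helper names `footprint_m` and `upsample_nearest` are assumptions made for illustration.

```python
def footprint_m(pixels: int, resolution_m: float) -> float:
    """Ground distance covered by one side of a square crop."""
    return pixels * resolution_m


def upsample_nearest(grid, factor):
    """Repeat every cell `factor` times along both axes (nearest neighbour)."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


IMAGE_PIXELS, IMAGE_RES_M, DEM_RES_M = 2560, 0.5, 10.0

side_m = footprint_m(IMAGE_PIXELS, IMAGE_RES_M)   # ground extent of one crop side
dem_cells = int(side_m / DEM_RES_M)               # DEM cells per crop side
factor = int(DEM_RES_M / IMAGE_RES_M)             # image pixels per DEM cell side

# Toy 2 x 2 elevation grid (metres), upsampled by a factor of 2 for illustration.
toy = upsample_nearest([[100.0, 110.0], [120.0, 130.0]], 2)
```

In practice a smoother interpolation (for example bilinear) may be preferable for terrain; nearest neighbour is used here only because it keeps the sketch short.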
3.2. Data Preprocessing
After data collection, a dataset of TCS environment patterns was constructed for training deep learning models. This dataset consists of two parts: one is the environmental pattern classification labels for training a classification model, and the other is the semantic segmentation labels for training a segmentation model.
For the semantic segmentation labels, research has identified natural environmental elements such as mountains, rivers, forests, vegetation, and farmland, in addition to their spatial relationships with settlements, as the main components of TCS environment patterns [
51,
52]. Mountains are especially important, having a significant influence on the characteristics of a settlement, such as production methods and factor safety [
53,
54,
55]. Rivers and lakes provide essential resources such as drinking water and transportation [
56,
57], while forests offer materials for settlement development and maintain comfortable microclimates [
58]. The distribution of agricultural land dictates the productive and social properties of a settlement [
59,
60], and vegetation enhances soil quality, water conservation, and disaster prevention [
61]. All these factors combine to form a relationship pattern between the settlement and its environment [
62].
Based on this research, seven categories that represent environmental features have been identified. These categories and their contents for semantic segmentation are outlined in
Table 2. Remote sensing images of TCSs are then semantically annotated according to the definitions given in
Table 2. Where two elements overlap, such as mountains and water, the element with the greater influence on the settlement (here, water) takes precedence. The result is therefore not a land cover classification but a description of the human–land relationship in TCSs expressed as a settlement environment map. Five samples of TCSs with different environmental patterns and their corresponding settlement environment maps are shown in
Figure 4.
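The precedence rule used during annotation can be expressed as a priority lookup. The sketch below is illustrative only: the text states just one ordering (water over mountains), so the remaining ranks, the class names, and the helpers `resolve_pixel` and `resolve_map` are hypothetical placeholders rather than the categories of Table 2.

```python
# Hypothetical priority order: a larger number wins when two elements overlap.
# Only "water takes precedence over mountains" is stated in the text; the
# remaining ranks and the class names are placeholders for illustration.
PRIORITY = {
    "background": 0,
    "mountain": 1,
    "forest": 2,
    "farmland": 3,
    "water": 4,
    "settlement": 5,
}


def resolve_pixel(candidates):
    """Return the single label kept for a pixel claimed by several elements."""
    if not candidates:
        return "background"
    return max(candidates, key=lambda name: PRIORITY[name])


def resolve_map(candidate_map):
    """Apply the precedence rule to every pixel of a 2-D grid of label sets."""
    return [[resolve_pixel(cell) for cell in row] for row in candidate_map]


overlaps = [
    [{"mountain"}, {"mountain", "water"}],
    [{"forest", "mountain"}, set()],
]
resolved = resolve_map(overlaps)
```

Encoding the rule as a table keeps the annotation policy explicit and makes it easy to audit when two annotators disagree about an overlapping region.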
Finally, the dataset is split into training, validation, and testing sets. The training and validation sets come from Qiandongnan and are randomly split by a ratio of 4:1. The test sets come from the Shaanxi and Anhui provinces. The reason for this dataset partition approach is that TCSs are widely distributed and have regional differences. It is not practical to train a model for each region, and collecting all the TCSs’ data to build a massive dataset would require a significant amount of manpower. Therefore, we used two different regions to construct the test set to verify the model’s generalization ability and performance. The details of the dataset division are shown in
Table 3.
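The region-based partition described above can be sketched as follows. This is a minimal illustration, assuming each sample is a record with a `region` field; the record layout, the seed, and the function name `split_by_region` are assumptions, and the toy counts are not those of Table 3.

```python
import random


def split_by_region(samples, train_region, val_fraction=0.2, seed=0):
    """Split samples into train/val (from one region) and per-region test sets."""
    base = [s for s in samples if s["region"] == train_region]
    rng = random.Random(seed)
    rng.shuffle(base)
    n_val = round(len(base) * val_fraction)   # 4:1 split -> one fifth for validation
    val, train = base[:n_val], base[n_val:]
    tests = {}
    for s in samples:
        if s["region"] != train_region:
            tests.setdefault(s["region"], []).append(s)
    return train, val, tests


# Toy records; region names follow the text, the counts are made up.
samples = (
    [{"id": i, "region": "Qiandongnan"} for i in range(10)]
    + [{"id": 100 + i, "region": "Shaanxi"} for i in range(3)]
    + [{"id": 200 + i, "region": "Anhui"} for i in range(2)]
)
train, val, tests = split_by_region(samples, "Qiandongnan")
```

Keeping each test region in its own list makes it straightforward to report accuracy separately per region, which is what exposes the regional generalization gap.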
4. Proposed Methods
In this paper, we propose a metric-based meta-learning method for the few-shot recognition of environmental patterns in TCSs. We outline the proposed framework, which consists of four stages, as illustrated in
Figure 5. First, a semantic segmentation model is trained using a cross-entropy (CE) loss function to extract settlement environment maps, consisting of environmental elements, from TCS remote sensing images and DEM data. Second, a CNN is trained on all base categories, and its final fully connected (FC) layer is removed to obtain a feature extractor, represented by $f_\theta$. Third, in the meta-training stage, a meta-classifier built on $f_\theta$ is trained on multiple episodes, each containing a support set, $S$, and a query set, $Q$. Within each episode, query features are compared with the class means of the support features using a scaled cosine similarity. During training, each minibatch contains several tasks, and their average loss is calculated. Finally, the classifier is evaluated on episodes drawn from the test set during the meta-testing stage. In the following sections, the segmentation model and the meta-classification model are discussed in detail.
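To make the notion of an episode concrete, the sketch below draws one task from a labeled pool. It is a minimal illustration under stated assumptions: the function name `sample_episode`, the dictionary layout, and the reading that the five query samples used in the experiments are drawn per class are assumptions, not details confirmed by the text.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode as (support, query) lists of (item, label)."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never share an item.
        items = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(item, label) for item in items[:k_shot]]
        query += [(item, label) for item in items[k_shot:]]
    return support, query


# Toy base set: five pattern names from the text, each with ten placeholder ids.
patterns = ["river valley", "foothill", "riverine", "hillside", "plain"]
base = {name: [f"{name}-{i}" for i in range(10)] for name in patterns}

rng = random.Random(0)
support, query = sample_episode(base, n_way=3, k_shot=1, n_query=5, rng=rng)
```

Averaging the loss over several such episodes per minibatch, as described above, reduces the variance that comes from any single random draw of classes and samples.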
4.1. Pre-Segmentation Model
The purpose of the pre-segmentation model is to extract the dominant environmental elements of a TCS from remote sensing images and DEM data to get the settlement environment map. In this regard, the DeepLab V3+ model [
63] is modified to accept four-channel inputs, as presented in
Figure 6. The kernel sizes, strides, convolution layers, pooling layers, deconvolution layers, and activation functions are the same as those proposed by Chen et al.; only the input layer is adapted to accept four-channel data obtained by concatenating the remote sensing image and the DEM data.
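The four-channel input can be illustrated with plain nested lists. The sketch below assumes the DEM has already been resampled to the image grid, that elevation is min-max scaled to [0, 1], and that channels are stored last; none of these choices is specified in the text, and `normalise` and `stack_rgb_dem` are hypothetical helper names.

```python
def normalise(grid):
    """Min-max scale a 2-D grid of numbers to the range [0, 1]."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0          # avoid division by zero on flat terrain
    return [[(v - lo) / span for v in row] for row in grid]


def stack_rgb_dem(rgb, dem):
    """Concatenate an H x W x 3 image and an H x W DEM into H x W x 4."""
    if len(rgb) != len(dem) or any(len(r) != len(d) for r, d in zip(rgb, dem)):
        raise ValueError("image and DEM must share the same height and width")
    dem01 = normalise(dem)
    return [
        [list(pixel) + [height] for pixel, height in zip(rgb_row, dem_row)]
        for rgb_row, dem_row in zip(rgb, dem01)
    ]


# Toy 1 x 2 image with RGB values already scaled to [0, 1].
rgb = [[(0.2, 0.4, 0.6), (0.1, 0.1, 0.1)]]
dem = [[500.0, 700.0]]
x = stack_rgb_dem(rgb, dem)
```

On the network side, the only structural change this requires is a first convolution whose input depth is four instead of three.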
4.2. Meta-Classification Model
In few-shot settings, the environmental pattern recognition problem of TCSs can be considered a set of tasks, each containing $N$ classes with $K$ support samples and $q$ query samples in each class. These tasks are called $N$-way $K$-shot tasks; the labeled TCS samples are called the support set, and the TCS samples to be recognized are called the query set. The goal is to recognize the environmental patterns of the unlabeled query samples using only the labeled support samples. To solve this problem, a meta-classification model with three distinct stages is designed.
The first stage trains the feature extractor: a CNN is trained on all base categories, and its final fully connected layer is then removed to obtain $f_\theta$. The second stage is the meta-learning stage, which is the main component of the model and aims to improve generalization. Here, many episodes are drawn from the base dataset, each consisting of $K$ support and $q$ query samples randomly selected from each of $N$ categories. This makes a total of $N(K+q)$ samples per episode, and the parameters of the classifier are shared over all episodes. Given a few-shot task with support set $S$, let $S_c$ denote the few-shot samples in class $c$ and compute a mean embedding $w_c$ as the centroid of class $c$ as follows:
\[ w_c = \frac{1}{|S_c|} \sum_{x \in S_c} f_\theta(x) \]
For a query sample $x$ in the few-shot task, the probability that $x$ belongs to class $c$ is predicted based on the distance between the embedding of $x$ and the centroid $w_c$ of class $c$. Here, cosine similarity is used as the metric, and a learnable scale parameter $\tau$ is added to adjust the original value range of the cosine similarity, $[-1, 1]$; $\tau$ is initialized to 10. With $\langle \cdot, \cdot \rangle$ denoting the cosine similarity of two vectors, the prediction can be formalized as follows:
\[ p(y = c \mid x) = \frac{\exp\left(\tau \cdot \langle f_\theta(x), w_c \rangle\right)}{\sum_{c'} \exp\left(\tau \cdot \langle f_\theta(x), w_{c'} \rangle\right)} \]
The last stage is the meta-testing stage, where the generalization performance of the classifier is evaluated on a novel dataset, $D_{novel}$, constructed from TCS samples from different areas. In this stage, a new set of episodes is randomly drawn from $D_{novel}$, each consisting of a new support set $S'$ and a new query set $Q'$, and the classifier is used to predict the labels of the samples in $Q'$.
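The centroid and scaled cosine-similarity prediction described in this section can be sketched in a few lines. This is an illustration, not the authors' implementation: the embeddings are toy 2-D vectors standing in for feature-extractor outputs, the scale is held fixed at its stated initial value of 10 rather than learned, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroids(support):
    """Mean embedding per class from a list of (embedding, label) pairs."""
    grouped = {}
    for emb, label in support:
        grouped.setdefault(label, []).append(emb)
    return {
        label: [sum(col) / len(embs) for col in zip(*embs)]
        for label, embs in grouped.items()
    }


def predict_proba(query_emb, protos, tau=10.0):
    """Softmax over scaled cosine similarities to each class centroid."""
    logits = {c: tau * cosine(query_emb, w) for c, w in protos.items()}
    peak = max(logits.values())                      # for numerical stability
    exps = {c: math.exp(z - peak) for c, z in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Toy 2-D embeddings standing in for feature-extractor outputs.
support = [([1.0, 0.0], "plain"), ([0.8, 0.2], "plain"), ([0.0, 1.0], "hillside")]
protos = centroids(support)
probs = predict_proba([0.9, 0.1], protos)
```

Because cosine similarity is bounded, the unscaled logits would differ by at most 2, which caps how confident the softmax can become; multiplying by the scale widens that range, which is the motivation for the scale parameter.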
5. Experiments and Results
In this section, three representative CNNs are first benchmarked on the new TCS dataset to survey their recognition accuracies. Next, the proposed method is applied to obtain new models, and the performance of those models is compared with the baseline to observe the improvement in generalizability.
5.1. Results of the Baselines
To test the effect of training deep learning models on the TCS environmental pattern dataset, three CNNs are used as benchmarks, including AlexNet [
26], ResNet50 [
27], and DenseNet121 [
28]. All models are trained on the training set for 200 epochs with a batch size of 32, using the Adam optimizer with an initial learning rate of 0.001 that is multiplied by a decay factor of 0.1 at epochs 50 and 100. The input images are remote sensing images of TCSs, each resized to 256 × 256 pixels. Standard data augmentation techniques are applied, including random resizing, cropping, and rotation. Accuracy is reported as the mean per-class accuracy, that is, the average over categories of the ratio of correctly recognized samples in a category to all samples in that category. The results are shown in
Table 4, and the learning curves are shown in
Figure 7.
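Two details of this setup are easy to misread: the accuracy metric is averaged per class rather than per sample, and the learning rate drops in steps. The sketch below illustrates both under stated assumptions: the function names are hypothetical, and treating epoch indices as zero-based with the decay applied from the milestone epoch onward is an assumption.

```python
def mean_per_class_accuracy(y_true, y_pred):
    """Average, over classes, of the fraction of each class predicted correctly."""
    totals, hits = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def step_lr(epoch, base_lr=0.001, gamma=0.1, milestones=(50, 100)):
    """Learning rate after multiplying by `gamma` at each passed milestone."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed


y_true = ["plain", "plain", "plain", "plain", "hillside"]
y_pred = ["plain", "plain", "plain", "hillside", "hillside"]
acc = mean_per_class_accuracy(y_true, y_pred)   # (3/4 + 1/1) / 2
```

In this toy case the per-sample accuracy would be 4/5, whereas the per-class average is higher because the rare class is fully correct; on an unbalanced dataset the per-class average prevents the largest category from dominating the score.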
As can be seen from the learning curve in
Figure 7, all three CNNs suffer from severe overfitting. They achieve almost perfect performance on the training set but perform considerably worse on the validation set. For example, the ResNet50 model achieves 100% accuracy on the training set but only 63.75% on the validation set, and even lower accuracy on the two test sets, with an average of only 48.46%. In the loss curve of the ResNet50 model, the training loss keeps decreasing, whereas the validation loss oscillates and then keeps increasing. A similar overfitting phenomenon is observed for the AlexNet and DenseNet121 models. This indicates that overfitting is a significant problem when training neural networks on small, unbalanced datasets, particularly when the input data are complex.
5.2. Results of the Proposed Methods
To address the overfitting problem caused by sparse data, the CNNs are trained using the proposed method. The semantic segmentation model is first trained to extract the settlement environment map. Next, the three CNNs are adapted to accept single-channel maps and are named AlexNet-PS, ResNet50-PS, and DenseNet121-PS. Following the same configuration, we retrain the three adapted CNNs and obtain the test results in
Table 5 and
Figure 8.
As shown in
Table 5, all models improve in accuracy once the settlement environment map is used as input. ResNet50-PS achieves an average accuracy of 70.37%, an improvement of 21.91 percentage points over the baseline (48.46%). This supports the view that remote sensing images of TCSs contain noise that obscures the relationship between inputs and outputs, causing the models to memorize noise features instead.
The learning curves of the models are featured in
Figure 8. The curves show that overfitting is suppressed, yet a substantial gap remains between validation and test accuracy. For example, DenseNet121-PS reaches a maximum accuracy of 90% on the validation set but only 72.13% on the test set. This reflects the inconsistency between the data distributions of the training and test sets. Considering that TCSs differ geographically, we apply the meta-classification method, which maps the input data into feature vectors suitable for comparison, and construct a series of meta-tasks.
The resulting models, called AlexNet-PS-MC, ResNet50-PS-MC, and DenseNet121-PS-MC, are each trained with 3-way 1-shot and 3-way 5-shot tasks, each with five query samples. After 30 epochs, the model with the highest validation accuracy is selected for testing, and its accuracy is measured as the average over 200 tasks drawn from the test set. In addition, we build two state-of-the-art few-shot classification models, Meta-Baseline [
47] and Meta DeepBDC [
49], and adapt them to accept four-channel input data. Both models are trained using the same settings and compared with the proposed method. The recognition results for the Shaanxi and Anhui provinces are shown in
Table 6.
In
Table 6, as the number of samples in the support set increases, every model gains accuracy because a larger support set helps the model learn more general features. Compared with Meta-Baseline and Meta DeepBDC, the proposed method achieves the highest accuracy in both settings. This supports our conjecture that excessive noise in remote sensing images makes it difficult for the feature extractor to extract good features, and it reflects the importance of the pre-segmentation module, which helps the feature extraction module obtain better features by incorporating human prior information, thus improving classification performance. The recognition accuracy of the proposed method in the 1-shot case is already higher than that of the baselines, because classic classification networks contain fully connected layers with a strong fitting ability, and these layers overfit severely when samples are scarce. Even after the dominant elements of the input data are extracted, the recognition performance of the three CNNs in new areas remains low; removing their fully connected layers and applying meta-learning finally remedies the overfitting issue. The experiments show that the proposed method effectively improves the generalization capability and performance of deep neural networks in TCS environmental pattern recognition tasks.
5.3. Ablation Study
In this section, we conduct ablation studies to analyze how each component affects environmental pattern recognition performance. We study the following five components of our method: (a) effect of pre-segmentation; (b) effect of data augmentation; (c) effect of pre-training; (d) effect of meta-training; (e) effect of input size. The model used is DenseNet121-PS-MC, which has the highest recognition accuracy. The test set is Test2, from Anhui Province, which includes 146 samples and five environmental patterns.
Table 7 shows the results of the ablation study.
When the pre-segmentation module is removed, the classification performance of the model degrades significantly in both shot settings, and the accuracy decreases by 10.41% in the 3-way 1-shot case. Without the pre-segmentation module, background information in the remote sensing images misleads the model into focusing on irrelevant noise, which is especially noticeable on small datasets. Data augmentation improves the performance of the model, but only to a limited extent. Pre-training is introduced before meta-training to improve the feature representation ability of the model when samples are few, which gives the model a good initialization; removing pre-training therefore decreases recognition performance. When the meta-training stage is removed, the recognition accuracy decreases by 9.78% in the 3-way 1-shot case, because meta-training adjusts the scale parameter in the metric module and optimizes the feature extractor so that the model learns task-level distributions. In addition, the size of the input images affects performance, as scaling down the inputs loses information. With 128 × 128 inputs, the recognition accuracy decreases by 1.71% in the 3-way 1-shot case. With 512 × 512 inputs, the recognition accuracy improves by only 0.19%, while the number of input pixels quadruples, which substantially increases the computational cost. Therefore, we choose 256 × 256 as the final input size for the model.
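As a rough worked example of the input-size trade-off, assume that the cost of the convolutional layers scales with the number of input pixels (an approximation; the actual cost depends on the architecture and hardware). Relative to the 256 × 256 setting, the other two sizes then cost approximately:

```latex
\frac{128^2}{256^2} = \frac{16384}{65536} = \frac{1}{4},
\qquad
\frac{512^2}{256^2} = \frac{262144}{65536} = 4
```

Under this assumption, moving to 512 × 512 costs about four times as much for a 0.19% accuracy gain, while moving to 128 × 128 saves about three quarters of the cost for a 1.71% loss, which is consistent with choosing 256 × 256 as the compromise.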
6. Conclusions
This paper investigates the application of deep learning techniques to the automatic recognition of environmental patterns of TCSs. To represent the complex human–land relationships in TCSs, we formalize this task as an image processing problem and use deep learning methods to automatically extract environmental features from TCS remote sensing images and DEM data. Specifically, we construct a new labeled TCS environmental pattern dataset and benchmark it using several representative CNNs. To address the low recognition rate caused by sample scarcity and geographical differences, we use a semantic segmentation model to construct settlement environment maps informed by human prior information, and we employ a metric-based meta-learning method to perform few-shot recognition based on sample similarity. Extensive experiments verify the effectiveness of the proposed method. The constructed DenseNet121-PS-MC model achieves 90.46% recognition accuracy on the self-constructed TCS environmental pattern dataset, outperforming the other methods compared, and it effectively recognizes environmental patterns in different areas.
This study explores the use of intelligent methods to assist TCS surveys, providing an effective analytical tool for urban and rural planners. However, the proposed method still has some limitations; for example, only five representative environmental patterns are selected in the study, and it is not known whether other environmental patterns can be effectively recognized. For future work, we plan to expand our dataset to incorporate more areas of TCSs and additional environmental patterns. Meanwhile, the latest deep network structures will be investigated to further improve recognition accuracy. Finally, we are also interested in exploring unsupervised techniques to avoid the tedious task of manual data labeling.