1. Introduction
The escalating global phenomenon of climate change, driven by the warming of the Earth, has reached a level of utmost urgency. In its Second Working Group Report of the Sixth Assessment Report [1], the Intergovernmental Panel on Climate Change (IPCC) highlighted the profound impact of climate change on various human systems, including ecosystems, water security, food production, health and well-being, cities, residences, and infrastructure. According to the First Working Group Report, the global average temperature from 2011 to 2020 has risen by 1.09 °C compared to the pre-industrial era. Furthermore, the IPCC announced that even under scenarios with extremely low greenhouse gas emissions, such as achieving zero carbon dioxide (CO₂) emissions by around 2050 or later and subsequent negative emissions, the global temperature increase may reach 1.5 °C between 2021 and 2040. The report also indicated that extreme temperature events in terrestrial areas that previously occurred once every 10 years or once every 50 years are likely to become 4.1 times and 8.6 times more frequent, respectively, under 1.5 °C of warming. In addition to a projected 1.5-fold increase in decadal heavy rainfall events in terrestrial areas and a 2.0-fold increase in agricultural and ecological droughts in arid regions, severe snowstorms and super typhoons are anticipated to intensify further [2].
The First Working Group Report revealed a nearly linear relationship between cumulative CO₂ emissions and global warming. To limit the temperature increase beyond pre-industrial levels to 1.5 °C with a probability of 67% or higher, it was estimated that the remaining CO₂ emissions should not exceed 400 billion tons. The Third Working Group Report stated that in scenarios where global CO₂ emissions reach zero, approximately 74% of the required global emissions reduction would be achieved through reductions in CO₂ emissions from energy supply and demand. While renewable energy has emerged as a prominent solution, a combination of renewable energy sources and fossil fuel systems is still necessary to meet current energy demand. In light of this, the present study specifically focuses on carbon capture and storage (CCS) technology [3], assuming a high carbon capture rate of 90–95% from fossil fuel systems.
Achieving carbon neutrality requires a balance between emissions and removals of greenhouse gases [
4]. However, in many sectors, complete decarbonization is proving to be a challenging reality. A prime example of this challenge is the power generation sector. In this context, CCS technology plays an indispensable role in effectively reducing CO₂ emissions and achieving the goal of carbon neutrality. The automotive industry is also progressing towards decarbonization. The transition from internal combustion engines to electric motors has led to a reduction in CO₂ emissions. However, charging the batteries of electric vehicles requires a substantial amount of electricity, and relying solely on renewable energy sources to meet this demand presents a formidable challenge. Nuclear power as an alternative energy source remains a subject of debate, and its utilization presents significant challenges, including the management of nuclear waste, the threat of terrorism, and the need to learn from past nuclear power plant accidents while undertaking long and complex decommissioning processes. To achieve carbon neutrality, a diverse array of strategies and approaches is therefore imperative.
CCS refers to the collective techniques of capturing carbon dioxide emitted from factories, power plants, and other sources and storing it underground before its release into the atmosphere [
5]. The selection of storage locations is based on geological models derived from borehole surveys and probability statistics. However, the current statistical methods used to create geological models from the limited sampling information obtained through borehole drilling present challenges in terms of accuracy. By investigating the entire geological formation, it may be possible to construct a more precise and accurate geological model. In our research project [6], we specifically focus on outcrops, which are parts of geological formations exposed at the Earth’s surface rather than covered by surface soil and vegetation. By analyzing images of outcrops, we aim to identify optimal storage locations through the creation of high-precision geological models. Therefore, this study aims to explore the optimal backbone for semantic segmentation of outcrop images using deep learning techniques, taking into consideration both the latest advancements in the field and computational efficiency to minimize processing time.
In this study, the primary focus was on the examination of the outcrop shown in the photograph presented in
Figure 1. This paper presents our research efforts in developing a precise and efficient methodology for the classification of geological formations in outcrop images. Our approach leverages deep learning-based semantic segmentation techniques for this purpose, aiming to achieve accurate and reliable results. We investigate various backbone architectures to determine the most suitable approach for this task. By accurately characterizing geological formations, our proposed methodology can contribute to identifying optimal locations for CCS and promoting effective carbon sequestration, which is crucial for mitigating the impact of climate change.
2. Related Studies
The field of computer vision has extensively utilized segmentation techniques for pixel-wise object classification in images. Segmentation serves as a fundamental technology in various practical tasks, including autonomous driving and other computer vision applications [
7]. Until recently, Convolutional Neural Networks (CNNs) [
8] have been the predominant approach for segmentation tasks in computer vision, following the introduction of Fully Convolutional Networks (FCN) [
9] and their subsequent advancements. In 2020, Vision Transformer (ViT) [
10] was introduced as an architecture that adapted the successful Transformer model [
11] from natural language processing to image recognition. ViT surpassed the state-of-the-art BiT (Big Transfer) [
12] method in terms of accuracy, leading to a surge of research utilizing ViT in the field of image segmentation.
As a transformer-based architecture for image recognition, ViT builds upon the Transformer model that revolutionized natural language processing tasks. By omitting convolutional operations, ViT achieves improved computational efficiency and scalability. Unlike CNNs, Transformers lack the inductive bias of convolutional layers, which treats spatially proximate information as related, and therefore require a large amount of training data to generalize. Geirhos et al. [13] showed that the classification behavior of CNNs, which prioritizes textures over object shapes, differs from human perception. Conversely, Tuli et al. [14] revealed that ViT’s classification behavior is biased towards object shapes and aligns more closely with human perception.
In recent years, there has been a growing interest in investigating not only CNNs and ViTs as individual backbone architectures but also hybrid backbone architectures that combine both approaches. Moreover, there has been a resurgence of interest in employing a simplistic backbone architecture that solely comprises Multi-Layer Perceptron (MLP) networks [
15]. MLPs are feedforward neural networks composed of multiple layers of interconnected perceptrons, each consisting of weighted inputs, an activation function, and a bias term. They can be viewed as a deep extension of traditional neural network architectures, where depth refers to the number of stacked layers. By stacking multiple layers, MLPs capture hierarchical representations of the input data, enabling them to learn intricate and abstract features and to model complex relationships and patterns. While MLP-based backbones demonstrate performance comparable to BiT and ViT in classification tasks, their applicability to segmentation tasks has not been fully explored.
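As a minimal illustration of this layered structure, the following PyTorch sketch builds a small MLP; the layer sizes and activation are arbitrary choices for illustration rather than the configuration of any specific MLP-based backbone.

```python
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    """A minimal multi-layer perceptron: stacked linear layers with nonlinearities."""
    def __init__(self, in_dim=256, hidden_dim=512, out_dim=10, depth=3):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (depth - 1) + [out_dim]
        layers = []
        for i in range(depth):
            layers.append(nn.Linear(dims[i], dims[i + 1]))  # weighted inputs + bias term
            if i < depth - 1:
                layers.append(nn.GELU())  # activation function between layers
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Example: classify 256-dimensional feature vectors into 10 classes.
features = torch.randn(8, 256)
logits = SimpleMLP()(features)  # shape: (8, 10)
```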
Traditionally, high-precision 3D models of geological strata were created using ground-based laser scanning methods [
16]. However, this approach has limitations such as the weight of surveying equipment, the need for scans from multiple field-based positions, and the time-consuming nature of data acquisition. Consequently, the modeling process presents significant challenges in regions where conducting in situ measurements and data collection is impractical or entails risks. In such cases, drones equipped with cameras are being utilized. Researchers such as Corradetti et al. [
17] employed drones to capture photographs of cliffs composed of nearly vertical outcrops, creating 3D models that were used for crack analysis and understanding crack propagation patterns. Similarly, Sharad et al. [
18] used drones to capture high-resolution images of complex and hazardous landslides, generating cm-level accuracy 3D models. Javier et al. [
19] employed drones to create highly accurate and high-resolution 3D models for identifying and interpreting ancient Roman gold mining sites in Northwestern Spain, revealing areas such as excavation sites, canals, reservoirs, and drainage channels. Mirkes et al. [
20] proposed a semantic segmentation method for rock outcrops that leads to the detection and segmentation of various geometric features, including fractures, faults, and sedimentary layers. Zhang et al. [
21] state that most existing semantic segmentation methods are based on FCNs, which replace the fully connected layer with fully convolutional layers for pixel-level prediction. Malik et al. [
22] proposed a segmentation method using a model that combines U-Net [
23] and LinkNet [
24] to classify three classes: background, sandstone, and mudstone. They conducted an evaluation experiment on a self-collected dataset of 102 images from a field in Brunei Darussalam, demonstrating higher accuracy compared to conventional methods. However, their proposed method and the comparison methods were based on conventional CNN-based backbones, without considering recent advancements in deep learning. Vasuki et al. [
25] proposed an interactive segmentation method primarily using edge features extracted from rock images obtained using a drone. Focusing on superpixels as the minimum resolution unit, they emphasized geological analysis of the sensed imagery. However, their segmentation method relied on conventional image processing for feature extraction and did not incorporate learning-based techniques, which might not provide sufficient accuracy and generalizability for this application domain.
Although research utilizing drones in geosciences has gained momentum [
26], most studies focus on analyzing topography using 3D models generated from captured photographs. However, there are a limited number of reported studies [
22,
25] that apply segmentation-based approaches to classify lithostratigraphy in geological outcrops. This paper emphasizes the mounting interest in employing CNN, ViT, and hybrid backbone architectures for segmentation tasks, while also acknowledging the expanding utilization of drones in geological studies. Furthermore, it identifies a research gap concerning the application of segmentation-based approaches to the lithostratigraphic classification of geological outcrops.
3. OutcropHyBNet
We propose a novel approach named OutcropHyBNet, which combines a state-of-the-art CNN architecture, DeepLabv3+ [
27], and a transformer-based vision model, SegFormer, to tackle the task of stratum semantic segmentation in outcrop images. The overall architecture of our proposed method is illustrated in
Figure 2. OutcropHyBNet leverages the robust segmentation capabilities of DeepLabv3+ and the expressive power of SegFormer [
28] as the backbone networks for accurate and efficient stratum segmentation. To enhance the diversity of training data, we employ Only Adversarial Supervision for Semantic Image Synthesis (OASIS) [
29] in image synthesis. During the segmentation training process, our dataset includes both original outcrop images and synthetic images generated using OASIS. OASIS utilizes the power of generative models to produce synthetic outcrop images whose characteristics closely resemble those observed in real-world data. By incorporating OASIS-generated images into the training dataset, we expand the available data and improve the model’s capability to handle various outcrop images.
The OutcropHyBNet architecture is designed to harness the power of CNN and ViT backbones for accurate semantic segmentation of outcrop images. The input images are processed through both backbones, allowing for efficient feature extraction and comprehensive contextual understanding. The extracted features are further processed by additional layers to perform pixel-wise classification, resulting in the generation of high-quality segmentation maps. As the baseline model for OutcropHyBNet, we integrate DeepLabv3+ and SegFormer into the architecture. Herein, SegFormer is one of the state-of-the-art semantic segmentation models that adopts a transformer-based architecture [
30]. By leveraging the capabilities of SegFormer, we aim to improve the accuracy and performance of outcrop image segmentation in our proposed method. By contrast, DeepLabv3+ is a lightweight model that exhibits superiority in stuff classification. Although the ViT has gained significant attention in the field of computer vision, CNNs still demonstrate strong potential in segmentation tasks, particularly in areas involving texture and stuff [
31]. Owing to this mechanism, OutcropHyBNet can flexibly utilize either backbone depending on the segmentation target.
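The dual-backbone routing described above can be summarized by the following schematic sketch; the class and method names are our own illustration (not the released OutcropHyBNet code), and the two wrapped models are assumed to map an image tensor to per-pixel class logits of the same shape.

```python
import torch.nn as nn

class HybridSegmenter(nn.Module):
    """Schematic wrapper holding a CNN-based and a ViT-based segmentation model."""
    def __init__(self, cnn_model, vit_model):
        super().__init__()
        self.cnn_model = cnn_model   # e.g., DeepLabv3+: stronger on texture-driven "stuff" regions
        self.vit_model = vit_model   # e.g., SegFormer: stronger on shape-driven "thing" regions

    def forward(self, image, target_type="stuff"):
        # Route the input to the backbone best suited to the segmentation target.
        if target_type == "stuff":
            return self.cnn_model(image)
        return self.vit_model(image)
```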
3.1. Semantic Image Synthesis
3.1.1. Data Augmentation with GANs
Generative Adversarial Networks (GANs) [
32] are a generative model based on adversarial training without extensively annotated training data [
33]. GANs offer a technique for generating realistic data, such as images, from random noise. In our previous study [
34], we demonstrated the power and effectiveness of image synthesis for semantic segmentation applications in agriculture.
The network architecture of GANs consists of two main components: a generator G and a discriminator D. G is responsible for generating synthetic images, while D distinguishes between real images from a dataset and fake images generated by G. G aims to deceive D by generating images that closely resemble real ones, while D strives to accurately classify the input images as real or fake. Both networks are trained adversarially and simultaneously. The training process iteratively updates the networks toward a dynamic equilibrium, in which G becomes increasingly proficient at generating realistic images and D becomes increasingly adept at discriminating between real and fake images. G receives random noise as input and transforms it into synthesized images. D, on the other hand, receives either real images from a dataset or images generated by G as input and outputs a probability score indicating the likelihood of the input being real.
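The adversarial training just described can be sketched as follows. This is a generic, non-conditional GAN loop with the standard binary cross-entropy objective, not the specific OASIS training code; it assumes G maps a noise vector to an image and D outputs a single real/fake logit per image.

```python
import torch
import torch.nn as nn

def train_gan_step(G, D, real_images, opt_G, opt_D, noise_dim=128):
    """One adversarial update for a generic GAN (binary real/fake discriminator)."""
    bce = nn.BCEWithLogitsLoss()
    batch = real_images.size(0)
    device = real_images.device
    noise = torch.randn(batch, noise_dim, device=device)

    # Discriminator step: classify real images as 1 and generated images as 0.
    fake_images = G(noise).detach()
    loss_D = bce(D(real_images), torch.ones(batch, 1, device=device)) + \
             bce(D(fake_images), torch.zeros(batch, 1, device=device))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: fool the discriminator into labeling fakes as real.
    fake_images = G(noise)
    loss_G = bce(D(fake_images), torch.ones(batch, 1, device=device))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```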
By optimizing the respective objectives of
G and
D through backpropagation and gradient descent, GANs learn to generate high-quality synthetic data that closely resembles the real data distribution. Since their introduction, GANs have undergone significant improvements and spawned various derivative models. These improvements have expanded the capabilities of GANs and paved the way for extensive research in the realm of semantic image synthesis. In this study, we introduce OASIS [
29], a novel generative model based on GANs, which harnesses the power of the adversarial training paradigm to synthesize images with desired semantic content.
3.1.2. OASIS
In recent years, research on data generation has gained significant momentum, driven by the introduction of diffusion models (DMs) [
35]. Although DMs have demonstrated their efficacy in various vision tasks [
36], they often require substantial computational resources and impose a heavy memory burden. In this study, prioritizing ease of implementation and computational efficiency, we have selected OASIS [
29] as our model of choice, which is based on the GAN framework.
To generate high-quality images that align with the input semantic label map,
G requires
D, which can effectively capture semantic features at various resolutions. In the OASIS framework, the role of
D is structured as a multi-class segmentation task. The architecture adopted in
D is an encoder-decoder network, specifically based on the U-Net [
23] with skip connections. The segmentation task for
D aims to predict per-pixel class labels for real images, considering the given semantic label map as the ground truth. In addition to the
N semantic classes obtained from the label map, all pixels of the synthesized images are classified as an additional (fake) class. Therefore, the formulated segmentation task involves N + 1 classes, and OASIS employs a cross-entropy loss over these N + 1 classes for training.
As the segmentation task deals with class imbalance due to varying class frequencies, the performance may be hindered. To mitigate this issue, OASIS leverages pixel-level loss calculation in D. Specifically, each class c is assigned a weight α_c that is inversely proportional to its per-pixel frequency of occurrence within a batch, so that classes with lower frequencies receive higher weights. As a result, the contribution of each class to the loss is normalized, leading to improved accuracy for classes with low occurrence. The loss $\mathcal{L}_D$ of the updated D is formulated as follows:
$$
\mathcal{L}_D = -\,\mathbb{E}_{(x,t)}\Bigg[\sum_{c=1}^{N}\alpha_c\sum_{i,j}^{H\times W} t_{i,j,c}\,\log D(x)_{i,j,c}\Bigg] \;-\; \mathbb{E}_{(z,t)}\Bigg[\sum_{i,j}^{H\times W}\log D\big(G(z,t)\big)_{i,j,c=N+1}\Bigg],
$$
where x represents real images, H and W represent the image height and width, (z, t) is the combination of noise and label map used by G to produce synthesized images, and D maps real or synthesized images to per-pixel (N + 1)-class prediction probabilities. The ground truth label t is a 3D tensor, where the first two dimensions correspond to the spatial positions (i, j), and the third dimension encodes the class c as a one-hot (unit) vector. When designing G to align with D, the loss function for G is expressed as follows:
$$
\mathcal{L}_G = -\,\mathbb{E}_{(z,t)}\Bigg[\sum_{c=1}^{N}\alpha_c\sum_{i,j}^{H\times W} t_{i,j,c}\,\log D\big(G(z,t)\big)_{i,j,c}\Bigg].
$$
To enable multi-modal synthesis through noise sampling, G is designed to synthesize diverse outputs from input noise. Hence, a 3D noise tensor is constructed to match the spatial dimensions of the label map, where N represents the number of semantic classes and M corresponds to the number of masks. During training, the 3D noise tensor is sampled channel-wise and fed to each pixel of the image. After sampling, the noise and label maps are concatenated along the channel dimension, forming a noise-label 3D tensor. This concatenated tensor serves as input to the first generation layer and to the spatially adaptive normalization layers of each generation block. Because the 3D noise is sensitive at the channel and pixel levels, object-specific image generation is possible at test time by sampling noise locally for a particular channel, label, or pixel.
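The construction of the spatially replicated noise and its channel-wise concatenation with the label map can be illustrated as follows. This is a simplified sketch: the 64-channel noise dimension is an assumption following the public OASIS implementation, the tensor names are ours, and only the default global-noise case (one noise vector replicated over all pixels) is shown.

```python
import torch

def build_generator_input(label_map, noise_dim=64):
    """Concatenate per-pixel noise with a one-hot semantic label map (OASIS-style input).

    label_map: float tensor of shape (B, N, H, W) holding one-hot semantic labels.
    Returns a tensor of shape (B, noise_dim + N, H, W) that is fed to the first
    generator layer and to the spatially adaptive normalization layers.
    """
    B, N, H, W = label_map.shape
    # Sample one noise vector per image and replicate it over the spatial grid,
    # yielding a 3D noise tensor that matches the label map's spatial dimensions.
    noise = torch.randn(B, noise_dim, 1, 1, device=label_map.device).expand(B, noise_dim, H, W)
    return torch.cat([noise, label_map], dim=1)  # channel-wise concatenation
```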
3.2. Semantic Segmentation
3.2.1. DeepLabv3+
For pixel-level image segmentation, DeepLabv3+ [
27] represents a significant advancement within the renowned DeepLab model family [
37]. This architecture has been specifically designed to excel at precise and detailed segmentation, offering strong performance and accuracy. DeepLabv3+ has garnered significant acclaim for its ability to achieve precise and efficient semantic image segmentation. With its enhanced architecture and refined techniques, it builds upon the foundation established by its predecessors and pushes the boundaries of segmentation capabilities further. Moreover, DeepLabv3+ has achieved outstanding performance on various benchmark datasets, surpassing previous state-of-the-art methods in terms of accuracy and computational efficiency. Its ability to capture contextual information at multiple scales while preserving fine details has made it particularly effective in tasks such as object recognition, scene understanding, and medical image analysis.
The architecture of DeepLabv3+ builds upon the strengths of its predecessors by incorporating an encoder-decoder structure along with atrous convolutions [
38] and atrous spatial pyramid pooling (ASPP) modules [
27]. The encoder network, usually based on pre-trained CNNs such as ResNet [
39] or Xception [
40], extracts high-level features from the input image while preserving spatial information. The atrous convolutions enable the network to capture multi-scale contextual information without significantly increasing the computational cost. The decoder network employs bilinear upsampling to restore the spatial resolution of the features obtained from the encoder. Additionally, skip connections from earlier layers are incorporated to ensure that fine-grained details are preserved in the final segmentation. The ASPP module further enhances the receptive field of the network by applying atrous convolutions at multiple dilation rates and capturing contextual information at different scales.
3.2.2. SegFormer
SegFormer [
28] adopts a ViT-based methodology, leveraging its distinctive Mix Transformer (MiT) encoder. The MiT encoder consists of a hierarchical Transformer, overlapped patch merging, efficient self-attention, and Mix-FFN. These components collectively contribute to the effectiveness and efficiency of the SegFormer model for segmentation tasks. Unlike ViT, which can only generate feature maps at a single resolution, the hierarchical Transformer in SegFormer produces multi-level feature maps. These maps provide both high-resolution coarse features and low-resolution fine-grained details, contributing to improved segmentation accuracy.
ViT incorporates Positional Encoding (PE) to capture positional information. However, the resolution of PE is fixed. As a result, when the resolution differs between training and testing, the accuracy may deteriorate. To address this issue, a Mix-FFN is introduced, which applies a convolutional layer directly to the feed-forward network (FFN).
SegFormer adopts a lightweight decoder consisting solely of MLP layers, known as the All-MLP decoder. This avoids the computationally expensive configurations used in other methods. The hierarchical Transformer encoder in SegFormer enables this simple decoder by having a larger effective receptive field (ERF) compared to the encoder of traditional CNNs.
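A minimal sketch of an All-MLP-style decode head is shown below: each multi-level feature map is linearly projected to a common channel width, upsampled to a common resolution, concatenated, and fused by another linear (1×1 convolution) layer before per-pixel classification. The channel sizes are illustrative assumptions, not the exact SegFormer settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecodeHead(nn.Module):
    """Lightweight decoder fusing hierarchical encoder features with linear layers only."""
    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256, num_classes=4):
        super().__init__()
        # Per-level linear projections, implemented as 1x1 convolutions.
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, 1) for c in in_channels])
        self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, 1)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        # feats: list of multi-level maps, ordered from the highest to the lowest resolution.
        size = feats[0].shape[2:]
        upsampled = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
                     for p, f in zip(self.proj, feats)]
        fused = self.fuse(torch.cat(upsampled, dim=1))
        return self.classify(fused)  # per-pixel class logits at the highest feature resolution
```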
3.3. Cross-Entropy Loss
To train DeepLabv3+ and SegFormer, a large-scale dataset annotated with pixel-level labels is required. Typically, the network is trained in a supervised manner using a cross-entropy loss function $\mathcal{L}_{CE}$ given by:
$$
\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log\big(p_{ij}\big),
$$
where N is the number of pixels; C is the number of classes; $y_{ij}$ represents the ground truth label for pixel i and class j; and $p_{ij}$ is the predicted probability of pixel i belonging to class j.
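As a concrete check of this definition, the snippet below computes the same pixel-wise cross-entropy from one-hot labels and predicted probabilities. In practice, frameworks compute the loss directly from logits (e.g., with nn.CrossEntropyLoss); the explicit form is shown here only for clarity.

```python
import torch

def pixel_cross_entropy(y_onehot, p, eps=1e-12):
    """Mean pixel-wise cross-entropy.

    y_onehot: (N, C) one-hot ground truth over N pixels and C classes.
    p:        (N, C) predicted class probabilities (rows sum to 1).
    """
    return -(y_onehot * torch.log(p + eps)).sum(dim=1).mean()

# Tiny example with N = 2 pixels and C = 3 classes.
y = torch.tensor([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
p = torch.tensor([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
loss = pixel_cross_entropy(y, p)  # -(log 0.7 + log 0.6) / 2 ≈ 0.434
```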
3.4. Evaluation Criteria
In this study, we employ the Fréchet Inception Distance (FID) [
41,
42] as a formal evaluation criterion. By incorporating information about the underlying distributions and the representation of features, the FID metric provides a comprehensive assessment that captures the fidelity and resemblance of the generated samples to the real data. The FID metric utilizes a pre-trained Inception network [
43] that has been trained on the ImageNet dataset [
44]. The pre-training on ImageNet helps capture general visual features and enables transfer learning, where the learned representations are fine-tuned for specific tasks [
45]. By leveraging the representation power of the Inception network, FID provides a quantitative measure of the quality and diversity of generated images compared to the real image distribution. FID calculates the distance between the feature vectors extracted from the real images and the generated images, quantitatively evaluating the similarity between the two. The FID is defined as follows:
$$
\mathrm{FID} = \lVert m - m_w \rVert_2^2 + \mathrm{Tr}\Big(C + C_w - 2\big(C\,C_w\big)^{1/2}\Big),
$$
where $m_w$ and $m$ represent the means of the feature vectors extracted from the generated images and the real images, respectively, and $C_w$ and $C$ represent the corresponding covariance matrices.
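A direct NumPy/SciPy implementation of this formula is sketched below, assuming the Inception features have already been extracted for both image sets; the matrix square root follows the standard computation of Heusel et al.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """FID between two sets of Inception feature vectors of shape (num_images, dim)."""
    m, C = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    m_w, C_w = feats_gen.mean(axis=0), np.cov(feats_gen, rowvar=False)

    covmean, _ = linalg.sqrtm(C.dot(C_w), disp=False)  # matrix square root of C * C_w
    if np.iscomplexobj(covmean):                       # discard tiny imaginary parts
        covmean = covmean.real

    diff = m - m_w
    return float(diff.dot(diff) + np.trace(C + C_w - 2.0 * covmean))
```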
Subsequently, to assess the quality of segmentation, Intersection over Union (IoU) is employed as the evaluation metric in this study. IoU represents the degree of overlap between the predicted region and the ground truth region, and mIoU represents the average IoU across all classes. IoU is calculated using the following equation:
$$
\mathrm{IoU} = \frac{TP}{TP + FP + FN}.
$$
Herein, True Positive (TP) corresponds to the instances where both the prediction and the ground truth are true. False Positive (FP) represents the instances where the prediction is true, but the ground truth is false. False Negative (FN) denotes the instances where the prediction is false, but the ground truth is true.
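The per-class IoU and mIoU used in this study can be computed directly from label maps, as in the following sketch; class indices are assumed to be integers in both the prediction and the ground truth arrays.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU = TP / (TP + FP + FN) and its mean over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        tp = np.logical_and(pred == c, gt == c).sum()
        fp = np.logical_and(pred == c, gt != c).sum()
        fn = np.logical_and(pred != c, gt == c).sum()
        if tp + fp + fn == 0:          # class absent from both prediction and ground truth
            continue
        ious.append(tp / (tp + fp + fn))
    return np.array(ious), float(np.mean(ious))
```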
4. Preliminary Performance Evaluation with Benchmark Datasets
4.1. Data Profiles and Setups
We evaluated the performance of the proposed method, OutcropHyBNet, in a general context using two benchmark datasets: COCO-Stuff10K [
46] and ADE20K [
47]. These datasets encompass diverse scenes and objects commonly encountered in everyday environments, facilitating a comprehensive evaluation of the proposed method’s performance in real-world scenarios.
The COCO-Stuff10K serves as an extensively utilized benchmark dataset for tasks related to scene understanding and segmentation. Comprising 10,000 high-resolution images, this dataset features pixel-wise annotations. The images within this dataset exhibit diverse resolutions while maintaining an aspect ratio of 3:4. The dataset provides comprehensive annotations for both object and stuff categories. It includes 80 object categories, such as person, car, and dog, and 91 stuff categories, such as sky, grass, and road. The pixel-level annotations enable detailed semantic segmentation of scenes, facilitating the evaluation and development of advanced computer vision algorithms. Moreover, the dataset covers a wide array of visual scenes, encompassing a comprehensive spectrum of both indoor and outdoor environments. It serves as a standard benchmark for evaluating and contrasting the performance of semantic segmentation models.
The ADE20K dataset is widely used for semantic segmentation tasks. It comprises more than 20,000 high-resolution images annotated with 150 semantic categories, consisting of 115 object and 35 stuff classes. Images in the dataset are commonly processed at a resolution of 512 × 512 pixels. The ADE20K dataset provides pixel-level annotations for both objects and stuff categories, enabling fine-grained semantic segmentation. It covers a diverse range of scenes, including indoor and outdoor environments, and captures various objects and stuff categories commonly encountered in everyday life. Moreover, the ADE20K dataset is designed to facilitate research and development in scene parsing and semantic understanding. It serves as a benchmark for evaluating the performance of semantic segmentation models and has been widely adopted in the computer vision community. The inclusion of this dataset allows for a comprehensive assessment of the generalization capability of the proposed method, OutcropHyBNet.
4.2. Experimental Setup
For this study, we utilized MMSegmentation [
48], an open-source segmentation toolbox developed by OpenMMLab, as the designated implementation platform. MMSegmentation offers a comprehensive and versatile solution specifically tailored for semantic segmentation tasks. Its open-source nature and seamless integration with PyTorch provide us with a valuable resource for conducting our evaluation experiments. One of the key strengths of MMSegmentation lies in its wide array of segmentation models, catering to diverse requirements in the field. This rich collection of models establishes MMSegmentation as an invaluable asset for various developers. With its extensive toolkit, we can effectively address various segmentation tasks and explore different approaches, thereby enhancing the depth and breadth of our research and practical applications.
The computation for this study was carried out on a single NVIDIA RTX A6000 GPU. Renowned as a high-performance GPU, the A6000 is purpose-built to tackle professional workloads in various fields, including data science, deep learning, AI research, and content creation. Its exceptional capabilities make it an ideal choice for handling the intensive computational tasks. The A6000 has 10752 CUDA cores, 48 GB of GDDR6 memory, and a memory bandwidth of 768 GB/s. With its powerful architecture, it delivers exceptional performance for tasks such as deep learning training, real-time ray tracing, and high-resolution rendering. The parameters for each method were determined using the configuration file of the pretrained model that achieved the highest accuracy on the ADE20K dataset, which is provided by MMSegmentation [
49].
4.3. Class Balancing for Uneven Data
To mitigate the challenge of class imbalance [
50], we employed class balancing techniques [
51] as a simple and practical approach for data adjustment and enhancement. Let x represent the number of pixels in a class and y represent the total number of pixels excluding unlabeled pixels. The class frequency z and the weight w are calculated using the following equations:
$$
z = \frac{x}{y}, \qquad w = \frac{\mathrm{median}(z)}{z},
$$
where $\mathrm{median}(z)$ represents the median of z taken over all classes.
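A sketch of how such median-frequency class weights can be computed from per-class pixel counts is given below, assuming the z = x/y, w = median(z)/z form stated above; the class names and pixel counts are placeholders, not the values of Table 1.

```python
import numpy as np

def median_frequency_weights(pixel_counts, total_labeled_pixels):
    """Class weights w = median(z) / z with z = x / y (per-class pixel frequency)."""
    z = np.asarray(pixel_counts, dtype=np.float64) / total_labeled_pixels
    return np.median(z) / z

# Placeholder pixel counts for the four outcrop classes (Black, Red, Cyan, Yellow).
counts = [4_000_000, 1_200_000, 900_000, 2_500_000]
weights = median_frequency_weights(counts, total_labeled_pixels=sum(counts))
# Classes occupying fewer pixels receive larger weights, as described for Table 1.
```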
Table 1 provides a comprehensive overview of the calculated weights, which were determined considering the pixel occupancy ratio. The weights were assigned in such a way that they decrease as the pixel occupancy ratio increases, and conversely, they increase as the pixel occupancy ratio decreases. The approach aims to effectively mitigate the issue of class imbalance by assigning higher weights to underrepresented classes with lower pixel occupancy ratios. This strategy ensures that these classes receive greater attention during the training process, thereby addressing their significance in a more comprehensive manner. By incorporating these calculated weights, we aim to achieve a more balanced and accurate model performance, ultimately improving the overall effectiveness of our approach in handling imbalanced datasets.
We evaluate the performance of the models using the Intersection over Union (IoU) metric for each class.
Table 2 presents the comparison of class balancing results for both DeepLabv3+ and SegFormer models. From the results, we observe that class balancing has a significant impact on the performance of both models. SegFormer shows improvements in most classes, except for the Black class. The decrease in performance for the Black class in SegFormer can be attributed to a specific image (Image 10), where the IoU is significantly lower compared to other images. The IoU for the Black class in DeepLabv3+ remains relatively stable across all images.
4.4. Data Augmentation
Our proposed approach, OutcropHyBNet, utilizes OASIS to generate images and augment the dataset. By leveraging OASIS-generated images, we expand the breadth and depth of our dataset, enabling a more comprehensive representation of geological features and variations. Integrating OASIS into our methodology addresses the challenge of limited real-world outcrop data and enriches the learning process of OutcropHyBNet. The combination of synthetic and real data enhances the model’s capacity to accurately analyze and interpret geological formations with improved precision and reliability.
Table 3 presents the parameters used for this purpose. In this experiment, a dataset with a sampling number of 256 images was utilized, and the same dataset was used for both training and testing. DeepLabv3+ and SegFormer were used as the comparative methods. A total of 3661 images were used for evaluation, which consisted of 333 images generated using OASIS and
3328 images from the dataset used in the full image experiment. The training and testing data were randomly allocated in a 9:1 ratio, resulting in 3294 images for training and 367 images for testing.
Table 4 shows the results of class balancing. For the Black class, both methods improved accuracy in 4 out of 6 images. For the Red class, DeepLabv3+ improved accuracy in 10 out of 13 images, while SegFormer improved accuracy in 8 images. Similarly, for the Cyan class, DeepLabv3+ improved accuracy in 11 out of 13 images, and SegFormer improved accuracy in 8 images. Regarding the Yellow class, DeepLabv3+ improved accuracy in 6 out of 13 images, while SegFormer improved accuracy in 4 images. In terms of mIoU, DeepLabv3+ improved accuracy in 10 out of 13 images, and SegFormer improved accuracy in 6 images. It is worth noting that the Yellow class exhibited a decrease in accuracy in more than half of the images for both methods. This can be attributed to the initial weight of 0.3792, which is significantly lower compared to the absence of class balancing.
Table 5 demonstrates the improved mIoU scores achieved by incorporating OASIS-generated images. These images significantly enhance the accuracy of segmenting and classifying geological formations in our proposed approach. Both segmentation methods showed improved accuracy for all classes compared to the dataset before augmentation. In particular, they achieved an accuracy improvement of over 3% for the Cyan class. Therefore, it can be concluded that dataset augmentation using OASIS for data generation contributes to the improvement in accuracy. Furthermore, the consistent trend of CNN backbones outperforming ViT backbones was observed throughout the evaluation.
4.5. Selection of Backbones
To verify the effectiveness of the proposed approach, a preliminary experiment was conducted for performance comparison using seven different network models with varying backbones. The backbones used for comparison were ResNet [
39], HRNet [
52], U-Net [
23], Swin Transformer [
53], MiT [
28], ViT [
10], and SVT [
54].
Table 6 presents the specific parameter configurations for each backbone utilized in this experiment. The common parameters included a batch size of 8, a class count of 4, 4 sampling patterns (64, 128, 256, and 512 images), an input image size of 256 × 256 pixels, and a training epoch set to 50. Regarding the input data, a random sampling was performed on 13 images, allocating them to training and testing data in a 9:1 ratio.
The left panel of
Figure 3 illustrates the accuracy of CNN-based methods [
37]. DeepLabv3+ [
27] consistently demonstrated the highest accuracy among all sampling numbers. Additionally, across all methods, the highest accuracy was achieved when the sampling number was 256 images. Subsequently, the right panel of
Figure 3 presents the accuracy of ViT and hybrid-based methods [
28,
57]. SegFormer consistently exhibited the highest accuracy across all sampling numbers. Moreover, excluding SEgmentation TRansformer (SETR) [
57], SegFormer achieved the highest accuracy when the sampling number was 256 images.
Figure 4 depicts the accuracy trends and distributions for all backbones and the top two models. The red lines correspond to ViT-based backbones [
10], the green lines represent hybrid backbones [
57], and the blue lines represent CNN-based backbones [
38]. The graph visually depicts how the accuracy of these methods varies across different experimental settings. Comparing the results, the methods can be ranked in terms of accuracy as follows: DeepLabv3+ [
27], SegFormer [
28], Twins [
54], ResNet [
39], and ViT [
10]. In other words, on the original dataset, CNNs outperformed ViT in terms of accuracy for this context.
Table 7 presents mIoU of each class [
27,
28]. Comparing the results, at a sampling number of 64, DeepLabv3+ demonstrated superiority for all classes except the Black class. Additionally, at all other sampling numbers, DeepLabv3+ outperformed SegFormer. Analyzing the mean scores, DeepLabv3+ consistently showed superior performance in all classes. Furthermore, for both methods, the classes ranked by accuracy from highest to lowest were Black, Yellow, Cyan, and Red.
4.6. Segmentation Results
Figure 5 shows the comparison results for all classes in each dataset. In ADE20K, DeepLabv3+ achieved an mIoU of 29.36%, while SegFormer achieved an mIoU of 41.38%. SegFormer demonstrated superiority in 138 out of 150 classes (92% of the total classes). In COCO-Stuff10K, DeepLabv3+ achieved an mIoU of 38.78%, while SegFormer achieved an mIoU of 48.40%. SegFormer exhibited superiority in 154 out of 171 classes (90% of the total classes, see
Appendix A).
In COCO-Stuff10K, DeepLabv3+ outperformed SegFormer in terms of accuracy for certain classes. Among the things classes, DeepLabv3+ exhibited higher accuracy than SegFormer in four classes: surfboard, sports ball, car, and mouse, out of the 80 classes. In the stuff classes, DeepLabv3+ demonstrated higher accuracy in 13 classes: platform, mountain, stone, straw, bush, bridge, roof, house, cabinet, floor-other, float-wood, carpet, and wall-panel, out of the 91 classes. Conversely, SegFormer showed higher overall accuracy compared to DeepLabv3+ in both datasets. Similarly, in ADE20K, DeepLabv3+ surpassed SegFormer in accuracy for specific classes. Among the things classes, DeepLabv3+ achieved higher accuracy than SegFormer in 4 classes: railing, base, food, and monitor, out of the 115 classes. In the stuff classes, DeepLabv3+ demonstrated higher accuracy in 8 classes: house, river, skyscraper, hovel, path, tower, stairway, and pier, out of the 35 classes. Once again, SegFormer exhibited higher overall accuracy than DeepLabv3+ in ADE20K. In both datasets, the percentage of classes where DeepLabv3+ showed superiority was higher in the stuff classes compared to the things classes. This can be attributed to the fact that stuff classes lack well-defined boundaries, and the CNN-based architecture of DeepLabv3 utilized by DeepLabv3+ may have provided an advantage in texture classification, as mentioned earlier.
Table 8 presents the top score classes observed in the COCO-Stuff10K dataset, while
Table 9 showcases the top score classes identified in the ADE20K dataset. These tables provide a comprehensive overview of the most prominent classes present in each dataset, shedding light on the prevalent semantic categories and objects captured in the respective datasets. The identification and analysis of these top score classes contribute to a deeper understanding of the dataset composition and can inform the development of more effective models and algorithms for semantic segmentation and scene understanding tasks.
Focusing on the stuff classes, which are the classes of interest in this study, the top 10 classes combined for both methods include 3 classes (15%) in COCO-Stuff10K and 8 classes (40%) in ADE20K. On the other hand, the bottom 10 classes combined include 24 classes (75%) in COCO-Stuff10K and 9 classes (45%) in ADE20K. Therefore, it can be inferred that stuff classes have a lower representation in the top classes and a higher representation in the bottom classes.
6. Conclusions
The objective of this study was to analyze the distribution of geological strata through the application of segmentation techniques on geological outcrop images, facilitating a comprehensive understanding of their spatial arrangement. We proposed OutcropHyBNet, which leverages DeepLabv3+ and SegFormer for semantic segmentation, along with OASIS for data augmentation. We conducted evaluations and comparisons of the classification performance and accuracy of both models across different classes using two publicly available benchmark datasets. In our preliminary experiments, we presented compelling evidence of the enhanced performance of DeepLabv3+ in classes heavily reliant on textures, particularly stuff classes, substantiating its effectiveness in such scenarios. In the evaluation experiments using our original datasets, we revealed that for non-standard objects with ambiguous shapes resembling geological strata, where classification depends on texture, CNNs exhibited superiority. Our study also revealed that SegFormer outperformed other models in scenarios with limited data availability. Additionally, we identified that imbalanced class distributions had a notable impact on the accuracy of the models, and that employing class balancing techniques resulted in greater accuracy improvements for DeepLabv3+ than for SegFormer. Moreover, our findings revealed that the utilization of OASIS for data augmentation significantly contributed to enhanced accuracy; by incorporating OASIS into the training process, we observed improved precision and performance in the classification task, highlighting the effectiveness of data augmentation techniques in enhancing the overall accuracy of the models. In the evaluation experiments conducted on ground-level images obtained using a stationary camera and aerial images obtained using a drone, we successfully demonstrated the superior performance of SegFormer across all classes, highlighting its effectiveness and superiority in this specific context.
Our future endeavors encompass several challenges, including the augmentation of diversity through the collection of aerial images from various sources and types. By expanding our dataset to include a broader range of aerial images, we aim to improve the robustness and generalization capabilities of our models. Additionally, we plan to further explore and enhance data augmentation techniques to augment the diversity within the existing dataset, thereby fostering more comprehensive and representative training samples. Moreover, we will explore methods to improve the reproducibility of texture and color in image generation using GANs and DMs. We will also propose annotation modifications based on the inference results to further improve accuracy.