Article

Channel Interaction and Transformer Depth Estimation Network: Robust Self-Supervised Depth Estimation Under Varied Weather Conditions

1 School of Information Science and Technology, Nantong University, Nantong 226001, China
2 School of Future Technology, South China University of Technology, Guangzhou 510641, China
3 School of Transportation and Civil Engineering, Nantong University, Nantong 226001, China
4 Henan Airport Group, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(20), 9131; https://doi.org/10.3390/su16209131
Submission received: 19 September 2024 / Revised: 9 October 2024 / Accepted: 17 October 2024 / Published: 21 October 2024
(This article belongs to the Section Sustainable Transportation)

Abstract

Monocular depth estimation provides low-cost environmental information for intelligent systems such as autonomous vehicles and robots, supporting sustainable development by reducing reliance on expensive, energy-intensive sensors and making technology more accessible and efficient. However, in practical applications, monocular vision is highly susceptible to adverse weather conditions, significantly reducing depth perception accuracy and limiting its ability to deliver reliable environmental information. To improve the robustness of monocular depth estimation in challenging weather, this paper first utilizes generative models to adjust image exposure and generate synthetic images of rainy, foggy, and nighttime scenes, enriching the diversity of the training data. Next, a channel interaction module and Multi-Scale Fusion Module are introduced. The former enhances information exchange between channels, while the latter effectively integrates multi-level feature information. Finally, an enhanced consistency loss is added to the loss function to prevent the depth estimation bias caused by data augmentation. Experiments on datasets such as DrivingStereo, Foggy CityScapes, and NuScenes-Night demonstrate that our method, CIT-Depth, exhibits superior generalization across various complex conditions.

1. Introduction

Depth estimation is a critical environmental feature acquisition method in various domains, including autonomous driving [1], robotics [2], and 3D reconstruction [3]. In autonomous driving, depth estimation enables vehicles to interpret the 3D environment, thereby enhancing decision making and navigation. This improves road safety while simultaneously promoting sustainable transportation by optimizing vehicle routes, reducing fuel consumption, and minimizing emissions. Similarly, in robotics, depth estimation is used to detect terrain, stairs, and room layouts, contributing to more efficient and sustainable automation across industries. Compared to other depth estimation methods, monocular depth estimation is advantageous due to its low hardware cost, ease of integration, and wide range of application scenarios. Monocular depth estimation is a challenging computer vision task that requires predicting depth for every pixel in an image. Approaches to this task can be categorized into supervised and self-supervised methods. Supervised depth estimation has achieved significant success; however, it requires a large amount of labeled data with precise depth information for training. High-quality depth-labeled data are expensive to obtain, and the collected data often lack the diversity needed to cover different scenarios and conditions. To mitigate the reliance on labeled data, self-supervised methods have received widespread attention. Self-supervised depth estimation leverages the characteristics of the input data themselves to learn depth information, without requiring explicit depth annotations. It predicts the relationship between adjacent frames using a depth estimation network and a camera pose estimation network and then employs an image reconstruction loss, which measures the difference between the reconstructed and original images, as a self-supervised signal. This enables deep learning without labeled data and thereby effectively reduces annotation costs.
Most existing self-supervised depth estimation models rely heavily on CNN-based encoding–decoding structures. Although such structures can effectively extract image features for depth estimation, CNNs do not allocate attention to the most informative features and struggle to model correlations between pixels in different regions, losing information that is useful for depth estimation. Because severe-weather images have sparse feature distributions, this shortcoming can greatly reduce the accuracy of CNN-based depth estimation methods. The Transformer [4] employs self-attention mechanisms to model long-range dependencies between different regions in an image. The self-attention mechanism enables the model to consider the entire image context when processing each position, effectively capturing relationships between distant regions and complex global contextual information. However, at the large input resolutions commonly used in self-supervised depth estimation, directly using Transformers significantly increases computational resource requirements, making training and inference much slower.
Monocular depth estimation performance often degrades under adverse weather conditions such as rain or fog. The training datasets used by existing self-supervised methods focus on good weather or stable lighting conditions, leading to weak model generalization. Monocular depth estimation relies on texture information in images to infer depth, and this texture can be significantly blurred in harsh weather. On rainy days, raindrops cause image blurring, increasing the uncertainty of depth estimation. This issue is especially critical in fields such as autonomous vehicles and robotics, which are deployed in a wide range of complex environments. The Generative Adversarial Network (GAN) [5] is a powerful generative model capable of producing highly realistic data, and it is widely used in domains such as image generation and image restoration. However, due to the differences between generated and real data, directly using generated data for model training may reduce the effectiveness of the augmentation or even introduce inaccurate training signals.
To maintain high-precision depth estimation under complex weather conditions and enhance the understanding of the overall scene, this paper proposes the Channel Interaction and Transformer depth estimation network (CIT-Depth). Both the original images and simulated weather images are utilized as training data. The network backbone consists of convolutional modules and Transformer blocks. CNNs handle local features, while the Transformer models long-range relationships within the image. The model can simultaneously capture both local and global information. A channel interaction module is added between the encoder and decoder to provide a global understanding of the entire environment. During the decoding phase, multi-scale fusion decoding is employed to better leverage feature information and details at different levels. An enhanced consistency loss is incorporated into the loss function. The main contributions of this paper are summarized as follows:
  • This paper utilizes a combination of original, exposure-adjusted, and enhanced images to strengthen the model’s ability to generalize across different scenarios.
  • Two key modules are integrated: the channel interaction module improves scene comprehension by integrating information across the preceding five frames, while the Multi-Scale Fusion Module optimizes the network’s use of features at multiple levels.
  • This paper introduces a mask when calculating the reconstruction loss of the enhanced images to reduce depth estimation bias caused by data augmentation. An enhanced consistency loss is proposed to ensure consistency in depth predictions.
The rest of this paper is organized as follows: Section 2 presents the related works. Section 3 describes the methods used in this paper. Section 4 presents the experimental results. We conclude this paper in Section 5.

2. Related Work

2.1. Supervised Depth Estimation

Initially, depth estimation was treated as a simple regression task using real depth data collected from one or more sensors. With the development of deep learning, Eigen et al. were the first to use CNNs for depth estimation [6]. Subsequent research has built upon this foundation by introducing more complex architectures and loss functions to enhance accuracy and robustness. Laina et al. employed a fully convolutional residual network with a novel upsampling scheme and the inverse Huber loss for optimization, achieving end-to-end training [7]. Other supervised methods discretize the continuous depth range and recast regression as an ordinal classification problem to improve model stability [8,9]. These networks rely on highly accurate labeled data and are difficult to generalize to other datasets.

2.2. Self-Supervised Depth Estimation

Self-supervised methods provide alternatives that do not require large amounts of labeled data. SfM-Learner [10] was the first to achieve self-supervised depth estimation using view synthesis. This method simultaneously trains a Depth Network and a Pose Network, learning both the depth and camera pose by using view synthesis as the network’s supervisory signal. The presence of dynamic objects in an image affects the accuracy of depth estimation. Monodepth2 (MD2) [11] took the per-pixel minimum of photometric errors across source views to address occlusion issues and introduced auto-masking to handle untextured areas and dynamic objects. A novel self-supervised semantics-guided depth estimation method was proposed by Klingner et al., designed to handle dynamic objects such as moving cars and pedestrians [12]. In addition to monocular videos, stereo pairs can also be used to train depth estimation networks [13]. Garg et al. utilized the known camera displacement and the predicted left-image depth map during training, minimizing the difference between the reconstructed left image and the actual left image [14]. To maintain consistency between frames, MonoRec [15] and Manydepth [16] use time-series images to warp reference-frame features to the target frame based on multiple predefined depth assumptions.
Many methods aim to improve the generalization ability of self-supervised depth estimation. Zhao et al. proposed an image transfer-based domain adaptation framework to address the performance challenges of self-supervised monocular depth estimation at nighttime [17]. The md4all [18] network is trained on raw images by generating complex samples and combining them with a standard loss function, guiding the model to perform depth estimation under diverse conditions. Spencer et al. proposed a framework that concurrently learns cross-domain dense feature representations and feature consistency-based depth estimation, performing well in nighttime driving scenarios [19]. SAFENet [20] integrated semantic awareness into the representation of deep features by combining semantic and geometric information. Wang et al. used a mapping consistency image enhancement module to boost image visibility and contrast, leading to a better performance on nighttime datasets [21]. While these networks generally perform well under single weather conditions, their performance deteriorates in complex real-world environments with changing weather conditions and varying lighting.
There are also several research efforts aimed at improving the structure of depth estimation networks. Yin et al. replaced VGG networks with ResNet-based encoder–decoder architectures [22]. PackNet [23] used 3D convolution to compress and decompress features through a symmetric packing and unpacking module. With the rise of Transformers, many studies have applied them to computer vision tasks. Dosovitskiy et al. integrated the Transformer architecture into computer vision [24]. MT-SfMLearner [25] combined the Dense Prediction Transformer (DPT) [26] with Monodepth2 modules to enhance the depth prediction performance. Li et al. proposed a parallel encoder architecture that combines Transformers and convolutional networks to separately handle long-range and short-range depth estimation [27]. Hwang et al. used CNNs in the early stages for local feature extraction and ViT in the later stages for global feature extraction, leading to a better model performance and reduced computational complexity [28]. The current research challenge lies in effectively combining CNNs and Transformers to leverage their respective strengths while balancing computational complexity and performance gains.

3. Method

3.1. Data Augmentation for Simulating Weather Variability

In depth estimation tasks, the model must accurately infer structural and depth features within an image. However, since the KITTI dataset primarily contains images captured under sunny and uniformly lit conditions, the model tends to overfit to these specific scenarios. Therefore, enriching the diversity of the dataset is crucial. This paper utilizes GANs to generate diverse weather and lighting effects, thereby augmenting the KITTI dataset. Training on these augmented datasets enables the model to maintain consistent geometric understanding under diverse weather conditions. For example, the model can learn to recognize slight blurriness caused by raindrops or low contrast in scenes affected by fog, which are common in real-world scenarios. By using both the original data and GAN-generated augmented data, the model can learn the features of real-world scenes while also enhancing its understanding of scenes under varying weather and lighting conditions. To better simulate real-world lighting variations, this paper introduces overexposure and underexposure effects into the images. Overexposure is applied to the bright regions, while underexposure is introduced to the dark regions, generating a series of images with varying exposures to mimic different lighting conditions. This method of localized exposure adjustment enhances the overall perceptual quality of the image while carefully preserving the original texture and visual cues, ensuring that the enhancement process does not cause the loss or distortion of key information. To further enhance the depth estimation model’s adaptability to various lighting conditions, this paper uses different overexposure enhancement factors $E_{\mathrm{bright}}$ and underexposure reduction factors $E_{\mathrm{dark}}$ to generate a series of images with varying exposure levels.
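As a concrete illustration of the localized exposure adjustment described above, the following PyTorch sketch brightens pixels above a luminance threshold by $E_{\mathrm{bright}}$ and darkens the rest by $E_{\mathrm{dark}}$; the specific factor values, the luminance threshold, and the function name are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch

def adjust_exposure(img, e_bright=1.4, e_dark=0.6, thresh=0.5):
    """Hypothetical localized exposure adjustment.
    img: float tensor in [0, 1] with shape (3, H, W)."""
    luminance = img.mean(dim=0, keepdim=True)              # rough per-pixel brightness
    bright_mask = (luminance > thresh).float()             # 1 for bright regions, 0 for dark
    adjusted = img * (bright_mask * e_bright + (1.0 - bright_mask) * e_dark)
    return adjusted.clamp(0.0, 1.0)

# Sweeping several (E_bright, E_dark) pairs yields a series of exposure variants:
# exposures = [adjust_exposure(img, b, d) for b, d in [(1.2, 0.8), (1.5, 0.5)]]
```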
To generate a KITTI dataset with rainy weather effects, this paper adopts the CycleGAN [29] framework. The CycleGAN network structure includes two generators and two discriminators. Generator G is responsible for mapping the source domain (sunny KITTI images) to the target domain (rainy or foggy images), while Generator F maps the target domain back to the source domain. The architecture of the generator is based on CNNs, comprising convolutional layers, residual blocks, and transposed convolutional layers. Discriminator DX is used to distinguish between real sunny KITTI images and fake sunny images generated by F. Discriminator DY is used to distinguish between real rainy images and fake rainy images generated by G. The cycle consistency loss ensures that the generator retains the essential features and structure of the input image while producing the target domain image. Using the trained CycleGAN model, this paper inputs sunny KITTI images into Generator G to produce rainy and foggy images. This paper uses CoMoGAN [30] to generate scene data under complex lighting conditions such as night and dusk. CoMoGAN is capable of learning nonlinear continuous image transformations, making it particularly suitable for scenarios involving complex lighting variations. CoMoGAN employs functional instance normalization layers. By adjusting the ϕ value, the network can smoothly transition from day to dusk and then to night, allowing for precise control over lighting changes. Decoupled residual blocks are used to separate image content from lighting features, making the generated images visually closer to real night and dusk scenes. This paper combines various weather conditions, such as rain + fog and rain + night. To further enhance the model’s robustness, Gaussian noise, snow, and motion blur effects are added to the KITTI dataset.
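The additional Gaussian noise and motion blur corruptions mentioned above can be approximated with a few tensor operations, as in the hedged sketch below; the noise level and kernel size are assumed values, and the snow effect is omitted because it typically requires a dedicated rendering procedure.

```python
import torch
import torch.nn.functional as F

def add_gaussian_noise(img, std=0.02):
    """Additive Gaussian noise; std is an assumed value. img: (C, H, W) in [0, 1]."""
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

def motion_blur(img, kernel_size=7):
    """Horizontal motion blur via a normalized 1 x k averaging kernel (assumed length)."""
    c = img.shape[0]
    kernel = torch.zeros(c, 1, kernel_size, kernel_size, device=img.device)
    kernel[:, :, kernel_size // 2, :] = 1.0 / kernel_size
    blurred = F.conv2d(img.unsqueeze(0), kernel, padding=kernel_size // 2, groups=c)
    return blurred.squeeze(0)
```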

3.2. Self-Supervised Depth Estimation Network

The proposed network framework consists of a depth estimation network and a pose estimation network, as shown in Figure 1. The depth estimation network predicts scene depth, while the pose estimation network estimates the camera’s relative pose between consecutive frames. The depth estimation network adopts an encoder–decoder architecture. The PoseNet in this paper uses ResNet18 [31] as its backbone network. The network input is created by concatenating the target view with the source view. The network output represents the relative pose between the target view and each source view. If there are N views, the network outputs N-1 relative poses corresponding to the source views. The relative pose is represented by three Euler angles and three translation components.
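A minimal sketch of the pose branch described above, assuming a standard ResNet-18 backbone whose stem is widened to accept the concatenated target and source frames and whose head regresses three Euler angles and three translation components; the output scaling factor is a common heuristic for small inter-frame motion, not a detail reported in this paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PoseNet(nn.Module):
    """Sketch of the pose branch: ResNet-18 over a concatenated (target, source) pair,
    regressing a 6-DoF relative pose (3 Euler angles + 3 translations)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # older torchvision: pretrained=False
        # Widen the stem to take 6 input channels (two RGB frames).
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 6)
        self.backbone = backbone

    def forward(self, target, source):
        x = torch.cat([target, source], dim=1)   # (B, 6, H, W)
        pose = self.backbone(x)                  # (B, 6): [angles | translation]
        return 0.01 * pose                       # small-pose scaling heuristic

# One forward pass per source view yields the N-1 relative poses for N views.
```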

3.2.1. Channel Interaction Enhanced Depth Encoder

As shown in Figure 1, the depth encoder consists of a Conv-stem block, four CNN and Transformer modules, and a channel interaction module. The input image is processed by a Conv-stem block, which consists of two 3 × 3 convolutional layers, producing features at a resolution of $\frac{H}{2} \times \frac{W}{2}$.
Image features are progressively extracted through four tandem CNN and Transformer blocks. As illustrated in Figure 2, the Multi-Scale Feature Embedding module stacks varying numbers of 3 × 3 convolutional layers to capture features with different receptive fields. Features with 3 × 3 receptive fields are processed by the convolution module, while features with 3 × 3, 5 × 5, and 7 × 7 receptive fields are fed into three parallel Transformer modules. The convolution block comprises a 1 × 1 convolution, a 3 × 3 depthwise convolution, and another 1 × 1 convolution, using residual connections to improve information flow and extract local features $X$. Using the Transformer module proposed by MonoViT [32], the Multi-Scale Feature Embedding segments the image into visual tokens and applies factorized attention [33] to obtain global weighted features:
$\mathrm{FactorAtt}(Q, K, V) = \frac{Q}{\sqrt{C}} \cdot \big( \mathrm{softmax}(K)^{\top} V \big)$
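A compact PyTorch sketch of the factorized attention above; the token layout (batch, tokens, channels) and the omission of multi-head splitting are simplifications relative to the MonoViT/MPViT implementations.

```python
import torch

def factorized_attention(q, k, v):
    """FactorAtt(Q, K, V) = (Q / sqrt(C)) · (softmax(K)^T V).
    q, k, v: (B, N, C) token embeddings; softmax is taken over the N tokens of K."""
    c = q.shape[-1]
    context = torch.softmax(k, dim=1).transpose(1, 2) @ v   # (B, C, C) channel context
    return (q / c ** 0.5) @ context                          # (B, N, C) weighted output
```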
The Hybrid Feature Fusion Module combines the local feature $X$ with the global features $Z_1$, $Z_2$, and $Z_3$ and then applies 1 × 1 convolutions to generate features for the next stage:
$Y_{n+1} = \mathrm{Conv}\big(\mathrm{Concat}(X_n, Z_{n,1}, Z_{n,2}, Z_{n,3})\big)$
The encoder in the existing network outputs features from the final layer and directly forwards them to the decoder for further processing. This makes it difficult to model long-range relationships between different regions and to effectively capture global information, resulting in decreased depth estimation accuracy under adverse weather conditions. To address the above issues, this paper proposes a channel interaction module, shown in Figure 3, between the encoder and decoder. The feature map $Y \in \mathbb{R}^{H \times W \times C}$ generated by the encoder is reshaped into a two-dimensional matrix $Y \in \mathbb{R}^{N \times C}$, where $N = H \times W$. Matrix multiplication is applied between $Y$ and its transpose $Y^{\top}$ to compute the channel similarity matrix $CS \in \mathbb{R}^{C \times C}$:
$CS_{m,n} = Y_m \cdot Y_n^{\top}$
A low channel similarity indicates that the channels capture distinct features. To reduce the potential negative impact of noise from a single frame on channel similarity, this paper averages the channel similarities over the previous five frames:
$GS = \frac{1}{T} \sum_{t=1}^{T} CS_t$
where $T = 5$ indicates that the average is taken over the preceding five frames. By integrating information from multiple frames, the method more effectively handles dynamic objects and improves global comprehension. Feature maps from different frames capture diverse details of the same scene. In foggy conditions, different frames capture variations in scene depth information due to changes in the fog density. In rainy conditions, the dynamic changes in raindrops can impact scene clarity and the stability of depth estimation. By integrating information from these frames, the model can better adapt to weather changes, thereby improving the accuracy and robustness of depth estimation. Using the softmax function, we obtain the attention matrix $\mathrm{Att} \in \mathbb{R}^{C \times C}$:
$\mathrm{Att}_{m,n} = \frac{\exp(GS_{m,n})}{\sum_{k=1}^{C} \exp(GS_{m,k})}$
The reshaped feature map and the original feature map are added element-wise to obtain the final output $F \in \mathbb{R}^{H \times W \times C}$:
$F_m = \sum_{n=1}^{C} \mathrm{Att}_{m,n} Y_n + Y_m$
The final output features are derived from a weighted fusion of multi-frame channel features, facilitating the improved capture of global contextual information.
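The channel interaction computation can be sketched as follows; the buffer that stores similarity matrices from the preceding frames is a simplified stand-in for the paper's multi-frame averaging (it assumes a fixed batch size across frames), and detaching the stored matrices is an assumption about how gradients are handled.

```python
import torch
import torch.nn as nn

class ChannelInteraction(nn.Module):
    """Sketch of the channel interaction module: channel-similarity attention
    averaged over the previous T frames (T = 5 in the paper)."""
    def __init__(self, t=5):
        super().__init__()
        self.t = t
        self.history = []          # channel-similarity matrices of past frames

    def forward(self, y):
        b, c, h, w = y.shape
        y_flat = y.flatten(2)                       # (B, C, N), N = H * W
        cs = y_flat @ y_flat.transpose(1, 2)        # (B, C, C) channel similarity
        self.history.append(cs.detach())
        self.history = self.history[-self.t:]       # keep at most T frames
        gs = torch.stack(self.history).mean(dim=0)  # average similarity over frames
        att = torch.softmax(gs, dim=-1)             # (B, C, C) attention matrix
        out = att @ y_flat + y_flat                 # weighted channel fusion + residual
        return out.view(b, c, h, w)
```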

3.2.2. Multi-Scale Feature Fusion Depth Decoder

Skip connections directly convey shallow, high-resolution features to the upsampling stage of the decoder. Without proper selection, concatenation causes unimportant information to receive disproportionate attention. By incorporating the Global Channel Pooling (GC) module into the skip connections, as shown in Figure 4, the model enhances the features of important channels. Global average pooling is applied to the feature map $F \in \mathbb{R}^{H \times W \times C}$ to obtain a global feature vector:
$g_c = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} F_{h,w,c}$
A two-layer fully connected network generates a weight $s_c \in \mathbb{R}^{1 \times 1 \times C}$ for each channel:
$s_c = \sigma\big(W_2 \cdot \mathrm{ReLU}(W_1 \cdot g)\big)$
where $\sigma$ represents the sigmoid function. We apply the channel weight $s_c$ to the corresponding channel of the original feature map:
$F_{\mathrm{rew}} = s_c \cdot F$
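The Global Channel Pooling block follows the familiar squeeze-and-excitation pattern; a minimal sketch is given below, where the reduction ratio of the two-layer fully connected network is an assumed hyperparameter.

```python
import torch.nn as nn

class GlobalChannelPooling(nn.Module):
    """Global average pooling, a two-layer MLP, and a sigmoid gate applied channel-wise."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        g = self.pool(f).view(b, c)        # g_c: global descriptor per channel
        s = self.fc(g).view(b, c, 1, 1)    # s_c: channel weights in (0, 1)
        return f * s                       # F_rew = s_c · F
```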
High-resolution features obtained via skip connections primarily capture low-level information, containing a substantial amount of local texture features. Low-resolution features extracted from deeper layers capture more global contextual information. Most existing networks merely concatenate high-resolution features from skip connections with upsampled low-resolution feature maps. The direct concatenation of features hinders the model’s ability to balance global and local information, adversely affecting its performance in complex weather conditions. To more effectively fuse features from different levels, this paper proposes the Multi-Scale Fusion Module, as illustrated in Figure 5. First, the high-resolution features $F_{\mathrm{skip}}$ from the skip connections are concatenated with the upsampled low-resolution features $F_{\mathrm{up}}$. Then, a 3 × 3 convolution followed by batch normalization and the ReLU activation function is applied to obtain $F_{\mathrm{con}}$:
$F_{\mathrm{con}} = \mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}(\mathrm{Concat}(F_{\mathrm{skip}}, F_{\mathrm{up}}))\big)\big)$
The concatenated features $F_{\mathrm{con}}$ are processed by a channel attention module to compute the weight $W_c \in \mathbb{R}^{1 \times 1 \times 2C}$ for each channel:
$W_c = \mathrm{ChannelAtt}(F_{\mathrm{con}})$
The channel weights $W_c$ are multiplied by the features $F_{\mathrm{con}}$ to generate the weighted feature map. The introduction of the channel attention module enables the model to emphasize key features based on varying weather conditions. In rainy conditions, the model can emphasize depth differences between the background and foreground, thereby more effectively handling visual disturbances caused by raindrops. In foggy conditions, the model can enhance the recognition of important scene structures, ensuring that key depth information is accurately captured even in low-contrast settings. To highlight key features while maintaining the overall feature structure, the weighted features are added to the original features $F_{\mathrm{con}}$:
$F_{\mathrm{fin}} = W_c \odot F_{\mathrm{con}} + F_{\mathrm{con}}$
The processed features retain the structural information of the original scene while enhancing the model’s understanding of both global and local features under complex weather conditions.
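Putting these pieces together, a sketch of the Multi-Scale Fusion Module might look as follows; it reuses the GlobalChannelPooling sketch above as the channel attention block, and the upsampling mode and channel bookkeeping are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Sketch of the Multi-Scale Fusion Module: concatenate skip and upsampled
    features, apply Conv-BN-ReLU, reweight channels, and add a residual connection."""
    def __init__(self, skip_ch, up_ch):
        super().__init__()
        fused_ch = skip_ch + up_ch
        self.conv = nn.Sequential(
            nn.Conv2d(fused_ch, fused_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused_ch),
            nn.ReLU(inplace=True),
        )
        self.channel_att = GlobalChannelPooling(fused_ch)   # sketch from Section 3.2.2

    def forward(self, f_skip, f_up):
        f_up = nn.functional.interpolate(f_up, size=f_skip.shape[-2:], mode="nearest")
        f_con = self.conv(torch.cat([f_skip, f_up], dim=1))
        return self.channel_att(f_con) + f_con               # F_fin = W_c ⊙ F_con + F_con
```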

3.3. Loss Function with Enhanced Consistency Loss

Based on the method proposed by Zhou et al. [10], this paper trains a depth estimation network and a Pose Network simultaneously for image reconstruction. Given an input image $I_t$, the depth network predicts its depth $D_t$. The pose estimation network uses neighboring frames of the input image to predict the relative pose $P_{t \to t'}$ between $I_t$ and $I_{t'}$, where $t' \in \{t-1, t+1\}$. We obtain the reconstructed image $I_{t' \to t}$:
$I_{t' \to t} = I_{t'} \big\langle \mathrm{Proj}(D_t, P_{t \to t'}, K) \big\rangle$
where $\mathrm{Proj}(\cdot)$ projects the depth map $D_t$ of the target image $I_t$ into the 2D coordinate system of $I_{t'}$, $\langle \cdot \rangle$ denotes the sampling operator, and $K$ is the camera’s intrinsic matrix. The Depth and Pose Networks are trained by minimizing the per-pixel photometric reprojection loss:
$L_P = \min_{t'} \mathrm{pe}(I_t, I_{t' \to t})$
Here, $\mathrm{pe}$ represents the photometric loss, a weighted combination of the Structural Similarity Index (SSIM) [34] and the L1 loss:
$\mathrm{pe}(I_t, I_{t' \to t}) = \frac{\alpha}{2}\big(1 - \mathrm{SSIM}(I_t, I_{t' \to t})\big) + (1 - \alpha)\,\big\| I_t - I_{t' \to t} \big\|_1$
where $\alpha = 0.85$. The GAN generates individual images without accounting for temporal consistency within the original image sequences, leading to unpredictable variations when enhanced image sequences are directly used for pose estimation. To resolve this issue, this paper inputs the enhanced images only into the depth estimation network, while the pose estimation network continues to use the original images. The enhanced images are reconstructed by
$I^{\mathrm{aug}}_{t' \to t} = I^{\mathrm{aug}}_{t'} \big\langle \mathrm{Proj}(D^{\mathrm{aug}}_t, P_{t \to t'}, K) \big\rangle$
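For reference, the photometric term $\mathrm{pe}$ used in the reprojection losses above can be computed as in the following sketch, which follows the common Monodepth2-style SSIM approximation with 3 × 3 average-pooling windows; the exact windowing used in this work is not specified, so treat those details as assumptions.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM with 3x3 average-pooling windows (Monodepth2 style)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_error(pred, target, alpha=0.85):
    """pe = (alpha/2) * (1 - SSIM) + (1 - alpha) * |pred - target|, per pixel."""
    ssim_term = (1 - ssim(pred, target)).mean(1, keepdim=True) / 2
    l1_term = (pred - target).abs().mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1_term
```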
Artifacts or noise in certain enhanced images lead to a high reprojection loss, which affects the overall loss calculation. To mitigate excessive errors from enhanced images, an element-wise mask $M$ is incorporated into the loss function to ignore pixels with significant errors in the enhanced images. The element-wise mask independently evaluates each pixel, ensuring that high-error pixels caused by noise or artifacts are effectively disregarded during loss calculation. The model can focus on more reliable pixel regions, thereby mitigating the negative impact of unreliable data on the training process. In adverse weather or complex scenes, this helps the model learn more stably, effectively avoiding biases introduced by artifacts and noise. The reprojection loss of the enhanced images is incorporated into $L_P$:
$L_P = \min_{t'} \mathrm{pe}(I_t, I_{t' \to t}) + M \cdot \min_{t'} \mathrm{pe}\big(I_t, I^{\mathrm{aug}}_{t' \to t}\big)$
The depth information from the original and enhanced images, combined with the same pose estimation, is used to reconstruct the target image. The resulting reconstructed images should be identical. Based on this idea, this paper incorporates the enhanced consistency loss into the final loss function:
$L_C = \min_{t'} \mathrm{pe}\big(I_{t' \to t}, I^{\mathrm{aug}}_{t' \to t}\big)$
where $I_{t' \to t}$ and $I^{\mathrm{aug}}_{t' \to t}$ represent the reconstructed images generated from the original image and the augmented image, respectively. The enhanced consistency loss helps the model better maintain an understanding of scene structure and consistency by minimizing the error between the augmented reconstructed image and the original reconstructed image. Particularly under complex weather conditions, this loss ensures that the model consistently captures the geometric features of the scene when handling both original and augmented images, thereby reducing uncertainties introduced by data augmentation and enhancing the robustness and accuracy of depth estimation. To maintain continuity in textureless regions of the depth map while preserving edge details, this paper uses an edge-aware smoothness loss inspired by previous work [11]:
$L_S = \big| \partial_x d_t^* \big| \, e^{-\left| \partial_x I_t \right|} + \big| \partial_y d_t^* \big| \, e^{-\left| \partial_y I_t \right|}$
where $d_t^* = d_t / \hat{d}_t$ represents the mean-normalized inverse depth. The final loss function combines the photometric reprojection loss, edge-aware smoothness loss, and enhanced consistency loss:
$L = \mu L_P + \omega L_S + \gamma L_C$
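The final objective can be assembled as in the following sketch; the loss weights $\mu$, $\omega$, and $\gamma$ are not reported here, so the values below are placeholders, and the per-pixel error tensors and the mask $M$ are assumed to have been computed beforehand (e.g., with the photometric function sketched above).

```python
import torch

def total_loss(pe_orig, pe_aug, pe_cons, disp, img, mask,
               mu=1.0, omega=1e-3, gamma=1.0):
    """Combine L_P (with mask M on the augmented term), L_S, and L_C.
    pe_orig / pe_aug / pe_cons: per-pixel errors over source views, (B, 2, H, W);
    disp: (B, 1, H, W) inverse depth; img: (B, 3, H, W); mask: (B, 1, H, W)."""
    l_p = pe_orig.min(dim=1, keepdim=True).values.mean() \
        + (mask * pe_aug.min(dim=1, keepdim=True).values).mean()
    l_c = pe_cons.min(dim=1, keepdim=True).values.mean()

    # Edge-aware smoothness on the mean-normalized inverse depth.
    d = disp / (disp.mean(dim=[2, 3], keepdim=True) + 1e-7)
    grad_dx = (d[..., :, 1:] - d[..., :, :-1]).abs()
    grad_dy = (d[..., 1:, :] - d[..., :-1, :]).abs()
    grad_ix = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
    grad_iy = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
    l_s = (grad_dx * torch.exp(-grad_ix)).mean() + (grad_dy * torch.exp(-grad_iy)).mean()

    return mu * l_p + omega * l_s + gamma * l_c
```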
The complete process of our method is shown in Algorithm 1.
Algorithm 1 CIT-Depth: Depth Estimation Network
  • Components: DepthNet, PoseNet
  • Loss functions: photometric reprojection loss ($L_P$), enhanced consistency loss ($L_C$), edge-aware smoothness loss ($L_S$)
  • if Training Phase then
  •     Input: image sequence $(I_t, I_{t'})$, augmented image sequence $(I_t^{\mathrm{aug}}, I_{t'}^{\mathrm{aug}})$
  •     while not converged and not reached max iterations do
  •         $I_t$ → DepthNet ⇒ depth map $D_t$
  •         $I_t^{\mathrm{aug}}$ → DepthNet ⇒ depth map $D_t^{\mathrm{aug}}$
  •         $(I_t, I_{t'})$ → PoseNet ⇒ relative pose $P_{t \to t'}$
  •         Perform reprojection: $I_{t'}$ → Reproject($D_t$, $P_{t \to t'}$, $K$) ⇒ $I_{t' \to t}$
  •         Perform reprojection: $I_{t'}^{\mathrm{aug}}$ → Reproject($D_t^{\mathrm{aug}}$, $P_{t \to t'}$, $K$) ⇒ $I_{t' \to t}^{\mathrm{aug}}$
  •         Apply the mask $M$ to the augmented reprojection loss
  •         Compute the photometric reprojection loss $L_P$
  •         Compute the edge-aware smoothness loss $L_S$
  •         Compute the enhanced consistency loss $L_C$
  •         Total loss: $L = \mu L_P + \omega L_S + \gamma L_C$
  •         Update model parameters via backpropagation
  •     end while
  • else
  •     Inference Phase:
  •     Input: single input image $I$
  •     $I$ → DepthNet ⇒ predicted depth map $D$
  •     Apply post-processing to convert the predicted depth map to the final depth map $D_{\mathrm{final}}$
  •     Output: final depth map $D_{\mathrm{final}}$
  • end if

4. Experiments

4.1. Implementation Details

We implement our CIT-Depth model in PyTorch 1.8.0. The model is trained for 30 epochs on the KITTI dataset, with both the pose encoder and depth encoder initialized using ImageNet [35] pretrained weights. The model uses AdamW [36] as the optimizer, with a batch size of 12, an input image size of 640 × 192, and a learning rate of $10^{-4}$. The experiments are conducted on a single NVIDIA RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA). For evaluation, this paper employs the seven standard metrics proposed in previous work [6].
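For reference, the seven standard metrics of Eigen et al. [6] (Abs Rel, Sq Rel, RMSE, RMSE log, and the three threshold accuracies $\delta < 1.25$, $1.25^2$, $1.25^3$) can be computed as in the sketch below; masking of invalid depths and median scaling are assumed to have been applied beforehand.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth evaluation metrics over matched valid pixels.
    gt, pred: 1-D arrays of ground-truth and predicted depths (already masked/scaled)."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(thresh < 1.25 ** i).mean() for i in (1, 2, 3)]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```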

4.2. Data Augmentation with GAN-Generated Images

To further enhance the model’s generalization ability, this paper uses GANs to generate image data under various weather and time conditions, including rainy days, dusk, and dawn. These synthetic images closely resemble real images and significantly broaden the diversity of the training dataset. By incorporating these complex scenarios into the original dataset, the model can learn a wider range of environmental features during training, resulting in greater robustness in diverse and complex real-world applications. Figure 6 shows some of the enhanced effects. The experimental results indicate that enriching the training data with GAN-generated images improves the model’s depth estimation performance in these specific scenarios.

4.3. Results on KITTI

The KITTI dataset [37] is a widely used benchmark in autonomous driving and computer vision research. It contains typical road scenes, including roads, buildings, trees, cars, and more. This paper uses the Eigen split [6], which includes 39,810 images for training and 4424 images for validation. For testing, 697 images containing real LiDAR data are used.
Table 1 shows the performance of existing self-supervised frameworks at a resolution of 640 × 192, comparing our model with other self-supervised methods under monocular training. Our method outperforms most comparable methods. Figure 7 shows the performance results of our method and other methods on the KITTI dataset, demonstrating that CIT-Depth performs better in various scenes. CIT-Depth demonstrates a more balanced performance across various scenes, particularly exhibiting smoother depth transitions in complex edges and occlusion areas. It also shows better continuity when handling distant backgrounds, avoiding abrupt depth changes or discontinuities. This indicates that CIT-Depth is well adapted to a variety of scenes.
Table 1. Results on the KITTI dataset using the Eigen split. All methods were pretrained on ImageNet. “M” denotes training with monocular video, while “M + Se” denotes training with stereo video. The optimal value for each metric is indicated in bold font.
Method | Data | Resolution | Abs Rel | Sq Rel | RMSE | RMSE log | δ1 < 1.25 | δ2 < 1.25² | δ3 < 1.25³
SfMLearner [10] | M | 640 × 192 | 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959
Monodepth2-Res18 [11] | M | 640 × 192 | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981
Monodepth2-Res50 [11] | M | 640 × 192 | 0.110 | 0.830 | 4.642 | 0.189 | 0.882 | 0.961 | 0.981
SGDepth [12] | M + Se | 640 × 192 | 0.113 | 0.832 | 4.691 | 0.190 | 0.880 | 0.961 | 0.981
PackNet-SfM [23] | M | 640 × 192 | 0.111 | 0.784 | 4.601 | 0.189 | 0.877 | 0.960 | 0.982
HR-Depth [38] | M | 640 × 192 | 0.109 | 0.791 | 4.633 | 0.186 | 0.884 | 0.961 | 0.983
ADAADepth [39] | M | 640 × 192 | 0.111 | 0.815 | 4.684 | 0.187 | 0.883 | 0.961 | 0.982
BRNet [40] | M | 640 × 192 | 0.105 | 0.699 | 4.465 | 0.180 | 0.888 | 0.963 | 0.984
CADepth [41] | M | 640 × 192 | 0.105 | 0.765 | 4.535 | 0.181 | 0.891 | 0.964 | 0.983
DIFFNet [42] | M | 640 × 192 | 0.102 | 0.750 | 4.447 | 0.179 | 0.896 | 0.965 | 0.983
MonoViT [32] | M | 640 × 192 | 0.099 | 0.708 | 4.374 | 0.175 | 0.900 | 0.967 | 0.984
CIT-Depth (Ours) | M | 640 × 192 | 0.097 | 0.655 | 4.214 | 0.172 | 0.902 | 0.968 | 0.984
Figure 7. Qualitative results on the KITTI dataset. The first row shows the input image, followed by the predicted results of Monodepth2 [11], HR-Depth [38], DIFFNet [42], MonoViT [32], and CIT-Depth (ours).

4.4. Results on Make3D

To evaluate the model’s generalization capability, we perform tests on the Make3D dataset [43], which includes 134 test images of different outdoor scenes. Compared to the KITTI dataset, the Make3D dataset features diverse structures and terrain variations, such as building facades, trees, and lawns. These scenes typically exhibit significant depth gradient changes, making them suitable for testing a depth estimation model’s ability to handle large-scale depth variations.
The model trained on the KITTI dataset is evaluated on the Make3D dataset without any further modifications. As shown in Table 2, the CIT-Depth model exhibits better adaptability to various scenes compared to other methods. Figure 8 presents the qualitative results, showing that our model produces clearer results in outdoor scenes. Other models perform poorly in certain areas, particularly around the edges of distant buildings or trees, where depth predictions exhibit discontinuities or blurriness. CIT-Depth provides more continuous and smooth depth predictions at the edges of distant buildings and trees, with more accurate edge handling. This indicates that the method has a stronger ability to handle scene details and large-scale depth variations.

4.5. Results on DrivingStereo

The DrivingStereo dataset [44] is a large-scale stereo vision dataset that includes images captured under various weather conditions, such as sunny, rainy, cloudy, and foggy weather. The dataset features a diverse range of scenes, including city roads, highways, and rural roads, making it highly valuable for addressing the diversity and complexity of real-world driving scenarios.
To verify the model’s generalization capability under various weather conditions, the model is tested on the DrivingStereo dataset. Table 3 summarizes the evaluation results of various state-of-the-art methods. All the models are trained solely on the KITTI dataset and tested directly under different weather conditions. CIT-Depth outperforms other methods in rainy, foggy, and cloudy conditions, demonstrating the robustness of the method. Figure 9 shows that while other models perform well under sunny and cloudy conditions, they exhibit significant depth-blurring issues in rainy and foggy conditions. In contrast, CIT-Depth effectively captures edge details and depth information in distant scenes, particularly under complex conditions such as rain and fog. It performs well across most weather conditions and demonstrates a strong performance in handling low contrast and visual noise, providing more accurate and consistent depth predictions for distant objects.

4.6. Results on Foggy CityScape

The Foggy CityScapes dataset [45], derived from the original CityScapes dataset [46], is specifically designed for researching the performance of autonomous driving systems in foggy weather conditions. By overlaying fog effects on the original CityScapes images, this dataset simulates varying levels of foggy weather, creating realistic foggy scenes. These images retain the complex structure of urban streets, including roads, buildings, pedestrians, and vehicles, but the contrast and clarity are significantly reduced due to the fog.
Table 4 shows that CIT-Depth also performs well on the foggy dataset. Figure 10 illustrates that, due to the effect of fog, models like Monodepth2 exhibit noticeable blurriness and discontinuities in depth estimation for distant objects. This is particularly evident when dealing with trees and buildings at a distance, where the model struggles to accurately predict depth. CIT-Depth outperforms other methods in foggy conditions, maintaining higher accuracy in depth estimation for distant objects, with the depth map displaying better smoothness and consistency. In regions with distant trees and buildings, edge handling is notably smoother. However, under extreme foggy conditions, the model still shows slight blurriness in depth prediction for distant objects.

4.7. Results on NuScene-Night

The NuScenes-Night dataset [47], derived from the NuScenes dataset, is specifically intended for researching autonomous driving perception tasks in nighttime conditions. The dataset includes a variety of nighttime driving scenes, covering city streets, suburban roads, and highways. The images were primarily collected under nighttime or dim lighting conditions, characterized by extremely low light and high-contrast areas.
Table 5 shows that our model maintains a strong performance even at night, outperforming some models specifically designed for nighttime data. As shown in Figure 11, Monodepth2 exhibits noticeable blurriness and discontinuities in depth predictions in poorly lit areas, especially at the edges of roads and for distant objects. Other models struggle to effectively handle the high noise and low contrast caused by insufficient lighting, leading to a loss of detail in the depth map. CIT-Depth exhibits better smoothness and continuity in low-light and high-contrast areas. Under extreme low-light conditions, the model encounters challenges in predicting the depth of distant objects, often leading to considerable blurring. This blurring is primarily caused by the low signal-to-noise ratio in images captured by the camera at night, where useful information is overwhelmed by noise. Although the proposed method demonstrates an improved performance at night, there remains an issue of insufficient information when handling fine-grained depth prediction for distant objects.
Figure 11. Qualitative results on the NuScene-Night dataset. The depth prediction results of Monodepth2 [11], ADDS-Depth [48], MonoViT [32], and CIT-Depth under nighttime conditions.
Table 5. Results on the NuScene-Night dataset. ADDS-DepthNet was trained on the RobotCar dataset [49]. The optimal value for each metric is indicated in bold font.
Method | Abs Rel | Sq Rel | RMSE | RMSE log | δ1 < 1.25 | δ2 < 1.25² | δ3 < 1.25³
Monodepth2 [11] | 0.398 | 6.210 | 14.571 | 0.568 | 0.378 | 0.650 | 0.794
HR-Depth [38] | 0.460 | 6.635 | 15.028 | 0.622 | 0.305 | 0.570 | 0.749
CADepth [41] | 0.421 | 5.950 | 14.504 | 0.593 | 0.311 | 0.613 | 0.776
DIFFNet [42] | 0.344 | 4.853 | 13.154 | 0.491 | 0.440 | 0.710 | 0.838
ADDS-Depth [48] | 0.321 | 4.594 | 12.909 | 0.479 | 0.466 | 0.711 | 0.840
MonoViT [32] | 0.313 | 4.144 | 12.255 | 0.456 | 0.484 | 0.736 | 0.858
CIT-Depth (Ours) | 0.307 | 4.080 | 11.591 | 0.431 | 0.539 | 0.781 | 0.863

4.8. Ablation Study

To evaluate the impact of the channel interaction module (CIM) and the Multi-Scale Fusion Module (MSFM) on model performance, we conducted ablation studies to verify the contributions of these two modules to the depth estimation task under complex weather conditions. First, we used a baseline model, then incrementally added the CIM and MSFM, and finally combined both modules. We tested different model configurations on multiple datasets, encompassing various weather conditions such as sunny, rainy, and foggy scenarios. The results in Table 6 indicate that the addition of these two modules yields limited improvement on the KITTI dataset but demonstrates noticeable enhancement on other datasets. The channel interaction module enhances interaction between feature channels, enabling the model to more effectively identify key features in the scene. The Multi-Scale Fusion Module plays a crucial role in fusing multi-scale features, enhancing the model’s ability to capture fine details. When both the CIM and MSFM are incorporated, the overall model performance reaches its optimum, significantly reducing errors and improving accuracy metrics, particularly under extreme weather conditions. This indicates that the two modules exhibit strong synergy in handling diverse weather variations, jointly enhancing the model’s depth estimation accuracy and robustness.

5. Conclusions

This paper proposed CIT-Depth, a new self-supervised monocular depth estimation network suited to various weather conditions. The network backbone combines convolutional modules and Transformer modules and incorporates additional weather images during training. Channel fusion modules are added during the encoding and decoding stages to enhance the model’s adaptability to depth estimation in different scenarios. Our method achieves excellent results on the KITTI dataset. Experiments on the Foggy CityScape, DrivingStereo, and NuScene-Night datasets demonstrate that our model outperforms state-of-the-art architectures under various conditions (rain, fog, cloud, and night). Depth prediction for distant objects under extreme low-light conditions still has room for improvement; in the future, integrating light enhancement techniques or specifically optimizing the feature extraction module will be necessary to address the challenge of insufficient information in complex environments. Although the robustness of the CIT-Depth model was validated under various weather conditions, future work will include additional tests on datasets from diverse geographic regions and more extreme environmental variations to provide a more comprehensive evaluation of the model’s performance. We will also optimize computational efficiency through techniques such as pruning and quantization to reduce model complexity while maintaining performance, making the model more suitable for deployment on embedded devices.

Author Contributions

J.L.: draft preparation, model development, and manuscript revision; Z.G.: editing, software, and data curation; P.P.: overall planning, draft preparation, and manuscript review; H.Z.: supervision, project administration, and conceptualization; Q.S.: supervision, project administration, investigation, and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grant 52202496 and 62476145, the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province under grant 22KJB510040, and the Nantong social livelihood science and technology project under grant MS12022015.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The KITTI public dataset in this paper is available at http://www.cvlibs.net/datasets/kitti/ (accessed on 18 September 2024). The Make3D public dataset in this paper is available at http://make3d.cs.cornell.edu (accessed on 18 September 2024). The DrivingStereo public dataset in this paper is available at http://drivingstereo-dataset.github.io (accessed on 18 September 2024). The Foggy CityScape public dataset in this paper is available at https://www.cityscapes-dataset.com/benchmarks/#foggy-cityscapes (accessed on 18 September 2024). The NuScene-Night public dataset in this paper is available at https://www.nuscenes.org/nuscenes (accessed on 18 September 2024).

Conflicts of Interest

Author Hao Zhang was employed by Henan Airport Group. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Afshar, M.F.; Shirmohammadi, Z.; Ghahramani, S.A.A.G.; Noorparvar, A.; Hemmatyar, A.M.A. An Efficient Approach to Monocular Depth Estimation for Autonomous Vehicle Perception Systems. Sustainability 2023, 15, 8897. [Google Scholar] [CrossRef]
  2. Ebner, L.; Billings, G.; Williams, S. Metrically scaled monocular depth estimation through sparse priors for underwater robots. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 3751–3757. [Google Scholar]
  3. Jia, Q.; Chang, L.; Qiang, B.; Zhang, S.; Xie, W.; Yang, X.; Sun, Y.; Yang, M. Real-time 3D reconstruction method based on monocular vision. Sensors 2021, 21, 5909. [Google Scholar] [CrossRef] [PubMed]
  4. Vaswani, A. Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  5. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  6. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2014; pp. 2366–2374. [Google Scholar]
  7. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
  8. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2002–2011. [Google Scholar]
  9. Bhat, S.F.; Alhashim, I.; Wonka, P. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4009–4018. [Google Scholar]
  10. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858. [Google Scholar]
  11. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar]
  12. Klingner, M.; Termöhlen, J.-A.; Mikolajczyk, J.; Fingscheidt, T. Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 582–600. [Google Scholar]
  13. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar]
  14. Garg, R.; Vijay Kumar, B.G.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 740–756. [Google Scholar]
  15. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
  16. Wang, H.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.-C. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5463–5474. [Google Scholar]
  17. Zhao, C.; Tang, Y.; Sun, Q. Unsupervised monocular depth estimation in highly complex environments. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 6, 1237–1246. [Google Scholar] [CrossRef]
  18. Gasperini, S.; Morbitzer, N.; Jung, H.; Navab, N.; Tombari, F. Robust monocular depth estimation under challenging conditions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 8177–8186. [Google Scholar]
  19. Spencer, J.; Bowden, R.; Hadfield, S. Defeat-net: General monocular depth via simultaneous unsupervised representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 14402–14413. [Google Scholar]
  20. Choi, J.; Jung, D.; Lee, D.; Kim, C. Safenet: Self-supervised monocular depth estimation with semantic-aware feature extraction. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  21. Wang, K.; Zhang, Z.; Yan, Z.; Li, X.; Xu, B.; Li, J.; Yang, J. Regularizing nighttime weirdness: Efficient self-supervised monocular depth estimation in the dark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 16055–16064. [Google Scholar]
  22. Yin, Z.; Shi, J. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1983–1992. [Google Scholar]
  23. Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2485–2494. [Google Scholar]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  25. Varma, A.; Chawla, H.; Zonooz, B.; Arani, E. Transformers in self-supervised monocular depth estimation with unknown camera intrinsics. arXiv 2022, arXiv:2202.03131. [Google Scholar]
  26. Lasinger, K.; Ranftl, R.; Schindler, K.; Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. arXiv 2019, arXiv:1907.01341. [Google Scholar]
  27. Li, Z.; Chen, Z.; Liu, X.; Jiang, J. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. Mach. Intell. Res. 2023, 20, 837–854. [Google Scholar] [CrossRef]
  28. Hwang, S.-J.; Park, S.-J.; Baek, J.-H.; Kim, B. Self-supervised monocular depth estimation using hybrid transformer encoder. IEEE Sens. J. 2022, 22, 18762–18770. [Google Scholar] [CrossRef]
  29. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  30. Pizzati, F.; Cerri, P.; De Charette, R. CoMoGAN: Continuous model-guided image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 14288–14298. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Zhao, C.; Zhang, Y.; Poggi, M.; Tosi, F.; Guo, X.; Zhu, Z.; Huang, G.; Tang, Y.; Mattoccia, S. Monovit: Self-supervised monocular depth estimation with a vision transformer. In Proceedings of the 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, 17–19 October 2022; pp. 668–678. [Google Scholar]
  33. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. MPViT: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 7287–7296. [Google Scholar]
  34. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  35. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  36. Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  37. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  38. Lyu, X.; Liu, L.; Wang, M.; Kong, X.; Liu, L.; Liu, Y.; Chen, X.; Yuan, Y. HR-Depth: High resolution self-supervised monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, Number 3. pp. 2294–2301. [Google Scholar]
  39. Kaushik, V.; Jindgar, K.; Lall, B. ADAADepth: Adapting data augmentation and attention for self-supervised monocular depth estimation. IEEE Robot. Autom. Lett. 2021, 6, 7791–7798. [Google Scholar] [CrossRef]
  40. Han, W.; Yin, J.; Jin, X.; Dai, X.; Shen, J. Brnet: Exploring comprehensive features for monocular depth estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–28 August 2022; pp. 586–602. [Google Scholar]
  41. Yan, J.; Zhao, H.; Bu, P.; Jin, Y. Channel-wise attention-based network for self-supervised monocular depth estimation. In Proceedings of the 2021 International Conference on 3D Vision (3DV), Virtual, 18–21 October 2021; pp. 464–473. [Google Scholar]
  42. Zhou, H.; Greenwood, D.; Taylor, S. Self-supervised monocular depth estimation with internal feature fusion. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 22–25 November 2021. [Google Scholar]
  43. Saxena, A.; Sun, M.; Ng, A.Y. Make3d: Learning 3d scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 824–840. [Google Scholar] [CrossRef] [PubMed]
  44. Yang, G.; Song, X.; Huang, C.; Deng, Z.; Shi, J.; Zhou, B. DrivingStereo: A large-scale dataset for stereo matching in autonomous driving scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 899–908. [Google Scholar]
  45. Sakaridis, C.; Dai, D.; Van Gool, L. Semantic foggy scene understanding with synthetic data. Int. J. Comput. Vis. 2018, 126, 973–992. [Google Scholar] [CrossRef]
  46. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  47. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11621–11631. [Google Scholar]
  48. Liu, L.; Song, X.; Wang, M.; Liu, Y.; Zhang, L. Self-supervised monocular depth estimation for all day images using domain separation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12737–12746. [Google Scholar]
  49. Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 year, 1000 km: The Oxford RobotCar dataset. Int. J. Robot. Res. 2017, 36, 3–15. [Google Scholar] [CrossRef]
Figure 1. Overview of our CIT-Depth architecture. CIT-Depth consists of two parts: a Depth Network and a Pose Network [11]. The Depth Network combines convolutional layers with a Transformer architecture.
Figure 2. The depth encoder employs combined CNN and Transformer modules; each consists of a convolutional module followed by three Transformer modules.
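To make the staged layout in Figure 2 concrete, the following is a minimal PyTorch-style sketch of one encoder stage, assuming a convolutional block followed by three Transformer layers applied to flattened feature tokens. The class name EncoderStage, the layer sizes, and the attention settings are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One illustrative encoder stage (not the released code): a convolutional
    block followed by three Transformer encoder layers, as sketched in Figure 2."""
    def __init__(self, in_ch, out_ch, num_heads=4):
        super().__init__()
        # Convolutional module: downsample spatially and expand channels.
        self.conv_block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Three Transformer modules applied to the flattened feature tokens.
        layer = nn.TransformerEncoderLayer(
            d_model=out_ch, nhead=num_heads, dim_feedforward=out_ch * 4,
            batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=3)

    def forward(self, x):
        x = self.conv_block(x)                  # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.transformer(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Example: one stage applied to a feature map from a 640 x 192 input.
feat = EncoderStage(in_ch=64, out_ch=128)(torch.randn(1, 64, 48, 160))
```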
Figure 3. Structure of the channel interaction module.
Figure 4. Structure of the Global Channel Pooling module.
Figure 5. Structure of the Multi-Scale Fusion Module.
Figure 6. The augmented images generated by the GAN under various weather and time conditions, shown from top to bottom: original input image, mixed-exposure image, rainy image, foggy image, and nighttime image.
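As a rough illustration of how the GAN-generated variants in Figure 6 could be folded into self-supervised training, the sketch below randomly swaps the network input for one of the pre-generated weather/exposure images while the reconstruction target remains the original frame. The function and variable names (sample_augmented, weather_variants) are our own assumptions, not the paper's training code.

```python
import random

def sample_augmented(original, weather_variants, p_aug=0.5):
    """Return either the original frame or one of its pre-generated
    weather/exposure variants (mixed exposure, rain, fog, night).

    The photometric/reconstruction loss is still computed against the
    original frame, so the augmentation only perturbs the network input;
    the exact strategy here is an assumption, not the authors' code.
    """
    if weather_variants and random.random() < p_aug:
        return random.choice(weather_variants)
    return original

# Usage sketch: 'variants' holds the GAN outputs of Figure 6, aligned
# pixel-to-pixel with 'frame'; the depth network sees 'net_input' while
# the losses are evaluated on 'frame'.
# net_input = sample_augmented(frame, variants)
```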
Figure 8. Qualitative results on the Make3D dataset. CIT-Depth is compared with the predictions of Monodepth2 [11] and MonoViT [32].
Figure 9. Qualitative results on the DrivingStereo dataset. From left to right: images captured under sunny, rainy, foggy, and cloudy conditions.
Figure 10. Qualitative results on the Foggy CityScapes dataset: depth predictions of Monodepth2 [11], DIFFNet [42], MonoViT [32], and CIT-Depth under foggy conditions.
Table 2. Results on the Make3D dataset. Models trained on KITTI with 640 × 192 images. The optimal value for each metric is indicated in bold font.
Method            | Abs Rel | Sq Rel | RMSE  | RMSE log
Monodepth2 [11]   | 0.321   | 3.377  | 7.417 | 0.164
HR-Depth [38]     | 0.315   | 3.208  | 7.031 | 0.155
CADepth [41]      | 0.318   | 3.223  | 7.151 | 0.158
DIFFNet [42]      | 0.299   | 2.910  | 6.760 | 0.153
MonoViT [32]      | 0.286   | 2.759  | 6.625 | 0.147
CIT-Depth (Ours)  | 0.275   | 2.639  | 6.402 | 0.144
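For reference, the error metrics in Table 2 and the δ accuracies reported in the later tables follow the standard monocular depth evaluation protocol. The NumPy sketch below computes them, assuming the ground-truth and predicted depths have already been restricted to valid pixels and median-scaled; it is a reference implementation of the standard formulas, not code taken from the paper.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth-evaluation metrics over valid pixels.
    gt, pred: 1-D arrays of positive depths (already masked and scaled)."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()           # delta1: ratio within 1.25
    d2 = (thresh < 1.25 ** 2).mean()      # delta2: ratio within 1.25^2
    d3 = (thresh < 1.25 ** 3).mean()      # delta3: ratio within 1.25^3

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3

# Example usage: depths are typically clipped to the 1e-3 m to 80 m range
# before evaluation on driving datasets.
```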
Table 3. Results on the DrivingStereo dataset. All models were trained on the KITTI dataset and tested under four weather conditions (foggy, cloudy, rainy, and sunny). The optimal value for each metric is indicated in bold font.
Domain | Method            | Abs Rel | Sq Rel | RMSE   | RMSE log | δ1 < 1.25 | δ2 < 1.25² | δ3 < 1.25³
foggy  | Monodepth2 [11]   | 0.143   | 1.954  | 9.818  | 0.218    | 0.812     | 0.936      | 0.974
foggy  | HR-Depth [38]     | 0.132   | 1.826  | 9.587  | 0.198    | 0.826     | 0.949      | 0.982
foggy  | CADepth [41]      | 0.141   | 1.779  | 9.450  | 0.208    | 0.811     | 0.945      | 0.981
foggy  | DIFFNet [42]      | 0.126   | 1.562  | 8.724  | 0.189    | 0.839     | 0.956      | 0.985
foggy  | MonoViT [32]      | 0.106   | 1.155  | 7.256  | 0.161    | 0.871     | 0.970      | 0.990
foggy  | CIT-Depth (Ours)  | 0.104   | 1.147  | 7.213  | 0.159    | 0.880     | 0.975      | 0.992
cloudy | Monodepth2 [11]   | 0.155   | 1.902  | 6.977  | 0.209    | 0.812     | 0.943      | 0.979
cloudy | HR-Depth [38]     | 0.148   | 1.657  | 6.659  | 0.204    | 0.815     | 0.945      | 0.981
cloudy | CADepth [41]      | 0.148   | 1.805  | 6.712  | 0.205    | 0.829     | 0.947      | 0.981
cloudy | DIFFNet [42]      | 0.140   | 1.572  | 6.298  | 0.192    | 0.837     | 0.950      | 0.983
cloudy | MonoViT [32]      | 0.135   | 1.469  | 6.095  | 0.183    | 0.857     | 0.955      | 0.985
cloudy | CIT-Depth (Ours)  | 0.133   | 1.456  | 5.912  | 0.180    | 0.860     | 0.957      | 0.986
rainy  | Monodepth2 [11]   | 0.240   | 3.339  | 11.042 | 0.301    | 0.590     | 0.587      | 0.953
rainy  | HR-Depth [38]     | 0.222   | 2.962  | 10.495 | 0.281    | 0.631     | 0.869      | 0.959
rainy  | CADepth [41]      | 0.226   | 3.015  | 10.825 | 0.287    | 0.629     | 0.851      | 0.956
rainy  | DIFFNet [42]      | 0.192   | 2.411  | 9.626  | 0.246    | 0.677     | 0.914      | 0.969
rainy  | MonoViT [32]      | 0.174   | 2.132  | 9.490  | 0.231    | 0.728     | 0.928      | 0.976
rainy  | CIT-Depth (Ours)  | 0.170   | 2.015  | 9.023  | 0.220    | 0.736     | 0.935      | 0.979
sunny  | Monodepth2 [11]   | 0.178   | 2.105  | 8.209  | 0.240    | 0.782     | 0.925      | 0.968
sunny  | HR-Depth [38]     | 0.164   | 1.839  | 7.890  | 0.227    | 0.794     | 0.936      | 0.975
sunny  | CADepth [41]      | 0.162   | 1.755  | 7.689  | 0.221    | 0.801     | 0.936      | 0.974
sunny  | DIFFNet [42]      | 0.150   | 1.616  | 7.580  | 0.210    | 0.812     | 0.940      | 0.978
sunny  | MonoViT [32]      | 0.142   | 1.457  | 7.007  | 0.199    | 0.832     | 0.948      | 0.981
sunny  | CIT-Depth (Ours)  | 0.143   | 1.459  | 7.009  | 0.199    | 0.833     | 0.949      | 0.981
Table 4. Results on the Foggy CityScapes dataset. All models were trained on the KITTI dataset. The optimal value for each metric is indicated in bold font.
Method            | Abs Rel | Sq Rel | RMSE   | RMSE log | δ1 < 1.25 | δ2 < 1.25² | δ3 < 1.25³
Monodepth2 [11]   | 0.208   | 3.095  | 12.449 | 0.337    | 0.656     | 0.842      | 0.917
HR-Depth [38]     | 0.213   | 3.015  | 12.267 | 0.336    | 0.642     | 0.841      | 0.920
CADepth [41]      | 0.207   | 2.738  | 11.550 | 0.318    | 0.650     | 0.856      | 0.933
DIFFNet [42]      | 0.187   | 2.583  | 11.337 | 0.304    | 0.689     | 0.867      | 0.937
MonoViT [32]      | 0.155   | 1.873  | 9.585  | 0.244    | 0.771     | 0.910      | 0.967
CIT-Depth (Ours)  | 0.151   | 1.689  | 8.092  | 0.229    | 0.798     | 0.939      | 0.970
Table 6. Performance evaluation of different model configurations across various datasets. The baseline model is compared with configurations using CIM, MSFM, and their combination.
Dataset               | Model Configuration | Abs Rel | Sq Rel | RMSE   | RMSE log | δ1 < 1.25 | δ2 < 1.25² | δ3 < 1.25³
KITTI                 | Baseline            | 0.099   | 0.701  | 4.409  | 0.177    | 0.896     | 0.965      | 0.983
KITTI                 | +CIM                | 0.098   | 0.689  | 4.298  | 0.174    | 0.898     | 0.967      | 0.984
KITTI                 | +MSFM               | 0.098   | 0.698  | 4.341  | 0.175    | 0.899     | 0.967      | 0.984
KITTI                 | +CIM+MSFM           | 0.097   | 0.655  | 4.214  | 0.172    | 0.902     | 0.968      | 0.984
DrivingStereo (foggy) | Baseline            | 0.108   | 1.201  | 7.310  | 0.165    | 0.865     | 0.967      | 0.990
DrivingStereo (foggy) | +CIM                | 0.105   | 1.155  | 7.259  | 0.163    | 0.869     | 0.969      | 0.991
DrivingStereo (foggy) | +MSFM               | 0.105   | 1.150  | 7.255  | 0.162    | 0.871     | 0.970      | 0.991
DrivingStereo (foggy) | +CIM+MSFM           | 0.104   | 1.147  | 7.213  | 0.159    | 0.880     | 0.975      | 0.992
DrivingStereo (rainy) | Baseline            | 0.175   | 2.159  | 9.492  | 0.232    | 0.725     | 0.926      | 0.976
DrivingStereo (rainy) | +CIM                | 0.172   | 2.122  | 9.251  | 0.225    | 0.732     | 0.931      | 0.978
DrivingStereo (rainy) | +MSFM               | 0.173   | 2.138  | 9.319  | 0.230    | 0.729     | 0.929      | 0.978
DrivingStereo (rainy) | +CIM+MSFM           | 0.170   | 2.015  | 9.023  | 0.220    | 0.736     | 0.935      | 0.979
Foggy CityScapes      | Baseline            | 0.156   | 1.877  | 9.592  | 0.247    | 0.770     | 0.909      | 0.966
Foggy CityScapes      | +CIM                | 0.152   | 1.711  | 8.301  | 0.235    | 0.785     | 0.928      | 0.968
Foggy CityScapes      | +MSFM               | 0.154   | 1.851  | 9.204  | 0.240    | 0.778     | 0.916      | 0.968
Foggy CityScapes      | +CIM+MSFM           | 0.151   | 1.689  | 8.092  | 0.229    | 0.798     | 0.939      | 0.970
NuScenes-Night        | Baseline            | 0.315   | 4.148  | 12.260 | 0.457    | 0.482     | 0.739      | 0.859
NuScenes-Night        | +CIM                | 0.310   | 4.098  | 11.803 | 0.440    | 0.519     | 0.769      | 0.862
NuScenes-Night        | +MSFM               | 0.312   | 4.121  | 12.009 | 0.454    | 0.501     | 0.752      | 0.860
NuScenes-Night        | +CIM+MSFM           | 0.307   | 4.080  | 11.591 | 0.431    | 0.539     | 0.781      | 0.863
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

