1. Introduction
Multi-View Stereo (MVS) is one of the fundamental tasks of 3D computer vision. It uses camera parameters and viewpoint poses to compute the mapping relationship of each pixel across images for dense 3D scene reconstruction. In recent years, this technology has found widespread applications in areas such as robot navigation, digital preservation of cultural heritage, and autonomous driving. Traditional methods rely heavily on hand-crafted similarity metrics for reconstruction [1,2,3,4]. While these approaches perform well on Lambertian surfaces, their effectiveness diminishes under challenging conditions such as complex lighting variations, weak textures, and non-Lambertian surfaces. Furthermore, these methods are computationally inefficient, which significantly increases the time required to reconstruct large-scale scenes and limits their practical application.
Deep learning-based MVS methods address these limitations. Yao et al. [5] employed 2D Convolutional Neural Networks (CNNs) to extract image features and used differentiable homography warping, 3D CNN regularization, and depth regression to achieve end-to-end depth map prediction; the dense reconstruction was then obtained through depth map fusion. The introduction of CNNs allows better extraction of global features, with strong performance even in weakly textured and reflective scenes. Gu et al. [6], in CasMVSNet, adopted a cascaded approach to construct the cost volume, gradually refining the depth sampling range from coarse to fine. This stepwise refinement at higher feature resolutions generates more detailed depth maps while keeping reconstruction efficient and allocating computational resources sensibly. However, conventional multi-stage MVS frameworks often lack flexibility in depth sampling, relying mostly on static or predefined depth ranges. When the depth sampling for a pixel deviates from the true value, the model cannot adaptively adjust the sampling range for the next stage, leading to erroneous depth inferences.
The core step of multi-view stereo is constructing a 3D cost volume, which amounts to computing the similarity between multi-view images. Existing methods mostly use variance [6,7] to build the 3D cost volume. For example, Yao et al. [5] assign the same weight to every view when constructing the matching cost volume and aggregate the feature volumes from different views with a mean squared deviation. However, this approach overlooks pixel visibility across viewpoints, limiting its effectiveness for dense pixel-wise matching. To address this issue, Wei et al. [8] introduced context-aware convolution in the intra-view aggregation module of AA-RMVSNet to aggregate feature volumes from different viewpoints. Yi et al. [9] proposed an adaptive view aggregation module that uses deformable convolution networks to achieve pixel-wise and voxel-wise view aggregation with minimal memory consumption. Luo et al. [10] employed a learning-based block matching aggregation module, transforming individual volume pixels into pixel blocks of a certain size and enabling information exchange across depths. However, directly regularizing the cost volume does not allow communication with depth feature information from adjacent depths. With the continuing development of attention mechanisms, Yu et al. [11] incorporated attention into the feature extraction stage of the MVS network, yielding noticeable improvements in experimental results. Li et al. [12] cast depth estimation as a correspondence problem between sequences and optimized it through self-attention and cross-attention. Unfortunately, the above methods focus on a single dimension, addressing only 2D local similarity and obtaining pixel weights through complex networks. This introduces additional computational overhead and neglects the correlation between 2D semantics and 3D space, ultimately compromising 3D consistency along the depth direction.
To address the above issues, this paper proposes an uncertainty-epipolar Transformer multi-view stereo network (U-ETMVSNet) for object stereo reconstruction. First, an improved cascaded U-Net is used to enhance the extraction of 2D semantic features, and the cross-attention mechanism of the epipolar Transformer constructs 3D associations between feature volumes of different views along the epipolar lines, enhancing the 3D consistency of the depth space without introducing additional learnable parameters that would increase the model's computational cost. A cross-scale cost volume information exchange module allows the information contained in cost volumes at different stages to be progressively transmitted, strengthening the correlation between cost volumes and improving the quality of depth map estimation. Second, the depth sampling range is dynamically adjusted based on the uncertainty of the probability volume, which reduces the required number of depth samples and improves the accuracy of depth sampling. Finally, a multi-stage joint learning approach is proposed, replacing the conventional per-stage depth regression problem with a multi-depth-value classification problem; this joint learning strategy significantly enhances reconstruction precision. The proposed method is experimentally validated on the DTU and Tanks&Temples datasets and compared with current mainstream methods. It achieves high reconstruction accuracy even at lower depth sampling rates, confirming the effectiveness of the proposed approach for dense scene reconstruction.
The rest of the paper is organised as follows: Section 2 provides an overview of related methods in the field. Section 3 describes the proposed network in detail and the entire object reconstruction pipeline. Section 4 presents the experimental setup and multiple experiments conducted to validate the reliability and generalization ability of the proposed method. Finally, Section 5 summarizes the contributions of the proposed network to multi-view object reconstruction and offers prospects for future work.
3. Method
In this section, we provide a detailed overview of the proposed model. The overall network architecture is depicted in Figure 1. The network processes the given input images with an enhanced cascaded U-Net to extract 2D features at various scales (Section 3.1). Subsequently, differentiable homography warping is employed to construct the source-view feature volumes, with the depth hypotheses of the initial stage generated by inverse depth sampling (Section 3.2). The epipolar Transformer (ET) is then used to aggregate feature volumes from different viewpoints, generating stage-wise matching cost volumes, and the cost volume information exchange module (CVIE) enhances the utilization of information across different scales (Section 3.3). In stage 1 of the model, we dynamically adjust the depth sampling range based on the uncertainty of the current probability volume distribution to improve the accuracy of depth inference (Section 3.4). Finally, we introduce the proposed multi-stage joint learning approach (Section 3.5).
3.1. Cascaded U-Net Network
Traditional methods, such as Yao et al. [5], employ 2D convolutional networks for feature extraction; however, this approach can only perceive image textures within a fixed receptive field. In contrast, Chen et al. [33] utilize an improved U-Net for feature extraction and achieve favorable results. In this section, an enhanced cascaded U-Net feature extraction module is designed. The first part of the structure is illustrated in Figure 2. The network selectively handles low-texture regions to preserve more intricate details.
The given reference image and its adjacent source images are fed into the network to construct image features at different scales. In this cascaded U-Net, the front-end encoder applies successive convolution and pooling operations, increasing the channel dimension while reducing the spatial size to extract deep features. However, as the network depth increases, more feature information tends to be lost. The back-end decoder operates inversely to the encoder, upsampling to restore the original size and connecting with feature maps from earlier stages, which facilitates better reconstruction of target details. The key to this process lies in fusing high-level and low-level features to enrich the detail contained in the feature maps. The cascaded network then repeats this process, and the second part of the cascaded structure appends convolution operations at the output ports, obtaining the features for each stage, where the stage index denotes one of the three stages of the model and is omitted for simplicity in the following discussion. This cascaded U-Net feature extraction module helps preserve richer detailed features, providing more accurate information for depth estimation.
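For illustration, the sketch below shows a minimal PyTorch encoder-decoder of this kind that returns a coarse-to-fine feature pyramid for three stages. The layer widths, output channel counts, and stage naming are assumptions made for the example, not the exact configuration of the cascaded U-Net described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with BatchNorm and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNetFeature(nn.Module):
    """Encoder-decoder that returns features at 1/4, 1/2 and full resolution."""
    def __init__(self, base_ch=8):
        super().__init__()
        self.enc0 = conv_block(3, base_ch)                 # full resolution
        self.enc1 = conv_block(base_ch, base_ch * 2)       # 1/2 resolution
        self.enc2 = conv_block(base_ch * 2, base_ch * 4)   # 1/4 resolution
        self.dec1 = conv_block(base_ch * 4 + base_ch * 2, base_ch * 2)
        self.dec0 = conv_block(base_ch * 2 + base_ch, base_ch)
        # Output heads producing the per-stage feature maps (channel counts assumed).
        self.out_coarse = nn.Conv2d(base_ch * 4, 32, 1)
        self.out_mid = nn.Conv2d(base_ch * 2, 16, 1)
        self.out_fine = nn.Conv2d(base_ch, 8, 1)

    def forward(self, x):
        e0 = self.enc0(x)
        e1 = self.enc1(F.max_pool2d(e0, 2))
        e2 = self.enc2(F.max_pool2d(e1, 2))
        # Decoder: upsample and fuse with the skip features from the encoder.
        d1 = self.dec1(torch.cat([F.interpolate(e2, scale_factor=2, mode="bilinear",
                                                align_corners=False), e1], dim=1))
        d0 = self.dec0(torch.cat([F.interpolate(d1, scale_factor=2, mode="bilinear",
                                                align_corners=False), e0], dim=1))
        # Coarse-to-fine feature pyramid used by the three stages.
        return {"coarse": self.out_coarse(e2), "mid": self.out_mid(d1), "fine": self.out_fine(d0)}

feats = UNetFeature()(torch.randn(1, 3, 512, 640))  # example input
```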
3.2. Homography Warping
In deep learning-based MVS methods [27,29,30], the construction of the cost volume typically relies on differentiable homography warping, drawing inspiration from traditional plane-sweep stereo. Homography warping leverages the camera parameters to map each pixel of a source view to different depth hypotheses within a given depth range under the reference view. The procedure warps the source image features to each depth layer of the reference view's viewing frustum, and is expressed mathematically in Equation (1).
Equation (1) relates a pixel feature in the reference view to its correspondence in each source view. The camera intrinsic parameters and the motion transformation (rotation and translation) from each source view to the reference view are embedded into the warping, establishing the mapping between a reference-view pixel feature and the corresponding pixel feature in the i-th source view. The mapped features are distributed along the epipolar line in the source view, and the warped features of each depth layer together form the per-view feature volume. In this way, N − 1 feature volumes are generated, where D is the total number of depth hypotheses. This process converts the features from two dimensions to three, thereby restoring depth information.
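The following PyTorch sketch shows differentiable homography warping of the kind referenced by Equation (1), as it is commonly implemented in learning-based MVS pipelines. It assumes 4 × 4 projection matrices of the form K[R|t] in homogeneous coordinates; the function name and tensor layout are illustrative, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def homo_warping(src_feat, src_proj, ref_proj, depth_values):
    """Warp source-view features into the reference frustum at each depth hypothesis.

    src_feat:     [B, C, H, W]   source-view feature map
    src_proj:     [B, 4, 4]      source-view projection matrix (K @ [R|t], homogeneous)
    ref_proj:     [B, 4, 4]      reference-view projection matrix
    depth_values: [B, D]         depth hypotheses for the reference view
    returns:      [B, C, D, H, W] warped source feature volume
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]
    with torch.no_grad():
        # The sampling grid itself needs no gradients; gradients flow through grid_sample.
        proj = torch.matmul(src_proj, torch.inverse(ref_proj))   # reference -> source
        rot, trans = proj[:, :3, :3], proj[:, :3, 3:4]

        y, x = torch.meshgrid(
            torch.arange(H, dtype=torch.float32, device=src_feat.device),
            torch.arange(W, dtype=torch.float32, device=src_feat.device),
            indexing="ij")
        ones = torch.ones(H * W, device=src_feat.device)
        xyz = torch.stack((x.reshape(-1), y.reshape(-1), ones)).unsqueeze(0).repeat(B, 1, 1)

        rot_xyz = torch.matmul(rot, xyz)                                   # [B, 3, H*W]
        # Scale each ray by every depth hypothesis, then translate into the source frame.
        proj_xyz = rot_xyz.unsqueeze(2) * depth_values.view(B, 1, D, 1) + trans.view(B, 3, 1, 1)
        proj_xy = proj_xyz[:, :2] / proj_xyz[:, 2:3]                       # pixel coordinates
        # Normalize to [-1, 1] for grid_sample.
        grid_x = proj_xy[:, 0] / ((W - 1) / 2) - 1
        grid_y = proj_xy[:, 1] / ((H - 1) / 2) - 1
        grid = torch.stack((grid_x, grid_y), dim=3)                        # [B, D, H*W, 2]

    warped = F.grid_sample(src_feat, grid.view(B, D * H, W, 2),
                           mode="bilinear", padding_mode="zeros", align_corners=True)
    return warped.view(B, C, D, H, W)
```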
Because the input images are not rectified, directly sampling uniformly in depth space may leave the spatial sampling points unevenly distributed along the epipolar lines when projected onto the image plane. This is particularly evident in regions farther from the camera center, where the mapped features can lie very close together, leading to a loss of depth information, as illustrated in Figure 3. To address this issue, inspired by [29,34], the first stage of depth sampling in this paper initializes the depth hypotheses with inverse depth sampling. Specifically, samples are taken uniformly in inverse-depth space, which ensures approximately equidistant sampling in pixel space, as shown in Equation (2). Employing this sampling method effectively avoids the loss of depth information and thereby improves the reconstruction results.
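A minimal sketch of inverse depth sampling, assuming only that the hypotheses are spaced uniformly in 1/d between a given minimum and maximum depth (cf. Equation (2)):

```python
import torch

def inverse_depth_hypotheses(d_min, d_max, num_depths):
    """Sample depth hypotheses uniformly in inverse-depth space.

    Uniform steps in 1/d project to approximately equidistant points along the
    epipolar line, avoiding the clustering that uniform depth sampling produces
    far from the camera.
    """
    inv = torch.linspace(1.0 / d_min, 1.0 / d_max, num_depths)  # uniform in 1/d
    return 1.0 / inv                                            # depths from d_min to d_max

hyps = inverse_depth_hypotheses(425.0, 935.0, 48)  # e.g., a DTU-like depth range
```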
3.3. Cost Volume Aggregation
The complete cost volume aggregation module consists of two components: the Epipolar Transformer aggregation module (ET) (Section 3.3.1) and the cross-scale cost volume information exchange module (CVIE) (Section 3.3.2). In this section, we introduce both components.
3.3.1. Epipolar Transformer
Cost volume construction aggregates the feature volumes from different source views to obtain depth information for each pixel of the reference view. Because conventional variance-based aggregation often struggles to filter out noise, this paper employs an epipolar Transformer to aggregate the feature volumes from different views. Specifically, the Transformer's cross-attention mechanism builds a 3D correlation along the epipolar line direction between the reference features (queries) and the source features (keys), and this cross-dimensional attention guides the aggregation of the feature volumes from different views, ultimately achieving cross-dimensional cost volume aggregation. The detailed structure of the module is illustrated in Figure 4.
Common shallow 2D CNNs can only extract texture features within a fixed receptive field and struggle to capture fine details in weakly textured regions. Therefore, this paper uses the cascaded U-Net of Section 3.1 for query construction. Guided by Equation (1), the projection transformation of the source-view features restores the depth information of the 2D query features. To ensure 3D consistency in depth space, a cross-attention mechanism along the epipolar line direction establishes the correlation between 2D semantics and 3D spatial depth. The 3D correlation is computed between the pixel features of the reference view (queries) and the source features mapped onto the epipolar line (keys), yielding the attention weights as shown in Equation (3).
where the scalar in the denominator is a temperature parameter, and the per-depth attention weights are stacked along the depth dimension. Previous studies [35,36] have shown that group-wise correlation reduces the computational and memory cost of cost volume construction. Therefore, this paper partitions the feature volumes into groups along the feature (channel) dimension. Based on the inner-product computation in Equation (4), the similarity between each source-view feature volume and the reference-view feature volume is computed group by group, and the result serves as the values of the cross-attention mechanism.
In Equation (4), the inner product is taken between corresponding channel groups of the reference and source feature volumes, and the group-wise similarities are stacked along the channel dimension. Finally, the values of the epipolar attention mechanism are aggregated for stage n under the guidance of the attention weights, resulting in the stage-wise aggregated cost volume, as detailed in Equation (5).
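The sketch below illustrates one plausible reading of Equations (3)–(5): per-depth cross-attention weights computed along the epipolar line, group-wise correlation as the attention values, and a weighted aggregation over source views. The normalization over views, the tensor shapes, and the function name are assumptions, since the exact formulas are given in the equations and Figure 4 rather than reproduced here.

```python
import torch
import torch.nn.functional as F

def epipolar_transformer_aggregation(ref_feat, warped_src_feats, num_groups=8, temperature=2.0):
    """Sketch of cross-attention cost volume aggregation along the epipolar line.

    ref_feat:         [B, C, H, W]            reference-view features (queries)
    warped_src_feats: list of [B, C, D, H, W] source feature volumes warped into the
                                              reference frustum (keys); the D samples of
                                              each volume lie along the epipolar line
    Returns a group-wise cost volume of shape [B, G, D, H, W].
    """
    B, C, H, W = ref_feat.shape
    G = num_groups
    ref_group = ref_feat.view(B, G, C // G, 1, H, W)

    weighted_sum, weight_sum = 0.0, 0.0
    for src_vol in warped_src_feats:
        # Cross-attention weight between the query pixel and each depth sample (cf. Eq. (3)):
        # dot product of reference and warped source features, softmax over the depth axis.
        attn = (ref_feat.unsqueeze(2) * src_vol).sum(dim=1) / temperature   # [B, D, H, W]
        attn = F.softmax(attn, dim=1).unsqueeze(1)                          # [B, 1, D, H, W]

        # Group-wise correlation (cf. Eq. (4)): inner product within each channel group.
        src_group = src_vol.view(B, G, C // G, src_vol.shape[2], H, W)
        corr = (ref_group * src_group).mean(dim=2)                          # [B, G, D, H, W]

        # Attention-guided aggregation over views (one reading of Eq. (5)).
        weighted_sum = weighted_sum + attn * corr
        weight_sum = weight_sum + attn
    return weighted_sum / (weight_sum + 1e-6)
```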
3.3.2. Cross-Scale Cost Volume Information Exchange
Traditional MVS algorithms often overlook the correlation between cost volumes at different scales, so no information is transferred between stages [5]. To address this limitation, this study introduces a cross-scale cost volume information exchange module, outlined in Figure 5. The module adopts a part of the Cascade Iterative Depth Estimation and Refinement (CIDER) [34] architecture, applying a lightweight regularization to coarsely regularize the stage-wise cost volume, which is then separated and integrated into the next stage. This process eliminates noise and fuses the information of the small-scale cost volume into the cost volume of the subsequent stage, thereby enhancing the quality of depth map estimation. Taking the (n − 1)-th stage as an example, the generated cost volume undergoes initial regularization to acquire sufficient contextual information and is then upsampled so that its depth and spatial dimensions are consistent with the cost volume generated in the next stage. The fusion of these two volumes yields the final cost volume for that stage.
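A simplified sketch of this exchange is given below, assuming a two-layer 3D convolution as the lightweight regularization and element-wise addition as the fusion operator; the actual CIDER-based regularization and fusion used in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CostVolumeInfoExchange(nn.Module):
    """Sketch of cross-scale cost volume fusion: the coarser stage's cost volume is
    lightly regularized, upsampled, and fused into the next stage's cost volume.
    Layer sizes are illustrative, not the paper's exact configuration."""

    def __init__(self, channels):
        super().__init__()
        self.light_reg = nn.Sequential(   # coarse regularization (simplified)
            nn.Conv3d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, coarse_cost, fine_cost):
        # coarse_cost: [B, C, D_coarse, H/2, W/2], fine_cost: [B, C, D_fine, H, W]
        reg = self.light_reg(coarse_cost)
        # Upsample depth and spatial dimensions to match the finer stage.
        up = F.interpolate(reg, size=fine_cost.shape[2:], mode="trilinear", align_corners=False)
        return fine_cost + up  # fused cost volume for the finer stage
```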
3.4. Dynamic Depth Range Sampling
An appropriate depth sampling range is crucial for covering the true depth values and thus for generating high-quality depth maps. Conventional methods typically consider only the per-pixel distribution of the probability volume and adjust the depth sampling range of the next stage based on this information. Zhang et al. [37] introduced an approach that leverages the information entropy of the probability volume to fuse feature volumes from different viewpoints. Motivated by this, we propose an uncertainty module to adapt the depth sampling range. This module takes the information entropy of the Stage 1 probability volume as input to assess the reliability of the depth inference; a higher output indicates greater uncertainty in the current pixel's depth estimate. Consequently, in Stage 0, the depth sampling range is expanded correspondingly to fully cover the true depth values, as illustrated in Figure 1. The module comprises five convolutional layers with activation functions and produces output values between 0 and 1, where higher values signify greater uncertainty. The uncertainty interval for each pixel in the next stage is calculated using Equation (6).
where the terms are, respectively, the hyperparameter defining the confidence interval, the entropy map of the probability volume, the output of the uncertainty module applied to the probability volume, and the predicted depth value of the current pixel.
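The sketch below illustrates the idea: a five-layer convolutional head maps the entropy of the probability volume to a per-pixel uncertainty in (0, 1), which then widens the sampling interval around the predicted depth. Because Equation (6) is not reproduced here, the exact interval formula, the channel widths, and the hyperparameter `lam` are assumptions.

```python
import torch
import torch.nn as nn

class UncertaintyModule(nn.Module):
    """Five convolutional layers mapping the entropy map to an uncertainty score in (0, 1).
    Channel widths are assumptions."""

    def __init__(self, ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, prob_volume, depth_pred, lam=1.0):
        # prob_volume: [B, D, H, W] softmax probabilities; depth_pred: [B, H, W].
        entropy = -(prob_volume * torch.log(prob_volume.clamp(min=1e-6))).sum(dim=1, keepdim=True)
        u = self.net(entropy)  # [B, 1, H, W] learned uncertainty in (0, 1)
        # Assumed form of the adaptive interval: widen the range around the predicted
        # depth in proportion to the entropy and the learned uncertainty (cf. Equation (6)).
        radius = lam * (1.0 + u.squeeze(1)) * entropy.squeeze(1)
        return depth_pred - radius, depth_pred + radius  # per-pixel sampling interval
```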
3.5. Multi-Stage Joint Learning
3.5.1. Cross-Entropy Based Learning Objective
The regularization operation yields a probability volume storing the matching probability between each pixel and each depth hypothesis. Instead of using a SmoothL1 loss to minimize the difference between predicted and ground-truth depth values, the paper treats depth estimation as a classification problem over the sampled depth values. In Stages 0 and 2, a cross-entropy loss quantifies the difference between the true and predicted probability distributions of each pixel.
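A minimal sketch of this classification objective, assuming a softmax probability volume, per-pixel depth hypotheses, and a one-hot target at the hypothesis nearest the ground-truth depth; the function name is hypothetical.

```python
import torch

def depth_classification_loss(prob_volume, depth_gt, depth_hypotheses, mask):
    """Cross-entropy over depth hypotheses instead of SmoothL1 regression.

    prob_volume:      [B, D, H, W]  softmax output of the regularization network
    depth_gt:         [B, H, W]     ground-truth depth
    depth_hypotheses: [B, D, H, W]  per-pixel depth samples
    mask:             [B, H, W]     valid-pixel mask (bool)
    """
    # One-hot target: the hypothesis closest to the ground-truth depth.
    gt_index = torch.argmin((depth_hypotheses - depth_gt.unsqueeze(1)).abs(), dim=1)  # [B, H, W]
    log_prob = torch.log(prob_volume.clamp(min=1e-6))
    ce = -torch.gather(log_prob, 1, gt_index.unsqueeze(1)).squeeze(1)                 # [B, H, W]
    return ce[mask].mean()
```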
3.5.2. Uncertainty-Based Learning Objectives
In Stage 1, the depth sampling range for Stage 0 is dynamically adjusted based on the per-pixel uncertainty of the probability volume. In addition, a negative log-likelihood minimization constraint is incorporated into the loss function to jointly learn the depth value classification and its uncertainty. The loss function of this stage is given in Equation (8).
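As an illustration, the sketch below uses a standard heteroscedastic negative log-likelihood form that scales the per-pixel cross-entropy by the predicted uncertainty and adds a log-uncertainty penalty; the paper's Equation (8) may differ in its exact form.

```python
import torch

def uncertainty_nll_loss(ce_map, uncertainty, mask, eps=1e-6):
    """Jointly learn depth classification and its uncertainty (cf. Equation (8)).

    ce_map:      [B, H, W] per-pixel cross-entropy from the stage-1 probability volume
    uncertainty: [B, H, W] output of the uncertainty module, in (0, 1)
    mask:        [B, H, W] valid-pixel mask (bool)
    """
    loss = ce_map / (uncertainty + eps) + torch.log(uncertainty + eps)
    return loss[mask].mean()
```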
3.5.3. Joint Learning Objective
The constant weights assigned to the learning objectives of the three stages control their relative contributions. The overall goal of multi-stage joint learning is to minimize the total loss function, defined as follows:
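A trivial sketch of the combined objective, with placeholder stage weights:

```python
def total_loss(stage_losses, weights=(0.5, 1.0, 2.0)):
    """Weighted sum of the per-stage objectives.

    stage_losses: iterable of three scalar losses (Stages 0, 1, 2).
    weights:      placeholder stage weights, not the values used in the paper.
    """
    return sum(w * l for w, l in zip(weights, stage_losses))
```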
5. Conclusions
This paper proposes an uncertainty-epipolar Transformer multi-view stereo network (U-ETMVSNet) for object stereo reconstruction. An enhanced cascaded U-Net is employed to strengthen feature extraction and query construction for the epipolar Transformer. The epipolar Transformer, together with the cross-scale information exchange module, enhances the correlation of cross-dimensional information during cost volume aggregation and ensures 3D consistency in depth space. Dynamically adjusting the depth sampling range based on the uncertainty of the probability volume also improves stability when reconstructing weakly textured regions, and reconstruction performance remains strong even at lower depth sampling rates. Finally, the multi-stage joint learning method based on multi-depth-value classification further improves reconstruction accuracy. The proposed method exhibits excellent completeness, accuracy, and generalization ability on the DTU and Tanks&Temples datasets, comparable to existing mainstream CNN-based MVS networks. However, the algorithm retains the common 3D CNN regularization modules and therefore offers no significant advantage in memory usage. Future work will explore the role of Transformers in dense feature matching as a replacement for CNN regularization, making the model more practical to deploy on mobile devices.