Article

Multi-View Three-Dimensional Reconstruction Based on Feature Enhancement and Weight Optimization Network

1 School of Surveying and Geo-Informatics, Shandong Jianzhu University, Jinan 250101, China
2 Shandong Provincial Institute of Land Surveying and Mapping, Jinan 250101, China
3 Shandong Zhengyuan Aerial Remote Sensing Technology Co., Ltd., Jinan 250101, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2025, 14(2), 43; https://doi.org/10.3390/ijgi14020043
Submission received: 26 October 2024 / Revised: 9 January 2025 / Accepted: 22 January 2025 / Published: 24 January 2025

Abstract

To address the insufficient adaptability of existing multi-view stereo reconstruction methods to repetitive and weak textures in multi-view images, this paper proposes a three-dimensional (3D) reconstruction algorithm based on Feature Enhancement and Weight Optimization MVSNet (abbreviated as FEWO-MVSNet). To obtain accurate and detailed global and local features, we first develop an adaptive feature enhancement approach to obtain multi-scale information from the images. Second, we introduce an attention mechanism and a spatial feature capture module to enable high-sensitivity detection of weak texture features. Third, based on a 3D convolutional neural network, fine depth maps for the multi-view images are predicted and the complete 3D model is subsequently reconstructed. Last, we evaluated the proposed FEWO-MVSNet through training and testing on the DTU, BlendedMVS, and Tanks and Temples datasets. The results demonstrate the clear superiority of our method for 3D reconstruction from multi-view images, ranking first in accuracy and second in completeness when compared with representative existing methods.

1. Introduction

Multi-view stereo (MVS) reconstruction focuses on using two or more perspective photographs to recover the geometric surface structure of a target scene [1]. It is a popular technique [2] in domains such as real-scene 3D construction, historic heritage restoration and preservation, and other related areas [3]. However, the 3D reconstruction of multi-view images with weak textures and repetitive patterns remains challenging in both digital photogrammetry and computer vision [4].
Despite the significant progress made by traditional MVS reconstruction methods such as Gipuma [5] and COLMAP [6], the accuracy and completeness of the reconstructed point clouds in complex scenes remain insufficient; these limitations are what we aim to address. In recent years, the development of 3D reconstruction has been strongly boosted by the introduction of convolutional neural networks (CNNs) into scene depth estimation. Han et al. proposed MatchNet [7], which mainly consists of a metric model composed of three fully connected layers to compute the similarity between the features to be matched. To improve the automation of the method, Yao et al. constructed the end-to-end depth estimation network MVSNet [8], which significantly improves the accuracy of the target depth information but consumes a large amount of memory when constructing the cost volume and computing the feature information of multiple views. For this reason, Yao et al. further proposed the recurrent stereo network R-MVSNet [9], which regularizes the 2D feature cost maps with gated recurrent units, effectively reducing the memory footprint of the model. Gu et al. incorporated cascaded convolutional layers and multi-scale strategies into the network and proposed a multi-scale depth estimation model called CasMVSNet [10]; their experiments validate the advantages of this method in terms of accuracy and efficiency. Wang et al., inspired by the traditional PatchMatch algorithm, proposed the deep learning matching network PatchMatchNet [11], which quickly finds the best matching region in an image by introducing global and local optimization strategies and is able to generate depth information for complex environments. To enhance feature extraction in multi-scale space, ASPPMVSNet [12] introduces Atrous Spatial Pyramid Pooling (ASPP) into the model, using dilated convolutions for multi-scale feature extraction. To improve the correlation of feature information in both the channel and spatial dimensions during feature extraction, DRI-MVSNet [13] builds upon CasMVSNet by introducing a Channel and Spatial Combined Processing (CSCP) module to capture relevant contextual information.
While traditional methods and CNN-based networks have significantly improved depth estimation in multi-view stereo (MVS) geometry, transformers [14] have a strong ability to capture global contextual features in MVS tasks, which significantly enhances feature representation and injects new impetus into the development of 3D reconstruction. Both MVSTER [15] and MVSTR [16] leverage transformer architectures to improve MVS reconstruction. MVSTR focuses on enhancing feature matching through global context and inter-view interactions with its transformer-based modules, whereas MVSTER emphasizes optimizing depth estimation through epipolar geometry, monocular depth enhancement, and cascade refinement. Both methods achieve good performance, with MVSTER being more efficient and MVSTR being more robust across benchmarks. However, the critical issues of weak and repetitive textures must still be addressed in MVS 3D reconstruction to improve the accuracy and completeness of the model. To this end, Zhu et al. proposed DCNv2 [17], which improves the original DCN (Deformable Convolutional Network) by using convolutions with deformable receptive fields, effectively alleviating the loss of feature information during propagation. Ding et al. constructed the feature matching transformer TransMVSNet [18], inserting an Adaptive Receptive Field (ARF) module after the FPN; this approach improves the reconstruction quality of texture-deprived regions by adaptively expanding the receptive field. Weak textures produce sparse features and matching ambiguity, while repetitive textures easily lead to mismatches, further degrading reconstruction quality. These challenges call for algorithms that improve reconstruction quality in such scenarios.
The above methods show great advantages and potential in multi-view 3D reconstruction and provide useful references for higher-level 3D reconstruction. However, tests of these methods in various scenarios show that they often struggle to reconstruct a complete and reliable 3D model in regions with repetitive or weak textures. Inspired by TransMVSNet, this paper proposes a 3D reconstruction algorithm based on Feature Enhancement and Weight Optimization MVSNet (FEWO-MVSNet). Our method first extracts global and local feature information from the multi-view images using an adaptive feature enhancement module, then introduces an adaptive weight allocation mechanism to estimate fine depth maps for scenes with weak or repeated textures, and finally reconstructs the dense 3D point cloud. Training and testing on multiple international standard datasets show that FEWO-MVSNet achieves improvements in both accuracy and completeness.

2. Methodology

As shown in Figure 1, we designed a 3D reconstruction network called FEWO-MVSNet based on adaptive feature enhancement and weight allocation, using CasMVSNet as the base network. The improvements in our method include the following:
  • Introduce an Adaptive Spatial Feature Fusion (ASFF) module [19] on the basis of a Feature Pyramid Network (FPN) to enhance the capability of capturing feature information at different scales;
  • Adaptively expand the search range of features through the deformable receptive field convolution module DCNv2 [17] and combine it with the transformer positional encoder to enhance the aggregation of global contextual feature information;
  • Design an Adaptive Space Weight Allocation (ASWA) module, which is integrated with SENet [20], to highlight the low-frequency information in the convolutional channel and color space and thereby realize the dense extraction of feature information from multi-view images containing weak- and repetitive-texture regions.
The algorithm of FEWO-MVSNet can be briefly summarized as follows. Firstly, the multi-scale features of each image are extracted at three resolutions using the FPN. Secondly, because the scales of different features are inconsistent, the feature maps are input into the Adaptive Spatial Feature Fusion (ASFF) module to achieve information enhancement and feature normalization. Thirdly, to improve the accuracy of multi-view feature matching and aggregation, the deformable receptive field module DCNv2 and the transformer feature aggregation module are combined to aggregate the updated feature information (to emphasize the adequately captured features, we spatially weight them using SENet and ASWA). Fourthly, using differentiable homography, we warp the multi-view image features, and the cost volume is generated based on the depth correlation weighting method and the feature transform. Then, cost volume regularization is performed by the 3D CNN to generate the probability volume from which the depth maps are predicted. Lastly, we fuse the depth maps using a concatenated scheme and obtain dense point clouds of the target scene. In all, we realize end-to-end 3D reconstruction based on FEWO-MVSNet.

2.1. Feature Extraction Networks

Deep learning-based MVS methods typically use downsampling in the feature extraction module to expand the receptive field while reducing image resolution to save memory. The final feature map, produced at the lowest resolution, is directly input into the subsequent networks. This process often causes texture information loss, which degrades the accuracy of the reconstruction results. To address this, we introduce an FPN to replace the original CNN and realize feature extraction at different scales from the top to the bottom of the pyramid. Through extensive tests, we find that feature maps from different layers of the pyramid vary in resolution, receptive field, and semantic representation. These scale differences cause information loss during top-down feature fusion; the loss of critical feature information leads to inaccurate scene reconstruction, missing local details, or surface discontinuities. Therefore, the Adaptive Spatial Feature Fusion (ASFF) module is introduced after the FPN feature extraction network to obtain the optimal weight of each channel through adaptive learning, which helps maintain the scale invariance of the pyramid features and realizes the intelligent extraction of complex texture features.
The improved feature extraction network is shown in Figure 2 and works as follows. Firstly, given an input image with W × H resolution together with its elements of interior and exterior orientation, the image is fed into the bottom-up feature extraction path, which reduces the resolution of the image by 2 × 2 downsampling to increase its semantic information. Secondly, the network uses the top-down path to fuse high-level and low-level images based on 2 × 2 upsampling. Thirdly, the top-down feature fusion path is connected to the bottom-up feature extraction path through lateral connections, which combine the high-resolution information of the left-part images with the rich semantic information of the right-part images. Fourthly, we incorporate the Adaptive Spatial Feature Fusion (ASFF) module into the FPN [21]. Finally, we obtain three feature channels with resolutions of W/4 × H/4, W/2 × H/2, and W × H. The adaptive weighting coefficients (α, β, and γ) and the corresponding feature layers L_i are merged by weighting, and the fused feature Z_ij^l is obtained as follows.
Z_ij^l = α_ij^l · L_ij^(1→l) + β_ij^l · L_ij^(2→l) + γ_ij^l · L_ij^(3→l)    (1)
where L_ij^(n→l) represents the feature vector at position (i, j) of the feature map resized from level n to level l. As shown in Equation (1), the fused feature Z_ij^l is computed using an additive approach; therefore, the output features from levels 1 to 3 must have the same size and number of channels when being added. To achieve this, features from different levels are upsampled or downsampled, and their channel numbers are adjusted. The weight parameters α, β, and γ are obtained by applying a 1 × 1 convolution to the resized feature maps from levels 1 to 3. After concatenation, the parameters undergo a SoftMax operation [22] using Equation (2), which normalizes their values to the range [0, 1] and ensures that their sum equals one.
α_ij^l = e^(λ_α) / (e^(λ_α) + e^(λ_β) + e^(λ_γ))    (2)
where α, β, and γ are generated from the three SoftMax control parameters λ_α, λ_β, and λ_γ (β and γ are computed analogously to Equation (2)). In this way, the features from all levels can be adaptively fused at each scale.
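For illustration, a minimal PyTorch sketch of the fusion in Equations (1) and (2) is given below, assuming three feature maps that have already been resized to a common resolution; the module name, channel sizes, and layout are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusion(nn.Module):
    """Minimal adaptive spatial feature fusion over three same-size feature maps."""
    def __init__(self, channels):
        super().__init__()
        # One 1x1 convolution per level produces a scalar control map (lambda_alpha, lambda_beta, lambda_gamma).
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)]
        )

    def forward(self, l1, l2, l3):
        # l1, l2, l3: (B, C, H, W) features already resized to level l.
        logits = torch.cat([conv(x) for conv, x in zip(self.weight_convs, (l1, l2, l3))], dim=1)
        weights = F.softmax(logits, dim=1)            # alpha, beta, gamma sum to 1 per pixel (Equation (2))
        alpha, beta, gamma = weights[:, 0:1], weights[:, 1:2], weights[:, 2:3]
        return alpha * l1 + beta * l2 + gamma * l3    # Equation (1)

# Usage: fuse = ASFFFusion(32); z = fuse(f1, f2, f3) with each f* of shape (B, 32, H, W).
```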

2.2. Aggregation and Enhancement for Features Based on Transformer

2.2.1. Adaptive Enlargement of Receptive Field

Due to the significant difference in the receptive field of contextual information between the FPN and the transformer, feature information is easily lost during transmission. To ensure correct transmission, we add DCNv2 between the FPN and transformer modules. In the convolution operation, DCNv2 applies adaptive texture offsets that are learned by the network. It can adaptively adjust the receptive field according to the local features of the image and thus extract texture information of different brightnesses and types. The offset-compensated sampling is computed by Equation (3) as follows.
M(x_0) = Σ_{x_n ∈ R} ε(x_n) · l(x_0 + x_n + Δx_n) · m_k    (3)
where M(x_0) represents the sampled intensity after offsetting, ε(x_n) is the deformable convolution parameter, l is the input feature map, x_0 is the sampling point in the feature map, x_n is a sampling point of the regular grid R, Δx_n is the learned offset, and m_k is the modulation weight of the sampling point.
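A minimal sketch of modulated deformable convolution in the DCNv2 style follows, using torchvision and assuming a torchvision build whose DeformConv2d accepts a modulation mask; the offsets Δx_n and modulation weights m_k of Equation (3) are predicted by an auxiliary convolution. The block name and channel choices are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d  # mask (modulation) support requires a recent torchvision

class ModulatedDeformBlock(nn.Module):
    """DCNv2-style block: offsets and modulation masks are learned from the input itself."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        k2 = kernel_size * kernel_size
        # 2*k^2 offset channels (x and y per sampling point) + k^2 modulation channels.
        self.offset_mask = nn.Conv2d(channels, 3 * k2, kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        k2 = self.offset_mask.out_channels // 3
        om = self.offset_mask(x)
        offset, mask = om[:, : 2 * k2], torch.sigmoid(om[:, 2 * k2:])
        return self.deform(x, offset, mask)  # Equation (3): modulated sampling at shifted points

# Usage: block = ModulatedDeformBlock(32); y = block(torch.randn(1, 32, 64, 80))
```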

2.2.2. Feature Encoding Based on Transformer

MVS reconstruction demands the comprehensive exploitation of information from the entire scene for depth estimation. However, traditional CNNs often encounter limitations when dealing with large-scale contextual information. The transformer mechanism is capable of establishing relationships among pixels within a global scope; it can make full use of global contextual information and improve the distinctiveness of features. To enhance the accuracy and robustness of the feature maps, we follow the FPN with a transformer feature encoding module [14] that includes self-attention and cross-attention units. The extracted feature map L is flattened into a one-dimensional vector, and the positional encoding is then calculated from the flattened feature information by Equation (4).
PE = [sin(pos / 10000^(n/d_model)); cos(pos / 10000^(n/d_model))]    (4)
where PE denotes the positional encoding, pos denotes the position of each pixel, n indexes the encoding dimension, and d_model represents the output dimension of the encoding–decoding model; the sine and cosine functions alternately express the encoding at each position.
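A minimal sketch of the sinusoidal encoding of Equation (4) is shown below for flattened feature tokens; it assumes an even embedding dimension and is a generic illustration rather than the exact encoder used in FEWO-MVSNet.

```python
import torch

def sinusoidal_position_encoding(num_positions, d_model):
    """Sine/cosine positional encoding as in Equation (4): even dims use sin, odd dims use cos (d_model even)."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (P, 1) pixel positions
    n = torch.arange(0, d_model, 2, dtype=torch.float32)                  # even dimension indices
    div = torch.pow(10000.0, n / d_model)                                 # 10000^(n / d_model)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe                                                             # (P, d_model), added to the flattened tokens

# Usage: tokens = tokens + sinusoidal_position_encoding(tokens.shape[1], tokens.shape[2])
```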
The transformer module applies positional encoding to the reference features Z_0 and the input features Z_i, which enhances positional consistency and robustness across different resolutions. The self-attention mechanism processes the encoded features to extract global dependencies, while the cross-attention mechanism enables interactions between the reference features and the input features to capture contextual information; these processes update and refine the feature information. Suppose that Z_i represents the initial feature information; its enhancement proceeds as follows. Firstly, three attention vectors, namely Query (Q), Key (K), and Value (V), are input into the feature encoder simultaneously. The context information of V is retrieved from the corresponding Q and K through multi-layer stacking of the self-attention mechanism, which also captures the dependencies in the input sequence and produces intermediate feature information. Secondly, the cross-information between the reference image and the input image is captured by the cross-attention mechanism. Finally, the feature information is updated; the updated result comprises dense features with global and local context information.
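As a hedged sketch of the self-/cross-attention scheme described above, the PyTorch snippet below applies self-attention within each view and cross-attention from a source view to the reference view using nn.MultiheadAttention; module names, head counts, and token layouts are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class IntraInterAttention(nn.Module):
    """Self-attention within a view, then cross-attention from a source view to the reference view."""
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, ref_tokens, src_tokens):
        # ref_tokens, src_tokens: (B, P, d_model) flattened feature tokens with positional encoding added.
        ref, _ = self.self_attn(ref_tokens, ref_tokens, ref_tokens)   # Q = K = V = reference tokens
        src, _ = self.self_attn(src_tokens, src_tokens, src_tokens)
        # Cross-attention: source tokens query context from the self-attended reference tokens.
        src_updated, _ = self.cross_attn(src, ref, ref)
        return ref, src_updated

# Usage: attn = IntraInterAttention(); r, s = attn(torch.randn(1, 1024, 64), torch.randn(1, 1024, 64))
```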

2.3. Adaptive Allocation for Feature Weights

To obtain key feature information across dimensions adaptively, we integrate SENet with ASWA in our model to generate adaptive weights for features, which is beneficial for handling multi-view images with various types of textures. The adaptive allocation of feature weights is illustrated in Figure 3.
The strategy described in Section 2.2 can effectively enhance the perception of the model so that it recognizes as many weak features as possible. However, it may also accumulate redundant feature information, resulting in high computational complexity and preventing the model from capturing key features. To dynamically adjust the weights of different feature regions, we introduce SENet into our model to effectively extract weak-texture features and optimize the channel attention network. The correlation of feature channels is enhanced through squeeze, excitation, and adaptive weighting operations. Specifically, the key feature information z_i is produced by Equation (5).
z_i = Z_i · sigmoid(φ_2 · ReLU(φ_1 · GAP(L_i)))    (5)
where sigmoid and ReLU are the activation functions, φ_1 and φ_2 denote the weight factors of the fully connected layers, GAP denotes the global average pooling function, and L_i represents an arbitrary feature layer.
The SENet is trained to adaptively adjust the weight coefficients φ_1 and φ_2 through forward and backward propagation, which guarantees the reliability of the feature information in all channels. The weight allocation for 3D spatial feature extraction is similarly important. To adaptively adjust the spatial feature weights according to the feature textures, the ASWA module is incorporated after the SENet; the sigmoid activation function then normalizes the weight coefficients, enabling the robust capture of feature information from each spatial channel.
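The sketch below illustrates the channel reweighting of Equation (5) followed by a simple sigmoid spatial gate. The SE part follows the standard squeeze-and-excitation recipe; the spatial gate is only a hypothetical stand-in for ASWA, since the exact ASWA design is not fully specified here.

```python
import torch
import torch.nn as nn

class ChannelSpatialReweight(nn.Module):
    """Squeeze-excitation channel weights (Equation (5)) followed by a simple spatial gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                       # GAP
        self.fc = nn.Sequential(                                 # phi_1, ReLU, phi_2, sigmoid
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Hypothetical spatial weighting: a single-channel sigmoid map over the feature plane.
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.fc(self.gap(x))   # channel reweighting, Equation (5)
        return x * self.spatial(x)     # spatial reweighting (illustrative ASWA stand-in)

# Usage: m = ChannelSpatialReweight(32); y = m(torch.randn(1, 32, 64, 80))
```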

2.4. Correlation Volume Construction and Loss Function Estimation

2.4.1. Correlation Volume Construction

Similar to most existing deep learning-based MVSNet methods, all input image features Z_i are aligned to the reference image features Z_0 through differentiable warping [23]. To start with, the transformation between point p_i of the reference image and the corresponding point p_i′ of the input image under the depth hypothesis d_i is expressed by Equation (6).
p_i′ = K(R · K_0^(−1) · p_i · d_i + t)    (6)
where R and t denote the rotation matrix and translation vector between the input image and the reference image, respectively, and K_0 and K are the calibration matrices of the reference and input cameras. Next, we calculate the correlation between corresponding points based on Equation (6) and bilinear interpolation. The correlation metric is expressed by Equation (7).
c^(d_i)(p_i) = ⟨F_0(p_i), F^(d_i)(p_i′)⟩    (7)
where F_0(p_i) is the feature of the reference image at p_i, F^(d_i)(p_i′) represents the feature of the i-th input image at depth d_i, and c^(d_i)(p_i) is the correlation coefficient between the input image and the reference image at pixel p_i. To decrease the memory consumption of correlation volume regularization, the number of channels is reduced to one. To retain the maximum correlation along the depth dimension, we introduce the depth correlation weighting method [18] to aggregate the correlation volume C(p_i) by Equation (8).
C(p_i) = Σ_{i=1}^{X−1} max_{d_i}(c^(d_i)(p_i)) · c^(d_i)(p_i)    (8)
We use a 3D CNN to regularize the correlation volume. The 3D CNN is composed of 3D convolutional operations that move across all dimensions of the correlation volume, yielding the probability volume P from which depth information can be predicted. To reduce the memory consumption and computational complexity of the 3D CNN, we first reduce the number of channels in the feature map to one, thereby cutting down the feature storage per pixel; additionally, we aggregate the X − 1 pairs of correlation volumes using Equation (8) to suppress redundant or mismatched information. These two measures help optimize the 3D CNN computation.
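The following sketch covers Equations (6) through (8): differentiable warping of source features to the reference view for each depth hypothesis, a per-depth inner-product correlation, and aggregation weighted by each pixel's maximum correlation over depth. Tensor layouts, function names, and the averaged inner product are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def warp_src_to_ref(src_feat, K_src, K_ref, R, t, depth_hyps):
    """Differentiable warping (Equation (6)): sample source features at p_i' for every depth hypothesis."""
    B, C, H, W = src_feat.shape
    D = depth_hyps.shape[1]                                                  # depth_hyps: (B, D)
    device = src_feat.device
    y, x = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                          torch.arange(W, device=device, dtype=torch.float32), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).view(3, -1).unsqueeze(0).expand(B, -1, -1)
    cam = torch.inverse(K_ref) @ pix                                         # K_0^{-1} p_i, (B, 3, H*W)
    cam = cam.unsqueeze(1) * depth_hyps.view(B, D, 1, 1)                     # scale by each d_i -> (B, D, 3, H*W)
    proj = K_src.unsqueeze(1) @ (R.unsqueeze(1) @ cam + t.view(B, 1, 3, 1))  # K(R K_0^{-1} p_i d_i + t)
    xy = proj[:, :, :2] / (proj[:, :, 2:3] + 1e-6)
    gx = 2.0 * xy[:, :, 0] / (W - 1) - 1.0                                   # normalize for grid_sample
    gy = 2.0 * xy[:, :, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear", align_corners=True)
    return warped.view(B, C, D, H, W)

def depth_correlation_volume(ref_feat, warped_src_feats):
    """Equations (7)-(8): per-depth correlation, weighted by the max correlation over depth, summed over views."""
    agg = None
    for warped in warped_src_feats:                                          # each: (B, C, D, H, W)
        corr = (ref_feat.unsqueeze(2) * warped).mean(dim=1)                  # <F_0, F^{d_i}> -> (B, D, H, W)
        weighted = corr.max(dim=1, keepdim=True).values * corr               # max_d(c) * c
        agg = weighted if agg is None else agg + weighted
    return agg                                                               # single-channel volume for the 3D CNN
```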

2.4.2. Loss Function Estimation

To highlight the key information in challenging texture regions and improve the accuracy of depth estimation, we combine the focal loss F_L [24] and the smooth loss S_L [25] as in Equation (9). This adaptively adjusts the weighting between background points and model points and thus improves the learning rate of the key regions.
Loss = F_L + S_L    (9)
F_L = −α_FL · (1 − P^(d̂)(p_i))^γ · log(P^(d̂)(p_i))    (10)
where P^(d̂)(p_i) denotes the predicted probability at pixel p_i for the depth hypothesis d̂, d̂ denotes the hypothesis closest to the ground-truth depth, and γ is the focusing parameter. α_FL is a balancing parameter used to adjust the influence between the background and the model. Empirically, γ = 0 is suitable for relatively simple scenarios and γ = 2 fits more complicated scenarios.
S_L = 0.5·s²,  |s| ≤ 1;  |s| − 0.5,  |s| > 1    (11)
where s denotes the gap between the predicted value and the target value. S_L adjusts the model robustly through |s| and restricts the gradient of outliers in two ways: when |s| is large, the gradient value is prevented from becoming excessively large; when |s| is small, the gradient value remains sufficiently small. S_L therefore responds smoothly to outliers during training. Finally, we use the combination of F_L and S_L as the loss function for model training to estimate depth information.
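A hedged sketch of the combined loss of Equations (9) through (11) follows, pairing a focal term on the probability volume with PyTorch's built-in smooth L1 term on the regressed depth; the tensor layouts and the per-pixel index of the closest hypothesis are assumptions.

```python
import torch
import torch.nn.functional as F

def fewo_loss(prob_volume, depth_pred, depth_gt, depth_index_gt, alpha_fl=1.0, gamma=2.0):
    """Combined loss: focal loss on the probability volume + smooth L1 loss on the predicted depth."""
    # prob_volume: (B, D, H, W) softmax probabilities over depth hypotheses.
    # depth_index_gt: (B, H, W) long indices of the hypothesis closest to the ground truth.
    # depth_pred / depth_gt: (B, H, W) depth maps.
    p_true = torch.gather(prob_volume, 1, depth_index_gt.unsqueeze(1)).squeeze(1).clamp(min=1e-6)
    focal = -alpha_fl * (1.0 - p_true) ** gamma * torch.log(p_true)           # Equation (10)
    smooth = F.smooth_l1_loss(depth_pred, depth_gt, reduction="none")         # Equation (11)
    return (focal + smooth).mean()                                            # Equation (9)
```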

2.5. Depth Map Filtering and Fusion

In this part, we reconstruct a complete 3D point-cloud model as follows. Firstly, the outliers and background points of the depth maps are filtered through photometric and geometric constraints, with the filtering parameters adjusted dynamically [26]. Secondly, we convert the pixel coordinates of the valid points in the filtered depth maps into world coordinates and extract the corresponding color information for each point. Finally, we fuse the depth maps to generate the final dense 3D point cloud.
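As a heavily simplified sketch of the back-projection step, the snippet below filters a depth map by a photometric confidence threshold and lifts the remaining pixels into world coordinates. The geometric cross-view consistency check and the dynamic parameter adjustment of [26] are omitted, the threshold is illustrative, and the extrinsic convention p_cam = R·p_world + t is an assumption.

```python
import torch

def backproject_filtered_depth(depth, conf, K, R, t, conf_thresh=0.5):
    """Filter a depth map by photometric confidence and back-project valid pixels to world points."""
    H, W = depth.shape
    valid = conf > conf_thresh                                                # photometric constraint (illustrative)
    y, x = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).float().view(3, -1)  # homogeneous pixels (3, H*W)
    cam = torch.inverse(K) @ pix * depth.view(1, -1)                          # camera-frame 3D points
    world = R.t() @ (cam - t.view(3, 1))                                      # assumes p_cam = R p_world + t
    return world.t()[valid.view(-1)]                                          # (N_valid, 3) point cloud
```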

3. Results and Discussion

3.1. Experimental Datasets

Three large open datasets—DTU, BlendedMVS and Tanks and Temples—are chosen to train our FEWO-MVSNet and the representative models. A detailed introduction to these datasets is shown in Table 1.

3.2. Experimental Details

We train FEWO-MVSNet on the DTU dataset with PyTorch 1.11.0. In the training phase, we set the number of input images to N = 5 and the resolution to 640 × 512 pixels. The depth hypotheses are sampled from 425 mm to 935 mm. FEWO-MVSNet adopts a three-stage cascaded network, with 48, 32, and 8 plane-sweeping depth hypotheses in the respective stages. To test the generalization of FEWO-MVSNet, we employ BlendedMVS to fine-tune the model; the input images are likewise set to N = 5, with a resolution of 768 × 576 pixels. FEWO-MVSNet is trained using Adam for 16 epochs with an initial learning rate of 0.001, which is halved after the 6th, 8th, and 12th epochs. We train FEWO-MVSNet on an NVIDIA GeForce RTX 3090 with a batch size of one.
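The optimization schedule stated above (Adam, initial learning rate 0.001, halved after the 6th, 8th, and 12th epochs, 16 epochs, batch size one) can be expressed as follows; the model and data loader here are dummy stand-ins so the schedule runs in isolation, not the paper's actual network or dataset pipeline.

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the schedule can be executed on its own; the real model/loader come from the paper's setup.
model = nn.Conv2d(3, 8, 3, padding=1)
train_loader = [torch.randn(1, 3, 512, 640) for _ in range(4)]   # batch size 1, 640 x 512 inputs

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6, 8, 12], gamma=0.5)

for epoch in range(16):
    for images in train_loader:
        optimizer.zero_grad()
        loss = model(images).mean()        # placeholder loss; the real objective is Equation (9)
        loss.backward()
        optimizer.step()
    scheduler.step()                       # learning rate halved after the 6th, 8th, and 12th epochs
```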

3.3. Result and Analysis

3.3.1. Comparative Experiments

The effectiveness of FEWO-MVSNet is validated on the DTU dataset. We evaluate the Accuracy (Acc), Completeness (Comp), and Overall scores of the reconstructed 3D point cloud using the MATLAB 2018 code provided by DTU. The Acc and Comp scores measure the average distances between the ground-truth point cloud and the reconstructed point cloud. The Overall score is the average of Acc and Comp, computed by Equation (12).
Overall = (Acc + Comp) / 2    (12)
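For orientation, Acc- and Comp-style distances can be computed as mean nearest-neighbour distances between the reconstructed and ground-truth point clouds, as sketched below; the official DTU MATLAB evaluation additionally applies observation masks and distance truncation, which this sketch omits, and brute-force pairwise distances are only practical for small clouds.

```python
import torch

def point_cloud_metrics(pred, gt):
    """Mean nearest-neighbour distances: Acc (pred -> gt), Comp (gt -> pred), Overall = their average."""
    # pred: (N, 3) reconstructed points; gt: (M, 3) ground-truth points.
    d = torch.cdist(pred, gt)                      # (N, M) pairwise distances
    acc = d.min(dim=1).values.mean()               # distance from each reconstructed point to the ground truth
    comp = d.min(dim=0).values.mean()              # distance from each ground-truth point to the reconstruction
    return acc, comp, (acc + comp) / 2             # Equation (12)

# Usage: a, c, o = point_cloud_metrics(torch.randn(1000, 3), torch.randn(1200, 3))
```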
(1) Lower values of the three metrics Acc, Comp, and Overall indicate that the reconstructed point cloud is closer to the real point cloud. We compare FEWO-MVSNet with the traditional methods (Gipuma, COLMAP), deep learning methods (MVSNet, R-MVSNet, CasMVSNet, DRI-MVSNet, ASPPMVSNet, PatchMatchNet), and transformer-based deep learning methods (MVSTR, MVSTER, TransMVSNet). Table 2 shows the quantitative comparison results on the DTU dataset. Compared to the classic deep learning algorithms MVSNet and CasMVSNet, FEWO-MVSNet improves accuracy by 8% and 1.2% while enhancing completeness by 15% and 4.3%. The accuracy increases by 2% and 3.7% compared to the transformer-based TransMVSNet and MVSTER algorithms.
(2) Figure 4 shows the dense point-cloud reconstruction results. FEWO-MVSNet improves the reconstruction of weak/repetitive textures by combining weight optimization and feature enhancement. In the red boxes in Figure 4, we can see the oscilloscope dial and side in scan11, the left and right sides of the beer packaging in scan12, the top of the cup in scan48, and the top of the sculpture in scan118. Our method enhances the texture information in the blank areas of these scenes, and FEWO-MVSNet produces denser and more complete point clouds while preserving more details.
(3) We conduct comparative experiments between the typical TransMVSNet and our FEWO-MVSNet in four scenes with repetitive/weak textures, as shown in Figure 5 and Table 3. In the black boxes of Figure 5, it is evident that FEWO-MVSNet significantly increases the point-cloud density across the four scenes. In Table 3, we compare the three metrics Acc, Comp, and Overall; FEWO-MVSNet surpasses TransMVSNet in all three, showing that it effectively improves the reconstruction quality in regions with repetitive/weak textures.

3.3.2. Generalization Experiments

To verify the generalization ability of FEWO-MVSNet, we train the model on the BlendedMVS dataset and evaluate it on the Tanks and Temples dataset. We submit the point clouds generated by FEWO-MVSNet to the official Tanks and Temples website to obtain the F-score; a higher F-score indicates better reconstruction quality. To obtain quantitative comparison results for the intermediate set, we compare FEWO-MVSNet with traditional methods, deep learning methods, and transformer-based deep learning methods.
(1) The Tanks and Temples (Intermediate) dataset includes many complex real-world scenes with intricate geometries, repetitive textures, lighting variations, and occlusions, making reconstruction highly challenging. Gipuma is designed with a focus on dense matching and depth estimation but lacks the modularity and adaptability of methods like COLMAP; it is more commonly used in standardized laboratory settings (e.g., the DTU dataset), and its practical effectiveness is limited in more complex outdoor scenes. Therefore, only COLMAP is quantitatively evaluated in Table 4.
(2) Table 4 presents the quantitative testing results of different methods on the Tanks and Temples (Intermediate) dataset. Mean represents the average metric value across all scenes; a higher value indicates better reconstruction quality. We compare FEWO-MVSNet with 10 other methods, and its Mean ranks first among all of them. Compared to the classic deep learning algorithms MVSNet and CasMVSNet, FEWO-MVSNet improves the Mean by 20% and 7.2%. The Mean increases by 0.16 and 2.7% compared to the transformer-based TransMVSNet and MVSTER algorithms.
(3) The experiments show that the algorithm proposed in this article achieves better reconstruction results under multiple influencing factors such as outdoor lighting and image noise. Figure 6 presents the results for the intermediate dataset scenarios of Francis, Playground, Family, M60, Panther, and Train. Rich texture information is successfully captured by FEWO-MVSNet, even in weak-texture areas such as the Francis surface and in large-scale scenes such as the Playground. The F-scores of our method are better than those of existing methods. The improvements in repetitive- and weak-texture regions demonstrate that FEWO-MVSNet has a certain generalization capability.

3.4. Ablation Experiment

To examine the effect of the modules added to FEWO-MVSNet on the reconstruction results, we use DTU as the dataset for the ablation experiments. The baseline is built on the CasMVSNet cascade architecture, employing an FPN feature extraction network and variance-based metrics, without any additional modules. We conduct four ablation experiments, sequentially integrating the ASFF, DCNv2, and SENet-ASWA modules, and adopt the same evaluation metrics as for DTU to demonstrate the effectiveness of FEWO-MVSNet.
The results of the ablation experiments in Table 5 show that each module contributes to improving point-cloud reconstruction. The ASFF and DCNv2 modules capture local and global feature information, which enhances the feature extraction ability of FEWO-MVSNet. The SE-ASWA module filters the extracted feature information and extracts key information through adaptive weight allocation. Together, these modules improve the reconstruction of weak/repetitive textures and detailed features, and the results show improvements in both accuracy and completeness.
The introduction of the ASFF, DCNv2, and SE-ASWA modules together with the original transformer increases the memory consumption and computational complexity of FEWO-MVSNet. We train the model on a single GPU. As shown in Table 6, under the same environment, the running speed of FEWO-MVSNet decreases by 14.5% compared to TransMVSNet, and memory usage increases by 1.6%. In future work, we will optimize memory consumption and running speed by training the model on multiple GPUs and incorporating lightweight modules.

4. Conclusions

We construct a novel multi-view stereo network, FEWO-MVSNet, based on a feature enhancement and weight optimization (FEWO) strategy and the TransMVSNet framework. First, the Adaptive Spatial Feature Fusion (ASFF) module is used to improve the feature learning capability of the FPN; then, the DCNv2 and transformer mechanisms are combined to enhance the contextual information of the features. Next, we employ the SENet-ASWA module to suppress redundant elements in the extracted features and finally produce a rich and accurate 3D dense point cloud. Extensive experiments on standard datasets such as DTU, BlendedMVS, and Tanks and Temples demonstrate that the proposed method is superior to many existing deep learning-based methods for weak- and repetitive-texture regions. Our algorithm still has limitations in terms of operational efficiency and reconstruction completeness. In future work, we aim to replace complex computations with lightweight modules, further improving the performance and adaptability of the algorithm and making it more suitable for various scenarios.

Author Contributions

Conceptualization, Guobiao Yao and Guozhong Wei; methodology, Guobiao Yao and Ziheng Wang; software, Guobiao Yao and Ziheng Wang; validation, Guobiao Yao, Ziheng Wang and Fengqi Zhu; formal analysis, Guobiao Yao and Guozhong Wei; investigation, Ziheng Wang and Qingqing Fu; resources, Guozhong Wei and Fengqi Zhu; data curation, Ziheng Wang and Min Wei; writing—original draft preparation, Guobiao Yao and Ziheng Wang; writing—review and editing, Guobiao Yao; visualization, Ziheng Wang and Qian Yu; supervision, Guobiao Yao; funding acquisition, Qian Yu. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China with Project No. 42171435, the Shandong Provincial Natural Science Foundation with Project No. ZR2021MD006, the Postgraduate Education and Teaching Reform Foundation of Shandong Province with Project No. SDYJG19115, and the Undergraduate Education and Teaching Reform Foundation of Shandong Province with Project No. Z2021014. This work was also funded by the high-quality graduate course of Shandong Province with Project No. SDYKC2022151.

Data Availability Statement

The case data can be downloaded from GitHub HuanHuanWZH/MVS-Date: MVS 3D reconstruction (accessed on 30 September 2024).

Acknowledgments

The authors would like to thank Yikang Ding, Xizhou Zhu, and Jie Hu for providing their key algorithms.

Conflicts of Interest

Author Min Wei was employed by the company Shandong Zhengyuan Aerial Remote Sensing Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Luo, H.; Zhang, J.; Liu, X.; Zhang, L.; Liu, J. Large-Scale 3D Reconstruction from Multi-View Imagery: A Comprehensive Review. Remote Sens. 2024, 16, 773. [Google Scholar] [CrossRef]
  2. Dong, Y.; Song, J.; Fan, D.; Ji, S.; Lei, R. Joint Deep Learning and Information Propagation for Fast 3D City Modeling. ISPRS Int. J. Geo-Inf. 2023, 12, 150. [Google Scholar] [CrossRef]
  3. Bi, J.; Wang, J.; Cao, H.; Yao, G.; Wang, Y.; Li, Z.; Sun, M.; Yang, H.; Zhen, J.; Zheng, G. Inverse distance weight-assisted particle swarm optimized indoor localization. Appl. Soft Comput. 2024, 164, 112032. [Google Scholar] [CrossRef]
  4. Gao, X.; Yang, R.; Chen, X.; Tan, J.; Liu, Y.; Wang, Z.; Tan, J.; Liu, H. A New Framework for Generating Indoor 3D Digital Models from Point Clouds. Remote Sens. 2024, 16, 3462. [Google Scholar] [CrossRef]
  5. Galliani, S.; Lasinger, K.; Schindler, K. Massively parallel multiview stereopsis by surface normal diffusion. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 873–881. [Google Scholar]
  6. Schonberger, J.L.; Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
  7. Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A.C. Matchnet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3279–3286. [Google Scholar]
  8. Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 767–783. [Google Scholar]
  9. Yao, Y.; Luo, Z.; Li, S.; Shen, T.; Fang, T.; Quan, L. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5525–5534. [Google Scholar]
  10. Gu, X.; Fan, Z.; Zhu, S.; Dai, Z.; Tan, F.; Tan, P. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2495–2504. [Google Scholar]
  11. Wang, F.; Galliani, S.; Vogel, C.; Speciale, P.; Pollefeys, M. Patchmatchnet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14194–14203. [Google Scholar]
  12. Saeed, S.; Lee, S.; Cho, Y.; Park, U. ASPPMVSNet: A high-receptive-field multiview stereo network for dense three-dimensional reconstruction. ETRI J. 2022, 44, 1034–1046. [Google Scholar] [CrossRef]
  13. Li, Y.; Li, W.; Zhao, Z.; Fan, J. DRI-MVSNet: A depth residual inference network for multi-view stereo images. PLoS ONE 2022, 17, e0264721. [Google Scholar] [CrossRef] [PubMed]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  15. Wang, X.; Zhu, Z.; Huang, G.; Qin, F.; Ye, Y.; He, Y.; Chi, X.; Wang, X. Mvster: Epipolar transformer for efficient multi-view stereo. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 573–591. [Google Scholar]
  16. Zhu, J.; Peng, B.; Li, W.; Shen, H.; Zhang, Z.; Lei, J. Multi-view stereo with transformer. arXiv 2021, arXiv:2112.00336. [Google Scholar]
  17. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
  18. Ding, Y.; Yuan, W.; Zhu, Q.; Zhang, H.; Liu, X.; Wang, Y.; Liu, X. Transmvsnet: Global context-aware multi-view stereo network with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8585–8594. [Google Scholar]
  19. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  21. Teng, J.; Sun, H.; Liu, P.; Jiang, S. An Improved TransMVSNet Algorithm for Three-Dimensional Reconstruction in the Unmanned Aerial Vehicle Remote Sensing Domain. Sensors 2024, 24, 2064. [Google Scholar] [CrossRef]
  22. Wang, G.; Wang, K.; Lin, L. Adaptively connected neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1781–1790. [Google Scholar]
  23. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 66–75. [Google Scholar]
  24. Ross, T.-Y.; Dollár, G. Focal loss for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2980–2988. [Google Scholar]
  25. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar]
  26. Jia, X.; De Brabandere, B.; Tuytelaars, T.; Gool, L.V. Dynamic filter networks. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar] [CrossRef]
  27. Aanæs, H.; Jensen, R.R.; Vogiatzis, G.; Tola, E.; Dahl, A.B. Large-scale data for multiple-view stereopsis. Int. J. Comput. Vis. 2016, 120, 153–168. [Google Scholar] [CrossRef]
  28. Yao, Y.; Luo, Z.; Li, S.; Zhang, J.; Ren, Y.; Zhou, L.; Fang, T.; Quan, L. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1790–1799. [Google Scholar]
  29. Knapitsch, A.; Park, J.; Zhou, Q.-Y.; Koltun, V. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph. (ToG) 2017, 36, 1–13. [Google Scholar] [CrossRef]
Figure 1. The proposed 3D reconstruction network, FEWO-MVSNet.
Figure 2. The improved feature extraction network by integrating the FPN with the ASFF module.
Figure 3. The adaptive allocation mechanism for feature weight.
Figure 4. We chose three representative methods (MVSNet, CasMVSNet and TransMVSNet) for comparison. The reconstruction results in four scenes with weak/repeated textures are shown, with optimized details indicated in the red box.
Figure 5. Comparison of the reconstructed 3D dense point clouds from repetitive- and weak-texture scenes based on TransMVSNet and our FEWO-MVSNet. The reconstruction results in four scenes with weak/repeated textures are shown, with optimized details indicated in the black box.
Figure 6. The reconstruction of scenes based on FEWO-MVSNet using the Tanks and Temples (Intermediate) dataset. The reconstruction results in six scenes with weak/repeated textures are shown, with optimized details indicated in the red box.
Table 1. Experimental datasets in detail.
Data Types: Data Description
(1) DTU [27]: DTU is a large indoor dataset that encompasses 128 scenes, each captured from 49 or 63 camera positions. We divide the 27,097 training samples into a training set of 79 scenes, an evaluation set of 18 scenes, and a test set of 22 scenes.
(2) BlendedMVS [28]: BlendedMVS includes 113 scenes of various types, for example, cities and buildings, with a total of 17,818 images. The dataset does not currently provide an evaluation tool, so it is used only for model training in the generalization experiments.
(3) Tanks and Temples [29]: Tanks and Temples is a large indoor and outdoor dataset that comprises 14 scenes of different scales. It is adopted as the test set for the generalization experiments. We categorize it into an intermediate set of eight scenes and an advanced set of six scenes.
Table 2. Quantitative comparison results for DTU. The bold values denote the best results, and the underlined values denote the second-best results.
Methods              Acc/mm   Comp/mm   Overall/mm
Gipuma [5]           0.283    0.873     0.578
Colmap [6]           0.400    0.664     0.532
MVSNet [8]           0.396    0.527     0.462
R-MVSNet [9]         0.383    0.452     0.417
CasMVSNet [10]       0.325    0.385     0.355
DRI-MVSNet [13]      0.432    0.327     0.379
ASPPMVSNet [12]      0.334    0.360     0.347
PatchMatchNet [11]   0.427    0.277     0.352
MVSTR [16]           0.356    0.295     0.326
MVSTER [15]          0.350    0.276     0.313
TransMVSNet [18]     0.333    0.301     0.317
Ours                 0.313    0.311     0.312
Table 3. Quantitative comparison results of four scenes from the DTU dataset. The bold values indicate the best results.
Scenes     Acc/mm (TransMVSNet / FEWO-MVSNet)   Comp/mm (TransMVSNet / FEWO-MVSNet)   Overall/mm (TransMVSNet / FEWO-MVSNet)
Scan11     0.321 / 0.335                        0.339 / 0.311                         0.330 / 0.323
Scan12     0.329 / 0.324                        0.209 / 0.209                         0.269 / 0.266
Scan48     0.369 / 0.354                        0.627 / 0.487                         0.498 / 0.420
Scan118    0.229 / 0.229                        0.318 / 0.317                         0.274 / 0.273
Table 4. Quantitative testing results of different methods on Tanks and Temples (Inter). The bold values denote the best results, and the underlined values denote the second-best results.
Method (Intermediate set)   Mean    Fam.    Fran.   Horse   L.H.    M60     Path.   P.G.    Train
Colmap [6]                  42.14   50.41   22.25   26.63   56.43   44.83   46.97   48.53   42.04
MVSNet [8]                  43.48   55.99   28.55   25.07   50.79   53.96   50.86   47.90   34.69
R-MVSNet [9]                50.55   73.01   54.46   43.42   43.88   46.80   46.69   50.87   45.25
CasMVSNet [10]              56.42   76.36   58.45   46.20   55.53   56.11   54.02   58.17   46.56
DRI-MVSNet [13]             52.71   73.64   53.48   40.57   53.90   48.48   46.44   59.09   46.10
ASPPMVSNet [12]             54.03   76.50   47.74   36.34   55.12   57.28   54.28   57.43   47.54
PatchMatchNet [11]          53.15   66.99   52.64   43.25   54.87   52.87   49.54   54.21   50.81
MVSTR [16]                  56.93   76.92   59.82   50.16   56.73   56.53   51.22   56.58   47.48
MVSTER [15]                 60.92   80.21   63.51   52.30   61.38   61.47   58.16   58.98   51.38
TransMVSNet [18]            63.52   80.92   65.83   56.89   62.54   63.06   60.00   60.20   58.67
Ours                        63.68   81.09   65.08   56.92   62.18   62.79   61.27   61.34   58.75
Table 5. Comparison of quantitative results of ablation experiments.
Methods             Acc/mm   Comp/mm   Overall/mm
Baseline Net        0.351    0.339     0.345
+ASFF and DCNv2     0.334    0.322     0.328
+SE-ASWA            0.325    0.315     0.320
FEWO-MVSNet         0.313    0.311     0.312
Table 6. Comparison of GPU memory usage and runtime for a single epoch on the DTU dataset with the same input size.
Methods        Input Size (Pixels)   Depth Map Size (Pixels)   Time (s)   Memory Usage (MB)
TransMVSNet    640 × 512             1152 × 864                22,320     11,892
FEWO-MVSNet    640 × 512             1152 × 864                25,560     12,085