Multi-View Three-Dimensional Reconstruction Based on Feature Enhancement and Weight Optimization Network
Abstract
1. Introduction
2. Methodology
- Introduce an Adaptive Spatial Feature Fusion (ASFF) module [19] on the basis of a Feature Pyramid Network (FPN) to enhance the capture of feature information at different scales;
- Adaptively expand the search range of features through the deformable convolution module DCNv2 [17], and combine it with a transformer positional encoder to enhance the aggregation of global contextual feature information;
- Design an Adaptive Space Weight Allocation (ASWA) module, integrated with SENet [20], to highlight the low-frequency information in the convolutional channels and color space, thereby enabling dense feature extraction from multi-view images containing weak- and repetitive-texture regions. Minimal code sketches of these three building blocks follow this list.
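As a concrete reference for the first point, here is a minimal PyTorch sketch of adaptive spatial feature fusion in the spirit of [19]: pyramid levels are resampled to a common resolution and blended with learned per-pixel weights. The class name, channel counts, and three-level layout are illustrative assumptions, not the exact FEWO-MVSNet module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFF(nn.Module):
    """Fuse three pyramid levels with learned per-pixel weights [19]."""
    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 conv per level produces a single-channel weight map.
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)]
        )

    def forward(self, feats):
        # feats: three maps with equal channel counts but possibly
        # different resolutions; resample all to the first map's size.
        size = feats[0].shape[-2:]
        resized = [
            f if f.shape[-2:] == size
            else F.interpolate(f, size=size, mode="bilinear", align_corners=False)
            for f in feats
        ]
        # Per-pixel softmax over the three levels, then a weighted sum.
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, resized)], dim=1)
        w = torch.softmax(logits, dim=1)  # (B, 3, H, W)
        return sum(w[:, i:i + 1] * resized[i] for i in range(3))
```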
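The DCNv2 step of the second point can likewise be sketched with torchvision's deformable convolution: a plain convolution predicts per-tap sampling offsets plus a sigmoid-gated modulation mask, so the effective receptive field adapts to image content [17]. This assumes a recent torchvision whose DeformConv2d accepts a mask argument; it is a generic sketch, not the paper's exact layer configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ModulatedDeformBlock(nn.Module):
    """DCNv2-style layer: learned offsets plus a modulation mask [17]."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.k = k
        # 2*k*k channels for (x, y) offsets per tap, k*k for the mask.
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, padding=k // 2)
        self.dcn = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        om = self.offset_mask(x)
        offset = om[:, : 2 * self.k * self.k]
        mask = torch.sigmoid(om[:, 2 * self.k * self.k :])
        return self.dcn(x, offset, mask)
```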
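Finally, the SENet recalibration that the ASWA module of the third point builds on is the standard squeeze-and-excitation block [20]; the ASWA-specific spatial weight allocation described in Section 2.3 is not reproduced in this sketch.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention [20]."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context
        self.fc = nn.Sequential(             # excitation: per-channel gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates  # reweight channels before further convolutions
```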
2.1. Feature Extraction Networks
2.2. Aggregation and Enhancement for Features Based on Transformer
2.2.1. Adaptive Enlargement of Receptive Field
2.2.2. Feature Encoding Based on Transformer
2.3. Adaptive Allocation for Feature Weights
2.4. Correlation Volume Construction and Loss Function Estimation
2.4.1. Correlation Volume Construction
2.4.2. Loss Function Estimation
2.5. Depth Map Filtering and Fusion
3. Results and Discussion
3.1. Experimental Datasets
3.2. Experimental Details
3.3. Results and Analysis
3.3.1. Comparative Experiments
- (1) Lower values of the three metrics (Acc, Comp, and Overall) indicate that the reconstructed point cloud is closer to the ground-truth point cloud; a sketch of how these metrics are computed follows this list. We compare FEWO-MVSNet with traditional methods (Gipuma, Colmap), deep learning methods (MVSNet, R-MVSNet, CasMVSNet, DRI-MVSNet, ASPPMVSNet, PatchMatchNet), and transformer-based deep learning methods (MVSTR, MVSTER, TransMVSNet). Table 2 shows the quantitative comparison results on the DTU dataset. Compared with the classic deep learning algorithms MVSNet and CasMVSNet, FEWO-MVSNet improves accuracy by 8% and 1.2% while enhancing completeness by 15% and 4.3%. Accuracy increases by 2% and 3.7% relative to the transformer-based TransMVSNet and MVSTER algorithms.
- (2) Figure 4 shows the dense point-cloud reconstructions. FEWO-MVSNet improves reconstruction in weak-/repetitive-texture regions by combining weight optimization with feature enhancement. In the red boxes in Figure 4, note the oscilloscope dial and side in scan11, the left and right sides of the beer packaging in scan12, the top of the cup in scan48, and the top of the sculpture in scan118: the texture information in the blank areas of these scenes is enhanced. FEWO-MVSNet produces denser and more complete point clouds while preserving more detail.
- (3) We conduct comparative experiments between the representative TransMVSNet and our FEWO-MVSNet in four repetitive-/weak-texture scenes, as shown in Figure 5 and Table 3. In the black boxes of Figure 5, it is evident that FEWO-MVSNet significantly increases the point-cloud density across all four scenes. Table 3 compares the three metrics Acc, Comp, and Overall: FEWO-MVSNet matches or surpasses TransMVSNet on Comp and Overall in every scene, effectively improving reconstruction quality in repetitive-/weak-texture regions.
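As referenced in item (1), the three DTU metrics reduce to nearest-neighbor distances between the reconstructed and ground-truth clouds: Acc averages reconstruction-to-ground-truth distances, Comp the converse, and Overall their mean. The sketch below illustrates only this relationship; the official DTU protocol additionally applies observability masks and distance truncation, which are omitted here.

```python
import numpy as np
from scipy.spatial import cKDTree

def dtu_metrics(pred: np.ndarray, gt: np.ndarray):
    """Simplified Acc/Comp/Overall in the clouds' units (mm for DTU).

    pred, gt: (N, 3) and (M, 3) point arrays. The official evaluation
    also masks unobserved regions and truncates outlier distances.
    """
    acc = cKDTree(gt).query(pred)[0].mean()   # reconstruction -> ground truth
    comp = cKDTree(pred).query(gt)[0].mean()  # ground truth -> reconstruction
    return acc, comp, (acc + comp) / 2.0
```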
3.3.2. Generalization Experiments
- (1) The Tanks and Temples (Intermediate) dataset includes many complex real-world scenes with intricate geometry, repetitive textures, lighting variations, and occlusions, making reconstruction highly challenging. Gipuma focuses on dense matching and depth estimation but lacks the modularity and adaptability of methods such as COLMAP; it is more commonly used in standardized laboratory settings (e.g., the DTU dataset), and its practical effectiveness is limited in more complex outdoor scenes. Therefore, only COLMAP is quantitatively evaluated in Table 4.
- (2) Table 4 presents the quantitative test results of different methods on the Tanks and Temples (Intermediate) dataset. Mean is the average F-score across all scenes, and a higher value indicates better reconstruction quality (a sketch of the F-score computation follows this list). We compare FEWO-MVSNet with 10 other methods, and its Mean ranks first among all of them. Compared with the classic deep learning algorithms MVSNet and CasMVSNet, FEWO-MVSNet improves Mean by 20 and 7.2 percentage points, and it gains 0.16 and 2.7 percentage points over the transformer-based TransMVSNet and MVSTER.
- (3) The experiments show that the proposed algorithm achieves better reconstruction results under multiple influencing factors such as outdoor lighting and image noise. Figure 6 presents the results for the intermediate-set scenes Francis, Playground, Family, M60, Panther, and Train. FEWO-MVSNet successfully captures rich texture information even in weak-texture areas such as the Francis surface and in large-scale scenes such as the Playground, and its mean F-score exceeds those of existing methods. The improvement in repetitive- and weak-texture regions demonstrates that FEWO-MVSNet has a certain generalization capability.
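As referenced in item (2), each Tanks and Temples scene score is an F-score: precision is the fraction of reconstructed points within a distance threshold of the ground truth, recall the converse, and the score their harmonic mean. The sketch below only illustrates the formula; the official benchmark evaluates submissions on its server with per-scene thresholds and alignment.

```python
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred: np.ndarray, gt: np.ndarray, tau: float) -> float:
    """Simplified F-score at distance threshold tau (scene units)."""
    precision = (cKDTree(gt).query(pred)[0] < tau).mean()  # pred near gt
    recall = (cKDTree(pred).query(gt)[0] < tau).mean()     # gt near pred
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```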
3.4. Ablation Experiment
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Luo, H.; Zhang, J.; Liu, X.; Zhang, L.; Liu, J. Large-Scale 3D Reconstruction from Multi-View Imagery: A Comprehensive Review. Remote Sens. 2024, 16, 773.
2. Dong, Y.; Song, J.; Fan, D.; Ji, S.; Lei, R. Joint Deep Learning and Information Propagation for Fast 3D City Modeling. ISPRS Int. J. Geo-Inf. 2023, 12, 150.
3. Bi, J.; Wang, J.; Cao, H.; Yao, G.; Wang, Y.; Li, Z.; Sun, M.; Yang, H.; Zhen, J.; Zheng, G. Inverse Distance Weight-Assisted Particle Swarm Optimized Indoor Localization. Appl. Soft Comput. 2024, 164, 112032.
4. Gao, X.; Yang, R.; Chen, X.; Tan, J.; Liu, Y.; Wang, Z.; Tan, J.; Liu, H. A New Framework for Generating Indoor 3D Digital Models from Point Clouds. Remote Sens. 2024, 16, 3462.
5. Galliani, S.; Lasinger, K.; Schindler, K. Massively Parallel Multiview Stereopsis by Surface Normal Diffusion. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 873–881.
6. Schönberger, J.L.; Frahm, J.-M. Structure-from-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113.
7. Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A.C. MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3279–3286.
8. Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. MVSNet: Depth Inference for Unstructured Multi-View Stereo. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 767–783.
9. Yao, Y.; Luo, Z.; Li, S.; Shen, T.; Fang, T.; Quan, L. Recurrent MVSNet for High-Resolution Multi-View Stereo Depth Inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5525–5534.
10. Gu, X.; Fan, Z.; Zhu, S.; Dai, Z.; Tan, F.; Tan, P. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2495–2504.
11. Wang, F.; Galliani, S.; Vogel, C.; Speciale, P.; Pollefeys, M. PatchmatchNet: Learned Multi-View Patchmatch Stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14194–14203.
12. Saeed, S.; Lee, S.; Cho, Y.; Park, U. ASPPMVSNet: A High-Receptive-Field Multiview Stereo Network for Dense Three-Dimensional Reconstruction. ETRI J. 2022, 44, 1034–1046.
13. Li, Y.; Li, W.; Zhao, Z.; Fan, J. DRI-MVSNet: A Depth Residual Inference Network for Multi-View Stereo Images. PLoS ONE 2022, 17, e0264721.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30.
15. Wang, X.; Zhu, Z.; Huang, G.; Qin, F.; Ye, Y.; He, Y.; Chi, X.; Wang, X. MVSTER: Epipolar Transformer for Efficient Multi-View Stereo. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 573–591.
16. Zhu, J.; Peng, B.; Li, W.; Shen, H.; Zhang, Z.; Lei, J. Multi-View Stereo with Transformer. arXiv 2021, arXiv:2112.00336.
17. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316.
18. Ding, Y.; Yuan, W.; Zhu, Q.; Zhang, H.; Liu, X.; Wang, Y.; Liu, X. TransMVSNet: Global Context-Aware Multi-View Stereo Network with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8585–8594.
19. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516.
20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
21. Teng, J.; Sun, H.; Liu, P.; Jiang, S. An Improved TransMVSNet Algorithm for Three-Dimensional Reconstruction in the Unmanned Aerial Vehicle Remote Sensing Domain. Sensors 2024, 24, 2064.
22. Wang, G.; Wang, K.; Lin, L. Adaptively Connected Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1781–1790.
23. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-End Learning of Geometry and Context for Deep Stereo Regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 66–75.
24. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
25. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083.
26. Jia, X.; De Brabandere, B.; Tuytelaars, T.; Van Gool, L. Dynamic Filter Networks. Adv. Neural Inf. Process. Syst. 2016, 29.
27. Aanæs, H.; Jensen, R.R.; Vogiatzis, G.; Tola, E.; Dahl, A.B. Large-Scale Data for Multiple-View Stereopsis. Int. J. Comput. Vis. 2016, 120, 153–168.
28. Yao, Y.; Luo, Z.; Li, S.; Zhang, J.; Ren, Y.; Zhou, L.; Fang, T.; Quan, L. BlendedMVS: A Large-Scale Dataset for Generalized Multi-View Stereo Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1790–1799.
29. Knapitsch, A.; Park, J.; Zhou, Q.-Y.; Koltun, V. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Trans. Graph. 2017, 36, 1–13.
Data Types | Data Description |
---|---|
(1) DTU [27] | DTU is a large indoor dataset encompassing 128 scenes, each captured from 49 or 63 camera positions. Its 27,097 training samples are divided into a training set of 79 scenes, an evaluation set of 18 scenes, and a test set of 22 scenes. |
(2) BlendedMVS [28] | BlendedMVS includes 113 scenes of various types, for example cities and buildings, with a total of 17,818 images. The dataset currently provides no evaluation tool, so it is used only for model training in the generalization experiment. |
(3) Tanks and Temples [29] | Tanks and Temples is a large indoor and outdoor dataset comprising 14 scenes of different scales, adopted as the test set for the generalization experiments. It is divided into an intermediate set of eight scenes and an advanced set of six scenes. |
Methods | Acc/mm | Comp/mm | Overall/mm |
---|---|---|---|
Gipuma [5] | 0.283 | 0.873 | 0.578 |
Colmap [6] | 0.400 | 0.664 | 0.532 |
MVSNet [8] | 0.396 | 0.527 | 0.462 |
R-MVSNet [9] | 0.383 | 0.452 | 0.417 |
CasMVSNet [10] | 0.325 | 0.385 | 0.355 |
DRI-MVSNet [13] | 0.432 | 0.327 | 0.379 |
ASPPMVSNet [12] | 0.334 | 0.360 | 0.347 |
PatchMatchNet [11] | 0.427 | 0.277 | 0.352 |
MVSTR [16] | 0.356 | 0.295 | 0.326 |
MVSTER [15] | 0.350 | 0.276 | 0.313 |
TransMVSNet [18] | 0.333 | 0.301 | 0.317 |
Ours | 0.313 | 0.311 | 0.312 |
Scenes | Acc/mm (TransMVSNet) | Acc/mm (FEWO-MVSNet) | Comp/mm (TransMVSNet) | Comp/mm (FEWO-MVSNet) | Overall/mm (TransMVSNet) | Overall/mm (FEWO-MVSNet) |
---|---|---|---|---|---|---|
Scan11 | 0.321 | 0.335 | 0.339 | 0.311 | 0.330 | 0.323 |
Scan12 | 0.329 | 0.324 | 0.209 | 0.209 | 0.269 | 0.266 |
Scan48 | 0.369 | 0.354 | 0.627 | 0.487 | 0.498 | 0.420 |
Scan118 | 0.229 | 0.229 | 0.318 | 0.317 | 0.274 | 0.273 |
Method | Mean | Fam. | Fran. | Horse | L.H. | M60 | Pan. | P.G. | Train |
---|---|---|---|---|---|---|---|---|---|
Colmap [6] | 42.14 | 50.41 | 22.25 | 26.63 | 56.43 | 44.83 | 46.97 | 48.53 | 42.04 |
MVSNet [8] | 43.48 | 55.99 | 28.55 | 25.07 | 50.79 | 53.96 | 50.86 | 47.90 | 34.69 |
R-MVSNet [9] | 50.55 | 73.01 | 54.46 | 43.42 | 43.88 | 46.80 | 46.69 | 50.87 | 45.25 |
CasMVSNet [10] | 56.42 | 76.36 | 58.45 | 46.20 | 55.53 | 56.11 | 54.02 | 58.17 | 46.56 |
DRI-MVSNet [13] | 52.71 | 73.64 | 53.48 | 40.57 | 53.90 | 48.48 | 46.44 | 59.09 | 46.10 |
ASPPMVSNet [12] | 54.03 | 76.50 | 47.74 | 36.34 | 55.12 | 57.28 | 54.28 | 57.43 | 47.54 |
PatchMatchNet [11] | 53.15 | 66.99 | 52.64 | 43.25 | 54.87 | 52.87 | 49.54 | 54.21 | 50.81 |
MVSTR [16] | 56.93 | 76.92 | 59.82 | 50.16 | 56.73 | 56.53 | 51.22 | 56.58 | 47.48 |
MVSTER [15] | 60.92 | 80.21 | 63.51 | 52.30 | 61.38 | 61.47 | 58.16 | 58.98 | 51.38 |
TransMVSNet [18] | 63.52 | 80.92 | 65.83 | 56.89 | 62.54 | 63.06 | 60.00 | 60.20 | 58.67 |
Ours | 63.68 | 81.09 | 65.08 | 56.92 | 62.18 | 62.79 | 61.27 | 61.34 | 58.75 |

Values are per-scene F-scores (%) on the intermediate set; Fam. = Family, Fran. = Francis, L.H. = Lighthouse, Pan. = Panther, P.G. = Playground.
Methods | Acc/mm | Comp/mm | Overall/mm |
---|---|---|---|
Baseline Net | 0.351 | 0.339 | 0.345 |
+ASFF and DCNv2 | 0.334 | 0.322 | 0.328 |
+SE-ASWA | 0.325 | 0.315 | 0.320 |
FEWO-MVSNet | 0.313 | 0.311 | 0.312 |
Methods | Input Size (Pixels) | Depth Map Size (Pixels) | Time (s) | Memory Usage (MB) |
---|---|---|---|---|
TransMVSNet | 640 × 512 | 1152 × 864 | 22,320 | 11,892 |
FEWO-MVSNet | 640 × 512 | 1152 × 864 | 25,560 | 12,085 |