1. Introduction
The photogrammetric reconstruction of 3D objects in a wide spectrum of close range applications is currently performed based on point clouds generated by dense image-matching techniques. Some of the most commonly implemented approaches are structure from motion (SfM), scalable multi-view stereo (MVS), and semi-global matching (SGM).
Photogrammetric approaches enable a highly redundant bundle block adjustment (BBA), simultaneous digital camera self-calibration, and automatic scene geometry reconstruction using image matching. Dense point clouds and advanced data processing allow the recognition and complete reconstruction of 3D objects, followed by measuring and extracting geometric and semantic information [1]. In recent years, a large number of fully automated photogrammetric software packages for georeferenced digital 3D reconstruction have been developed [2,3,4,5,6,7]. These applications allow even non-expert users to generate 3D models for various purposes with just a few mouse clicks [8]. Despite remarkable progress in the “black-box” image-based processing pipeline (image preprocessing [9], keypoint detection [10,11,12,13,14] and description [10,11,12,13,14,15,16,17], matching [18,19], bundle adjustment (BA) [20,21,22], and dense point cloud generation [23,24,25]), the accuracy of the computed results remains challenging.
In this connection, an important research objective concerns the evaluation of the quality and accuracy of image-based dense point cloud generation and processing aimed at 3D object reconstruction in close range. The applicability of image-based point clouds for the geometrical measuring of various objects is also very significant. Due to the properties of software and their hidden computing algorithms, researchers have been using many datasets for application evaluation [4,5,6,26] with regard to the merits and demerits of different case studies [27,28,29]. Over the years, different datasets have been published [30,31,32,33,34,35,36,37,38,39].
The benchmark datasets used up to now have the following errors and deficiencies:
The omission of precise geodetic hardware and of measuring and adjustment methods for the determination of the reference control network (control points and check points).
Shortages in ground truth information, e.g., the use of terrestrial laser scanning (TLS) point clouds without a description of the root mean square error (RMSE) of merging multiple scans.
Insufficient accuracy of the 2D and 3D data in the existing publicly available datasets.
The use of simplified descriptions in terms of processing and computation accuracy; although standard terms and metrics do exist [40,41,42], they are not always correctly employed by all software packages and by researchers using them in benchmarks, as highlighted by Remondino et al. [5].
According to the authors’ knowledge, a publicly available close range dataset that allows a comprehensive and advanced software evaluation does not exist at present.
The proposed photogrammetric Close Range Benchmark DataSet, named CRBeDaSet, was thus designed to investigate the radiometric quality and geometric accuracy of a photogrammetric 3D reconstruction of a medium-size spatial object in close range, using dense point clouds generated from terrestrial multi-images.
The main contributions of this paper are as follows:
We provide an extensive review of the state-of-the-art datasets for 3D object reconstruction, ranging from datasets dedicated to algorithm evaluation to recently published special case studies.
We construct a new CRBeDaSet benchmark dataset, which consists of real geodetic, photogrammetric, and terrestrial laser scanning measurement data. The CRBeDaSet (see Data Availability Statement) is publicly available and, thanks to its high measurement and adjustment accuracy, offers greater possibilities for evaluation than the currently existing and shared datasets.
We evaluate several applications for 3D modeling, reconstruction, and mapping on the CRBeDaSet under consistent experimental conditions to demonstrate the value of this dataset in software evaluation; this provides the literature with extensive baseline results for future research on digital processing in close range tasks.
We provide a comprehensive analysis of image matching of outdoor objects (4 elevations) with weak texture using multiple detectors and descriptors.
We prepared the real TLS point cloud, which is denoised, filtered, and classified using the level of detail 3 (LoD3) standard. It extends the publicly available datasets (e.g., Semantic3D) used for machine and deep learning purposes related to point cloud classification and segmentation.
Furthermore, because of the extended metadata about the acquired point cloud, the provided data may be used in evaluations of TLS registration algorithms.
2. Existing Datasets and Benchmarks
Vision Middlebury (vision.middlebury.edu, accessed on 30 November 2022) [33] should be acknowledged at the beginning of the analysis of publicly available datasets for algorithm and application evaluation. These data allowed researchers to perform the first attempts to evaluate MVS on equal grounds. The datasets are based on two small objects, each below 0.15 m in height. Both are covered by more than 300 low-resolution (640 × 480 pixels) images acquired by a controlled robotic arm. The interior and exterior orientations are included. The ground truth models were acquired using a laser stripe scanner with a resolution of 0.25 mm. The relative accuracy, understood as the quotient of the measurement error and the object diameter, obtained in the laboratory test was ca. 1/1000 for the temple object and ca. 1/600 for the dino object. The main problems with these data are the unrealistic scenes and the low geometric resolution of the images.
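For reference, the relative accuracy used throughout this comparison can be written as the ratio of the measurement error to the object diameter; a minimal worked example with assumed illustrative values (not taken from the Middlebury data) is:

\[
a_{\mathrm{rel}} = \frac{\sigma}{D}, \qquad \sigma = 1\ \mathrm{mm},\; D = 1\ \mathrm{m} \;\Rightarrow\; a_{\mathrm{rel}} = \frac{0.001}{1} = \frac{1}{1000}.
\]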
To eliminate these shortcomings, the CVLAB published three samples of building façades [35,43,44]. The photos were taken outside the lab using a Canon D60 digital camera with a higher resolution (3072 × 2028 pixels) than Middlebury. The images were supplemented by a colored TLS point cloud (Zoller + Fröhlich IMAGER 5003 scanner) as ground truth data. This benchmark supported the development and validation of advanced reconstruction algorithms [37,38], but the small number of 3D scenes and the limited variability of scenarios restrict the scope of analysis and the conclusions that can be drawn from them. Furthermore, these datasets do not provide information that would allow calculating the relative adjustment accuracy.
To support further development of the MVS method, Aanæs et al. [45] provided a dataset with a wide range of indoor and outdoor scene types. All of the objects were captured on a tabletop under controlled lighting. Eighty different objects, each with sequences of different lighting conditions acquired from 119 positions, were delivered with ground truth camera poses (controlled by a precise robotic arm) and TLS ground truth. The obtained relative accuracy was ca. 1/3300. Due to the limitations in the total number of objects, objects per scene, and the total number of scenes, the DTU Robot Image Data Sets were improved twice, in 2014 [30] and 2016 [34]. Because the datasets are prepared in the laboratory, they provide mostly geometrically ideal objects with a large number of simple details and are still captured at low resolution (1200 × 1600 pixels), with a relative accuracy of ca. 1/2000. In terms of application to 3D reconstruction evaluation, these datasets are insufficient.
In the case of accuracy assessment, the photogrammetric products obtained using remotely piloted aircraft systems (RPAS) should be noted. Most of the current studies related to this topic are realized under incomparable technical project conditions and use various statistical parameters for results evaluation [46,47] relating to the particular stages of the processing [48,49,50,51]. The most commonly used accuracy statistic was the RMSE on ground control and check points (GCPs, ChPs), which characterizes the BBA results. Mostafa, in his research [52], states that changing the side overlap in the image block from q% = 80% to q% = 40% resulted in the same accuracy within the measurement noise. Furthermore, he indicated that ground object positioning accuracy is about 2–3 times the ground sample distance (GSD) and height accuracy is about 4–5 times the GSD [52]. Oniga et al. [53] compared results with TLS data. Wierzbicki and Nienaltowski [51] created a triangulated irregular network (TIN) model based on points measured using the Global Navigation Satellite System (GNSS) real-time kinematic (RTK) technique, and they compared the results by analyzing height differences.
For describing the accuracy of orthomosaics, researchers mainly use the BBA results [49,54]. Hung et al. [55] and James et al. [48] also analyzed the directions and lengths of the deviation vectors for GCPs and ChPs from the BBA. In their study, Gabara and Sawicki [56] present a procedure for complex accuracy assessment, which includes all computation stages (feature descriptor extraction and matching, bundle block adjustment with camera self-calibration supported by SfM, densification using MVS, meshing, and orthorectification).
Traditionally, photogrammetry has always focused on the evaluation of the accuracy and precision of mapping. For this purpose, researchers created test fields, which allow comparing different measurement techniques. One of them is the test field created by the Institute for Photogrammetry of the University of Stuttgart [57,58] under the auspices of the German Society of Photogrammetry, Remote Sensing, and Geoinformation (DGPF). The objective of this test was to evaluate the sensors’ technical attributes and their relevance to specific applications and to investigate the software processing chain in the preparation of photogrammetric products. Depending on the target processing type and purpose of the unmanned aerial vehicle (UAV) imagery, different types and configurations of test fields have been used in studies.
A thorough calibration of the camera system mounted onboard RPAS platforms is essential and is most often performed in the laboratory using a small volumetric test field and planar pattern [59] or a 3D test field [60], which consists of spatially distributed, coded, and non-coded targets [61]. In the data collection or mapping case, the test fields are mainly characterized by a small area and homogeneous surface [62] or differentiated topographical terrain [61,63] with evenly located GCPs and ChPs. Due to the limited scope of field surveys, the used approaches usually had a small number of GCPs and ChPs [46,63,64]. Test fields with an area of several square kilometers, varying terrain elevation and topography, and a dense GCP network requiring a large number of surveys are, in practice, very rarely realized. A notable exception is the test area presented in the work of Haala et al. [65].
Inspired by these activities, the International Society for Photogrammetry and Remote Sensing (ISPRS), in collaboration with European Spatial Data Research (EuroSDR), provided co-funding for developing and managing a new image dataset, which was presented by Nex et al. [66] at the ISPRS conference. The dataset covers three large test areas. It contains terrestrial and UAV imagery, nadir and oblique aerial photos, TLS, airborne laser scanning (ALS), and ground control network information. Since the aim of the dataset was not to compare software or algorithms but to assess the accuracy and reliability of different measuring methods, the number of check points and the theoretical precision of the bundle adjustment do not allow for 3D reconstruction with high accuracy in the millimeter range. Additionally, the clear and unambiguously recognizable structure is not difficult to match by software dedicated to 3D reconstruction. The dataset was used and extended by Haala and Cavegn [67] with an additional building scenario. They concluded that it is possible to derive point clouds at an accuracy and resolution corresponding to the GSD of the original images (GSD ≅ 0.05 m).
Furthermore, for deep learning purposes (where accuracy is one of the main topics), the main issue was also related to the specificity of high-resolution aerial images and the size of the photographed objects, which limits data augmentation: image resizing, reshaping, blurring, and the addition of noise are not acceptable. Moreover, Mittal et al. [68] have also pointed out the high number of occlusions, large-scale variations, and class imbalance.
At the Conference on Computer Vision and Pattern Recognition in 2017, Schöps et al. [32] released the ETH3D Benchmark Dataset (www.eth3d.net, accessed on 27 July 2022). It contains high- and low-resolution training data (13 real indoor and outdoor scenes), test data (12 real indoor and outdoor scenes) taken with a digital single-lens reflex (DSLR) camera, and ground truth information acquired with TLS. The ETH3D covers general solutions that prevent overfitting of algorithms and provides the first benchmark for hand-held MVS with consumer-grade cameras, but it does not provide a ground control network for the analysis of adjustment accuracy. At the same time as ETH3D, Knapitsch et al. [69] were working on the Tanks and Temples Dataset (www.tanksandtemples.org, accessed on 10 July 2022) and evaluated 15 reconstruction pipelines. For this purpose, the authors provide training (7 scenarios) and testing data (intermediate: 8 scenarios, advanced: 6 scenarios) of sculptures, large vehicles, house-scale buildings with outside-looking-in camera trajectories, large indoor scenes imaged from within, and large outdoor scenes with complex geometric layouts and camera trajectories. The included scenes are intended to stimulate the development of new approaches to 3D reconstruction and robust broad-competence systems.
In terms of surveying and photogrammetry, the lack of accuracy, adjustment, and reliability information in these datasets is evident. Besides, there is also a lack of datasets for machine and deep learning purposes. Researchers have tried to get around this issue by using Google Earth images to feed artificial neural networks (ANN); however, because of the GSD and prior preprocessing of these images, they cannot replace authentic imagery. Due to this, Gabara and Sawicki [70] have recently prepared the Kortowo test field. The test area of about 2 square km was designed for accuracy assessment of low-altitude photogrammetric data collection, with particular emphasis on evaluating the respective stages of digital image processing and computation. Additionally, in 2021, the Institute for Photogrammetry (IfP) research team from Stuttgart University, in cooperation with international scientists [71], published the first stage of a new benchmark (Hessigheim 3D, H3D), which is designed for 3D data analysis and evaluation and ranks existing and emerging approaches for semantic segmentation. It provides a fully annotated dataset of a part of the Hessigheim village in Germany acquired using an unmanned aerial system (UAS). The online repository contains low-altitude oblique images acquired simultaneously with LiDAR data and nadir images acquired with a time shift of a couple of hours.
There are more datasets focused on machine learning and deep learning purposes. However, in these cases, the dataset accuracy is not described, and their main aim is to provide annotated, classified point clouds and meshes to scientists. The best-known and most widely used dataset is Semantic3D, proposed by Hackel et al. [72]. The dataset contains over a billion points in 15 training and 15 testing sets, where each set is a real single scan station. The point clouds are classified using eight general classes. The main issues related to this kind of data are deficiencies in denoising (e.g., cars in motion) and in merging multiple single scans from different locations, as well as the jagged representation of objects caused by obstacles and moving objects. Furthermore, the density of points depends on the distance between the object and the sensor.
The Paris-Lille-3D dataset [73] was acquired using mobile laser scanning and is focused on urban 3D point clouds. It contains three sections (Lille 1, Lille 2, Paris) with a total of about 143.1 million points and 2479 objects, divided into 50 manually labeled classes. Some classes are very similar, e.g., parked car and stopped car, or scooter and motorbike, and some classes present the same object in motion, e.g., mobile car and stopped car. Some of the classes are underrepresented (they occur only once) or have a low number of instances (fewer than 10). It should be pointed out that a description of how the point clouds were processed is given, but the preprocessing results (accuracies) are not available.
Some datasets were produced using UAV imagery. The Campus3D dataset provided by Li et al. [74] should be mentioned, as it is focused on the hierarchical understanding of outdoor scenes. The dataset contains a dense classified point cloud of 1.58 km² covering the National University of Singapore campus. The 937.1 million points are divided into 14 classes using hierarchical and instance-based annotations. However, the accuracies of the image processing are also not available. A similar approach to data acquisition is presented in the SensatUrban Dataset [75], where point clouds from three UK cities (about 7.6 square km of city landscapes) were computed using UAV imagery. Part of the dataset (4.4 square km, related to the cities of Birmingham and Cambridge) was manually labeled using 13 semantic classes. The labeled point clouds contain about 0.570 billion (Birmingham) and 2.279 billion (Cambridge) 3D points. The authors state that the BBA was performed using direct georeferencing based on RTK GNSS measurements, and the resulting coordinates were validated with ground control points acquired by surveyors using high-precision GNSS equipment. As the dataset is aimed at urban-scale 3D semantic segmentation, the accuracy of the photogrammetric processing of the data is not available.
Recently, Gao et al. [76] developed a benchmark dataset of semantic urban meshes. The 3D reconstruction of about 12 square km based on oblique aerial images with a GSD of about 7.5 cm was processed using the ContextCapture commercial off-the-shelf (COTS) software. The presented dataset contains over 19 million triangle faces labeled into seven classes. While the main description of the dataset is focused on semantic labeling purposes and automatic classification accuracy, the information about the BBA is not presented.
The issue of large-scale heritage point cloud semantic segmentation is covered by the ArCH benchmark dataset prepared by Matrone et al. [77]. In terms of detail representation, it provides the most extensive and deepest semantic segmentation, described as LoD 3/4 in CityGML, i.e., indoor and outdoor representations of cultural heritage buildings. The data were acquired by integrating TLS with terrestrial and UAV imagery. They were labeled using a historic building information modeling (HBIM) class topology (10 classes related to historic buildings). To facilitate further processing by users, the ArCH benchmark dataset was subsampled (1–1.5 cm distance between points), and the shared data contain 103 million points (15 objects) for training and validation purposes and 32 million points (2 objects) for testing purposes. However, the data acquisition accuracy is also not mentioned in the study.
Special case study datasets for evaluating simultaneous localization and mapping (SLAM) systems, road detection, and remote sensing image retrieval have also been developed. The accuracy estimation in SLAM is realized by two approaches: open-loop tests, which check the system’s performance in isolation, and closed-loop tests, used for the evaluation of the overall performance of the system. Both approaches are complementary for evaluating the accuracy of SLAM systems. Zhao et al. [78] compared both SLAM tests considering their accuracy, robustness, and computational efficiency. The benchmarking of different types of SLAM algorithms [79] using measurements of the error of the corrected trajectory was proposed by Kümmerle et al. [80]. The datasets with the most significant influence on research works are TUM RGB-D SLAM [36], KITTI-ROAD [31], and PatternNet [39]. Freeware, commercial, and open-source multi-view software solutions have been compared and evaluated in different respects, but only for small artifacts and objects [6,26,33,81,82,83,84].
In the case of medium- and large-sized objects, the efficiency of matching techniques that detect and conform to structural regularities while simultaneously recovering 3D geometry was researched by Ceylan et al. [85] and Grussenmeyer and Khalil [3]. The accuracy and effectiveness of the 3D reconstruction of large objects using the photogrammetric approach were evaluated against TLS data by Gagliolo et al. [28], Koutsoudis et al. [86], and Strecha et al. [87]. UAV-based monitoring applications in the context of geomorphological research were described by Gabara and Sawicki [88] and Jaud et al. [4].
The evaluation of the accuracy of multifaceted 3D building reconstruction, with a focus on the level of detail and the error sources that occur during the modeling process, was described using different validation datasets in [89].
Depending on the authors’ aim, datasets focus on different sensors, a larger number of evaluated applications, a combination of different measurement techniques, and different reconstruction pipelines. In some dedicated scenarios, however, authors have drawn attention to accuracy [28,56,90] and inconvenient conditions [27]. All the mentioned benchmarks showed the strengths and weaknesses of applications depending on the scenario and the tools used for measurements. Thanks to this research, the computation pipelines implemented in photogrammetric software have been constantly upgraded. Simultaneously, the photogrammetric knowledge of non-expert users is increasing.
In the reviewed benchmarks, there are clear deficits in terms of publicly available datasets and information about the theoretical precision of object coordinates and TLS station merging errors. The ISPRS Benchmark Dataset [66], the research of Gagliolo et al. [28], Hessigheim 3D [71], and the Kortowo test area [70] are perhaps closest to our CRBeDaSet in motivation: they focus on high-quality geometric data acquired in real conditions, and in terms of scale representation (LoD), the ArCH dataset [77] is the closest to the presented dataset.
4. CRBeDaSet—Test and Analysis
The presented experiment scenario was designed to enable the assessment of the quality and geometric accuracy of a medium-size building 3D reconstruction based on photogrammetric, multi-view, uncalibrated imagery. The second aim of this dataset was to allow researchers to objectively compare the accuracy, functionality, and reliability of photogrammetric applications for 3D object reconstruction. Our test and analysis were divided into three stages:
The tests of the CRBeDaSet in applications dedicated to 3D object modeling and reconstruction.
The accuracy analysis of the image-based and TLS point clouds.
The evaluation of detectors and descriptors on the object with a homogenous structure.
4.1. Dataset Processing
For verification of the dataset, we used one open-source application, the inteGRAted PHOtogrammetric Suite (GRAPHOS, ver. 1.0.35), and four well-known and very popular commercial suites: ContextCapture (ver. 10.19.00.122), Metashape (ver. 1.8.4.14856 and PhotoScan ver. 1.4.5.7354), Pix4Dmapper (ver. 4.7.5), and RealityCapture (ver. 1.2.1.116295). Due to changes in the computation algorithms in the Agisoft software (ver. 1.5.0), both versions (Metashape and PhotoScan) were included in the tests. The workflow for this stage of analysis is presented in Figure 7.
The digital processing in the tested software was performed using pre-prepared batch files to ensure a similar computation environment. These files contain the image pixel coordinates of all points measured automatically in the external Pictran DE software (ver. 4.3) and the ground coordinates of the photogrammetric control network (artificially signalized and natural points). All computations were performed with minimal user participation and with similar, where possible, computing parameters. The results of three computation stages (image matching, densification, and meshing) were compared using visualizations and tabular descriptions to show the computation differences. The results of the image matching stage and the BBA of the CRBeDaSet processing in five applications dedicated to photogrammetric purposes are presented in Figure 8. While in the case of ContextCapture and PhotoScan most of the tie points are localized on the main object, in the case of Pix4DMapper and RealityCapture many tie points are background points (visible noise) containing grass, shrubs, trees, cars, and the steel-glass elevation of the second building, which is also visible in the GRAPHOS and Metashape results.
Figure 9 shows the results of point cloud densification (based on the ContextCapture example). The first image shows the general look of the reconstructed region. The next images present detailed views of the roof and façade geometry and the point distribution.
The comparison of all software results is presented in Figure A1 (Appendix A). The coverage of the reconstructed region (presented in row 1) differs, from dense point clouds with visible holes (GRAPHOS case) to fully covered objects (ContextCapture case). The Pix4DMapper results look unusual: all regions without visible texture were not reconstructed in the densification step using high resolution (it is possible to recover these regions by using dedicated user parameters). The roof reconstruction shows how each application handles the sky registered in the images and how such points are filtered in the final product. Rows 3 and 4 show how applications reconstruct places where two planes intersect, whether the corners are rounded or present a proper 90-degree shape, and how much noise is generated. Row 5 deals with the connection between different elevation objects and shows how partially visible objects are reconstructed (based on the snow fences for roof tiles). The last row (6) presents the geometry of the reconstructed walls (including places without visible texture).
Figure 10 shows the results of mesh model generation (based on the ContextCapture example). The detailed comparison of the results from the used applications is presented in Figure A2 (Appendix A). Since GRAPHOS does not have a built-in feature to generate mesh geometry, it is not included in the collation. All generated mesh 3D models filled up the holes present in the dense point clouds. In the case of Metashape, PhotoScan, and Pix4DMapper, some visible parts of the sky on the roof were not filtered out. On all models, the shape of the clinker tiles and the wall texture are visible; however, in the case of Metashape and PhotoScan, the microporous geometry of the mesh model is visible. The place where the roof and wall planes meet generates issues in every application, from a slight rounding of the corner (ContextCapture, RealityCapture) through some noise production (PhotoScan) to the generation of extrapolated geometry (Metashape and Pix4DMapper). Pix4DMapper filled the holes in walls without visible texture in the mesh model; however, because of the behavior of the Poisson surface reconstruction algorithm [91], they are not straight but rounded. Avoiding such issues (using planar constraints) is possible, but it requires the user’s manual interference. The geometry of the walls shows that the ContextCapture issue related to the wrong reconstruction of bigger parts of the wall without visible texture (extrapolated geometry creating craters in the walls directed towards the inside of the building) is not filtered out during mesh model computation. Due to the good texture of the mesh model, the geometry imperfections appear to be hidden in other visual presentations.
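As an illustration of a Poisson-type meshing step of the kind referred to above, a minimal sketch assuming the Open3D library and a hypothetical input file (this is not the Pix4DMapper implementation of [91]) could look as follows:

# Minimal sketch, assuming Open3D and a hypothetical input file: Poisson surface
# reconstruction of a dense point cloud.
import open3d as o3d

pcd = o3d.io.read_point_cloud("dense_cloud.ply")  # hypothetical path
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))

# A higher octree depth preserves sharp edges better but amplifies noise.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=10)
o3d.io.write_triangle_mesh("mesh_poisson.ply", mesh)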
The summary of the generated products is presented in Table 3. As the computations were performed on two workstations (an MSI Titan GT77 12UHS laptop and the desktop described in [56]), the computation time is not compared.
4.2. Accuracy Analysis of CRBeDaSet Digital Processing
The analytical study on the geometric accuracy of the generated image-based point clouds includes the analysis of deviations (coordinate differences) between the control point coordinates of the photogrammetric network, measured directly by means of the Leica TS30 total station (the input dataset), and the results (the final software log files) of digital processing and bundle adjustment using the tested software. The geometric analysis workflow is presented in Figure 11.
The summary of the bundle block adjustment computed in the tested software is performed by comparing the RMS values on GCPs (presented in Table 4) and ChPs (presented in Table 5), considering targeted and natural points separately and together.

The tested COTS software allowed point matching on all control points with subpixel accuracy; a mean standard deviation sx’y’ < 0.50 pixel was obtained. The average root mean square errors on control points, RMSE(XYZ) GCPs = 3 mm, and on check points, RMSE(XYZ) ChPs < 3 mm, were determined, which is equivalent to a relative accuracy of ca. 1/6600 in object space. Considering the applied functional model of digital processing, the obtained results can be accepted as sufficient. In the case of the GRAPHOS suite, the obtained errors were higher. This is related to the computation chain, because the coordinates of the GCPs and ChPs are used in the BBA only after the matching and camera calibration procedures, to scale the model. While in the COTS software it is possible to compute the whole model based on the provided GCPs and ChPs, in the GRAPHOS suite their influence on the model is smaller, and correct automatic tie points must be computed.
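To illustrate how such deviation statistics can be derived from the software results, the following minimal Python sketch (not the authors’ processing chain; point identifiers and coordinates are hypothetical) computes per-axis and 3D RMSE from reference and adjusted coordinates:

# Minimal sketch: RMSE of adjusted coordinates against reference total-station
# coordinates of GCPs/ChPs. Assumes two dicts keyed by point id with (X, Y, Z) in metres.
import math

def rmse_xyz(reference, adjusted):
    """Per-axis and combined 3D RMSE over the points present in both sets."""
    ids = sorted(set(reference) & set(adjusted))
    sq = [0.0, 0.0, 0.0]
    for pid in ids:
        for k in range(3):
            d = adjusted[pid][k] - reference[pid][k]
            sq[k] += d * d
    rmse_axis = [math.sqrt(s / len(ids)) for s in sq]
    rmse_3d = math.sqrt(sum(sq) / len(ids))
    return rmse_axis, rmse_3d

# Hypothetical example; note that a 3 mm 3D error on an object of roughly 20 m extent
# corresponds to a relative accuracy of about 0.003 / 20 = 1/6600.
ref = {"101": (0.000, 0.000, 0.000)}
adj = {"101": (0.002, 0.002, 0.001)}
print(rmse_xyz(ref, adj))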
In the second part of the geometric analysis, the TLS point cloud was used. The root mean square errors obtained in the Leica Cyclone application during registration using targeted points localized on the geodetic control network were the following: RMSE(X) = 2.0 mm, RMSE(Y) = 3.8 mm, and RMSE(Z) = 1.8 mm. The accuracy of the TLS point cloud was also evaluated using manual measurements of the centers of the signalized photogrammetric control network points and of natural ChPs localized in the corners of ventilation niches and doorways (measured using elevation planes cut from the point cloud, see Figure 12a). Figure 12b shows how the signalized targets were registered in the TLS point cloud, and Table 6 contains the RMSE values for the measured points.
Although the TLS scan stations were registered with high density, which was very time-consuming, it was observed that one target used for merging scans had moved and influenced elevation 2. The target was eliminated from the registration; however, the merged point cloud had larger registration errors on the elevation 2 (E2) scan. Due to that, the target measurements were divided into two groups: with and without elevation 2. Considering the manual measurement error and the TLS point cloud density, the obtained RMSE values on sCPs and nChPs (excluding E2) are acceptable for geometric analysis purposes.
Because additional structures with signalized ground control points were used for image acquisition, in all comparisons with the TLS point cloud these objects (localized near the building) appear as gross errors. Elevation 2 shows distances between the point clouds which are similar to the TLS registration error. Because of its shape and size, the snow fence for roof tiles is reconstructed with errors in all applications. Additionally, the shape of the roof tiles creates different issues in reconstruction. In all applications, the correct shape of the drainpipes is an issue; however, the errors in the areas near vertical drainpipes are mainly caused by laser-beam slipping. In the case of ContextCapture, the results showed large deformations in bigger parts of the façade where no visible structure is localized. The places where the walls stitch with the roof (90-degree corners) look rounded. The repeated shape of the ventilation grate showed reconstruction errors below 25 mm. The GRAPHOS case shows issues related to the side parts of the façades. While in the center parts the distances to TLS are below 5 mm, on the sides they rise to more than 50 mm in the case of elevation 3 and to 25 mm for elevation 4. Analyzing Metashape and PhotoScan, the noise filtering should be pointed out. It is visible that some effort was made to cut off the points reconstructed from images in which clouds are registered near the roof. Still, there are some issues with rounded corners, which are mainly visible on elevation 4. The Pix4DMapper case shows that the neighborhood of the building is more deformed than in other applications, but the deformation is below 25 mm. Because the reconstruction is limited to the visible features and shapes of the façade, the deformation of the glass brick windows did not occur. Some changes are visible on the ventilation grates, i.e., elevation 1 (errors below 25 mm) and 3 (errors rising to more than 100 mm). The roof of the building shows slightly higher deformations than in other applications. The RealityCapture case shows good performance on walls without visible features and on the roof structure; however, the ventilation grates and glass brick windows showed deformations below 50 mm (elevation 4).
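The cloud-to-cloud comparisons discussed above can be reproduced with standard tooling; a minimal sketch assuming the Open3D library and hypothetical file names computes, for every point of an image-based dense cloud, the distance to its nearest neighbour in the TLS reference cloud:

# Minimal sketch, assuming Open3D and hypothetical file names: nearest-neighbour
# (cloud-to-cloud) distances from an image-based dense cloud to the TLS reference.
import numpy as np
import open3d as o3d

image_based = o3d.io.read_point_cloud("dense_cloud_software_X.ply")  # hypothetical path
tls_reference = o3d.io.read_point_cloud("tls_reference.ply")         # hypothetical path

# Distance from each image-based point to its nearest TLS point.
distances = np.asarray(image_based.compute_point_cloud_distance(tls_reference))

print("median distance [m]:", np.median(distances))
print("share of points above 25 mm:", float(np.mean(distances > 0.025)))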
4.3. Evaluation of Detectors and Descriptors on CRBeDaSet
While the microporous texture could be recognized in the field, the image registration shows that it should be treated as a surface without visible texture for matching purposes. Due to that, the third part of the evaluation is related to different detectors and descriptors. Tests were performed for four elevations (two longer and two shorter) with different characteristics. The selection of images on which the tests were carried out was related to the simulation of real measurement conditions, in which 100% longitudinal overlap does not occur. The workflow for this part of the study is presented in Figure 14.
In case 1, the surface of the wall of elevation 1 (E1) without visible texture is delimited by four big rectangular ventilation grates with repeated shapes, a door with a small information board, drainpipes, and some small objects (technical infrastructure) with visible shapes. The lower part of the façade is covered by clinker tiles. In addition, the images contain ceramic roof tiles (repeated shape), snow fences for roof tiles, paved terrain, trees, and technical infrastructure. The second elevation (E2) contains less wall surface without texture than E1 and includes more doors with information boards, more small technical infrastructure, and two glass brick windows. The third elevation (E3) has many objects with visible shapes, i.e., three double doors with information boards and rectangular ventilation grates (lower and central parts of the doors), a big rectangular ventilation grate near the doors, and technical infrastructure. The wall surface without visible texture is very limited. The fourth elevation (E4) includes four glass block windows and two windows with closed built-in roller shutters, more technical infrastructure, as well as more surface without visible texture. The tests were performed using 12 detectors and 14 descriptors (all-to-all method) using the GRAPHOS suite [7], PhotoMatch software [19], and our own Python implementation.
The results of the different detectors are presented in Figure A8 and Tables S1–S4 of the supplementary files (for the four elevations), while a short description is presented in Table A1, Table A2, Table A3 and Table A4. The images in high resolution are included in the CRBeDaSet_tests online repository.
The outcome of the different detectors and descriptors was divided into three groups based on the number of errors during the matching. The groups were defined by a matching result expressed as a function of the number of matched points and the number of wrongly matched points (Equation (10)). The results of the matching tests are presented in Table S5 of the supplementary file in the form of images with marked points and in the form of tables (Table A1, Table A2, Table A3 and Table A4 of Appendix B) using the description from Equation (10). Furthermore, the detection and description times, the number of keypoints, and the number of matched points for a short (E3) and a long (E4) elevation are presented in Tables S6 and S7 of the supplementary file. Based on the division of the matching results from the four elevations (Table A1, Table A2, Table A3 and Table A4 of Appendix B), the analysis showed which descriptors and detectors might be used in the case of buildings with large parts of walls with no visible features (Table 7).
Furthermore, based on the number of matches, the detectors (including all used descriptors for which the matching result is positive) might be classified into three main groups:
Below 20 matched points (GFTT, MSD, ORB, SIFT, STAR).
Between 20 and 100 matched points (BRISK, KAZE).
Over 100 matched points (AGAST, SURF).
In the case of FAST and AKAZE, the results depend on the object features registered in the images; therefore, both fall between groups 2 and 3. The MSER detector falls between groups 1 and 2. While for group 3 most of the points are localized on the clinker bricks, for group 1 the points are mainly localized on information boards, corners of ventilation niches, and window and door niches. An interesting case is the MSD detector with the VGG and SIFT descriptors, which shows matched points mainly near the artificially signalized ground control points. Considering descriptors, BOOST yields a visibly smaller number of matched points, and the use of LSS is the most time-consuming.
5. Discussion
A new CRBeDaSet benchmark dataset is presented for evaluating image-based 3D modeling and reconstruction techniques. High-accuracy survey instruments, as well as professional geodetic and photogrammetric software, were used to prepare the analytical data of the CRBeDaSet, because only high-quality and accurate input data provide a reliable evaluation of the quality of algorithms and applications.
The first practical experiences of using the CRBeDaSet to assess the accuracy and reliability of the tested applications (ContextCapture, GRAPHOS, Metashape, PhotoScan, Pix4Dmapper, and RealityCapture) confirmed the usefulness of the prepared real data. The CRBeDaSet has a number of characteristics that can support the development of new approaches to the high-accuracy 3D reconstruction of medium-sized objects (object shapes with a volume of ca. 1000 m³) in close range. The development of applications dedicated to photogrammetric reconstruction is still in high demand. Based on the presented results and previous works [26], it can be seen that in recent years developers have put effort into noise filtering, smoothing of the reconstructed surfaces, and computation time. Still, some computation errors might occur during the matching part, which could be caused by imperfections of the software, e.g., the ineffectiveness of algorithms or their incorrect implementation.
Since the test object presented in the CRBeDaSet is composed of paving, walls partially covered with bricks, multiplied objects with sharp shapes (ventilation grates), and the homogeneous texture of the wall and roofing tiles, there should be enough features to process the data. However, during the processing of the presented dataset, the following difficulties occurred:
Faulty matching of tie points is related to a brute force approach where all possible image pairs are evaluated (Figure 15a).
Problems with the optimal selection of tie points and weight assignment in the adjustment process (Figure 15b–d).
Matching problems related to the repeated regular shapes and colors of the roofing tiles and bricks (Figure 15e), and to windows and vent holes localized near the homogeneous texture of the wall (Figure 15f).
Shortages in point clouds (Figure 15g) and mesh discontinuities on projection-distorted regular parts of the roof.
Some of the tested software has difficulty with the correct automatic filtering of dense point clouds during mesh generation (Figure 15h).
The first three errors could be resolved by using different matching algorithms, and for that reason we have prepared a comparison of the detectors and descriptors. Based on the tables (Table A1, Table A2, Table A3, Table A4 and Table S6) and images (Tables S1–S5), it is possible to choose algorithms that meet developers’ assumptions. The process either deals with a large number of tie points, which might result in errors such as those presented in Figure 15b–d, or with a limited number of tie points to achieve a shorter computation time with a smaller number of matches. In our study, to perform tests without manual tuning of the detectors’ and descriptors’ parameters (i.e., feature number, octave layers, thresholds), the standard values were used. By tuning these factors, different results might be obtained.
The second part of the dataset is related to TLS point cloud registration and the semantic segmentation of point clouds. The presented CRBeDaSet contains a complex photogrammetric control network, which might also be used for scan station registration. In recent years, the topic of TLS point cloud registration has been explored, and according to Li et al. [94], there are three ways to increase the accuracy of the computation: firstly, by improving filtering algorithms; secondly, by developing or enhancing feature descriptors; and thirdly, by making progress in 3D surface matching. Our CRBeDaSet allows scientists to work on the first two of these, and the already-performed computations might be used as a reference. Based on our proposed dataset, other researchers might train their machine learning algorithms (i.e., self-organizing maps or K-means) to filter noisy point clouds and reach a better accuracy of the final 3D model [95], which is very important, especially in cultural heritage studies.
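As a purely illustrative sketch of such an unsupervised filtering step (a simple K-means variant under our own assumptions, not the method of [95]), noisy points could be separated from surface points using a per-point neighbourhood-distance feature:

# Illustrative sketch: K-means on a per-point feature (mean distance to the k nearest
# neighbours) to separate sparse, noisy points from dense surface points.
# The input file name is hypothetical.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans

points = np.loadtxt("point_cloud.xyz")[:, :3]   # hypothetical XYZ text file

k = 10
tree = cKDTree(points)
dist, _ = tree.query(points, k=k + 1)           # first neighbour is the point itself
feature = dist[:, 1:].mean(axis=1).reshape(-1, 1)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feature)
noise_label = np.argmax([feature[labels == i].mean() for i in (0, 1)])
clean = points[labels != noise_label]
print("kept", len(clean), "of", len(points), "points")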
Furthermore, in the second part of our dataset, the LoD3 classification topology was used for the 3D outdoor point cloud. This is important for the future development of semantic segmentation algorithms, because indoor scenes in the LoD3 class topology are available, whereas there is a deficit of outdoor scenes.
The dataset can be accessed via the DOI repository websites, CRBeDaSet: http://dx.doi.org/10.17632/9nvn27yw62.2 (images, coordinates, and raw merged TLS point cloud) and CRBeDaSet_tests: http://dx.doi.org/10.5281/zenodo.7496481 (raw scan stations, filtered point cloud in the control network coordinate system, annotated point clouds, high-resolution images, analyses of point clouds obtained using COTS software, TLS to image-based point cloud distance results, and matching parts).
Future investigations will be focused on developing and extending the CRBeDaSet with scenarios dedicated to the highest geometric accuracy of 3D object reconstruction, using precise geodetic surveying for control networks and direct georeferencing, multi-view terrestrial and UAV imagery, corridor mapping in GPS-denied areas, and, additionally, more TLS scan stations.