1. Introduction
An underground utility tunnel is an underground structure in a downtown area that houses major lifeline supply facilities, such as electricity, communication, and water supply lines [1]. Because underground utility tunnels serve as a major supply chain for a downtown area, their nature and location require careful safety management to prevent events such as pipeline rupture or fire. However, the long and narrow geometry of underground utility tunnels makes their maintenance and management difficult. To address these issues, numerous studies on digital twin technology for efficient monitoring are currently underway. Digital twin technology builds a virtual model of the real world that reflects the shape and location of objects. A safety-and-disaster digital twin model must reflect changes in the state of objects and spaces in the real world, such as cracks in structures, fire outbreaks, and relocated fire extinguishers. Underground utility tunnel ducts are mainly monitored through closed-circuit television (CCTV) cameras, and many studies have applied artificial intelligence (AI) technologies to interpret what these cameras observe. However, studies that focus on locating changed objects on digital twin models are limited. Stereo and depth cameras are widely used to determine the location of changed objects; however, replacing the CCTV cameras already installed along the long extent of an underground utility tunnel is expensive and time-consuming.
Moreover, a monocular camera image, such as a CCTV image, lacks feature information related to the size of an object. Various deep learning algorithms have therefore been proposed, and their accuracy has improved sufficiently to estimate image depth close to the results obtained from binocular images. The goal of this study is to assess the applicability of these algorithms in an actual underground utility tunnel. In this study, the DenseDepth transfer learning method, which is trained with depth images derived from light detection and ranging (LiDAR) data, and the coordinate system conversion method based on floor plane projection were used to estimate image depth, and their results were compared. From this comparison, an appropriate method for estimating distance using a single camera in an actual underground utility tunnel could be derived. This study contributes to solving the problem of reflecting the position of a real-world object in the digital twin model.
2. Related Research
The technology of estimating real distance using an image is utilized in various fields of computer vision, such as autonomous driving. Typically, laser scanning provides high-accuracy location information, but its application in real environments is expensive.
To reduce this cost, stereo matching has been used to estimate the location of objects from images acquired by two or more cameras. Stereo matching has been studied extensively in computer vision, and studies have used a calibration process so that two cameras mounted in parallel can operate as a stereo pair [2,3,4,5]. However, cameras used for stereo imaging often have different poses and focal lengths. In addition, brightness can differ between the two camera positions depending on the lighting, or within an image depending on the aperture exposure, and such conditions cause problems that are difficult to correct [6]. Many studies [7,8] have been conducted to address these problems; in particular, an algorithm based on a convolutional neural network (CNN) was presented [7], and various stereo matching algorithms have since been announced [4,8,9].
Underground utility tunnels are long, narrow security facilities, and it is difficult to install the additional cameras required for stereo matching to estimate depth. Therefore, in this study, depth estimation using a single camera was investigated. Single-image depth estimation imitates the human visual system, which can infer depth from monocular cues, such as shading and texture, using only a single image [10]. Although studies have been conducted to estimate depth through object outlines [11], segments, and shadows [12,13], reliable depth information cannot be obtained in this way because these approaches rely on a limited set of cues and rules. Numerous studies have therefore estimated depth from a single camera using a CNN. CNN-based single-image depth estimation was first proposed in [14], where neural networks refined coarse depth predictions. Various new methods have been proposed to improve single-image depth-estimation performance [15,16], including studies that use regression loss functions to improve the speed and performance of learning models [17] and studies that apply residual learning to model the ambiguous mapping between single images and depth maps more accurately [18]. In addition, a transfer learning-based method with an improved loss function was proposed for faster training and higher depth-estimation accuracy and quality [19]. Most depth-estimation algorithms are based on supervised neural networks, but research has also addressed the disadvantages of existing unsupervised depth-estimation methods by introducing a knowledge-distillation-based approach [20].
Methods for measuring distance from a single image without a CNN model, based on surveying techniques, have also been proposed. One study [21] used triangulation to indirectly estimate the distance to an object from known planes and angles, without direct measurement.
Because depth estimation from a single image is computationally expensive, another study matched the relationship between image projection points and their corresponding spatial points [22]; it tracked the absolute position of a target, proposed an error-correction algorithm, and presented a model that maps 3D spatial points to 2D images.
Although various single-image depth-estimation methods have been studied, only a few studies have directly compared, tested, and applied them in the field. In this study, single-image depth-estimation methods applicable to the CCTV cameras installed in underground utility tunnels were selected and compared. To this end, we attempted to derive a model suitable for an underground utility tunnel by comparing a CNN-based single-image depth-estimation method with a method that maps a 2D image onto a 3D model.
3. DenseDepth Transfer Learning-Based Depth Estimation
3.1. Depth Measurement Based on a Single Camera
Prior to extracting three-dimensional (3D) spatial information using AI, an on-site investigation is required because data must be collected directly in the underground utility tunnel. Therefore, an actual underground utility tunnel was visited, and data were collected with a low-light camera and a thermal-imaging CCTV camera, as shown in Figure 1. In the image data, ground-truth distance markers are displayed at 10 m intervals from 10 m to 50 m so that the depth-estimation results can be compared against the images.
3.2. Using Laser Scan Data from DenseDepth
DenseDepth is a monocular-image depth-estimation project implemented by Ibraheem Alhashim and Peter Wonka using transfer learning, and it is openly available on GitHub [19].
Figure 2 shows the network structure, an encoder–decoder model in which the encoder is a pre-trained DenseNet-169 architecture and the decoder is composed of basic convolutional-layer blocks.
To create a model that estimates a depth value for each two-dimensional (2D) image coordinate, training was performed using the original RGB images together with images whose pixel values serve as depth labels. A supervised learning method was used: the network estimates the depth of the RGB image, compares it with the actual depth of the labeled image, and updates its weights accordingly. DenseDepth provides two pre-trained models, one for the KITTI road-driving dataset and one for the NYU indoor dataset. The KITTI model was selected as the test model because its images are geometrically similar to underground utility tunnel images in terms of vanishing point and maximum distance. The upper part of Figure 3 is an original RGB image provided by KITTI, and the lower part is the image obtained by converting the corresponding LiDAR distance data into depth-value labels. The image size is 1242 × 375 pixels, and the depth range is from 1 m to 80 m.
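As a rough illustration of how such a pre-trained model is queried, the following Python sketch loads a Keras .h5 checkpoint and predicts a depth map for one frame. The file name, input resolution, and post-processing are assumptions made for illustration; the actual DenseDepth repository defines its own custom layers and helper utilities, which are omitted here.

```python
# Minimal sketch (not the exact DenseDepth API): load a pretrained Keras .h5 depth model
# and predict a dense depth map for a single RGB frame.
import numpy as np
from tensorflow import keras

MODEL_PATH = "kitti.h5"          # assumed checkpoint file name
INPUT_SIZE = (384, 1248)         # assumed (height, width) accepted by the network

# The real DenseDepth checkpoint also needs its custom upsampling layer registered
# via custom_objects; that detail is omitted in this sketch.
model = keras.models.load_model(MODEL_PATH, compile=False)

def predict_depth(rgb_path, max_depth_m=80.0):
    """Return a depth map in metres for one image (illustrative post-processing)."""
    img = keras.preprocessing.image.load_img(rgb_path, target_size=INPUT_SIZE)
    x = keras.preprocessing.image.img_to_array(img) / 255.0   # normalize to [0, 1]
    pred = model.predict(np.expand_dims(x, axis=0))[0, :, :, 0]
    return np.clip(pred, 1.0, max_depth_m)                    # clamp to the KITTI range

depth = predict_depth("tunnel_frame.png")
print(depth.shape, depth.min(), depth.max())
```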
3.3. Outdoor Road-Depth-Extraction Test Results
To determine the reliability of the h5 model of the KITTI road-driving dataset learned using Keras, a deep learning library, a test was conducted on an actual outdoor road.
Figure 4 shows the depth-value contours extracted from images taken on roads in Korea. After placing objects at known distances from 10 m to 50 m, the error between the known distance and the depth value extracted by the DenseDepth model was calculated on the image.
Table 1 lists the error values according to the distance from the camera to the object on the road. A root mean squared error (RMSE) of 4.17 was obtained. The KITTI model was therefore considered sufficiently reliable for road depth extraction and was subsequently applied to the underground utility tunnel images.
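For reference, the reported RMSE is presumably computed in the usual way over the n marked distances, with estimated depth \(\hat{d}_i\) and measured distance \(d_i\):

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{d}_i - d_i\right)^{2}} \]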
3.4. Underground Utility Tunnel Depth-Extraction Test Results
Similar to the road depth-extraction test, a depth-value-extraction test was conducted on underground utility tunnel images using the KITTI h5 model. The test image, 1920 × 1080 pixels in size, was taken with a low-light camera in a specific underground utility tunnel in Korea.
Figure 5 shows the contour image of the depth-value extraction, which is the result of the test.
Table 2 shows the depth-extraction-value error for each underground utility tunnel distance obtained using DenseDepth.
Unlike the road-image test, the test on the underground utility tunnel produced extremely large errors. This may be caused by differences between the underground utility tunnel images and the road images used to train the model, such as the location of the vanishing point and the RGB color distribution along each boundary line. Therefore, the KITTI model was considered unsuitable for extracting depth in the underground utility tunnel. Directly scanning the tunnel to create a dedicated depth model would require laser equipment and incur additional costs. Thus, a method of extracting depth values through a coordinate system conversion algorithm was applied instead.
4. Coordinate System Conversion Algorithm-Based Depth Estimation
4.1. Conversion of 2D and 3D Coordinate Systems
Coordinate system information is required to obtain 3D information from a camera image. The 3D real-world (X, Y, Z) coordinates pass through the camera lens and are projected onto the image sensor, creating an image consisting of 2D (x, y) coordinates. In this case, when only an image is given without 3D information, a single 3D real-world coordinate that matches the image coordinate cannot be obtained theoretically. This is because the Z value, which is the distance from the camera, is unknown. Therefore, in this study, by fixing any one of the X, Y, or Z axes to a specific value, the 3D world was reduced to a 2D plane, and the 2D image coordinates were back-projected to determine the point of intersection with the plane. Through this procedure, a depth value corresponding to the Z-axis could be extracted, based on the camera coordinate system. However, the (X, Y, Z) 3D coordinate system was assumed to be a completely empty space without any objects.
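One way to formalize this procedure (a sketch, not reproduced from the paper's own equations): a pixel defines a viewing ray from the camera center C in the direction d, both expressed in world coordinates, and fixing one axis, here the floor plane Z = 0, turns the problem into a ray-plane intersection:

\[ \mathbf{X}(\lambda) = \mathbf{C} + \lambda\,\mathbf{d}, \qquad \lambda^{*} = -\frac{C_{Z}}{d_{Z}}, \qquad \mathbf{X}^{*} = \mathbf{C} + \lambda^{*}\mathbf{d}. \]

The depth along the camera's Z axis then follows by transforming X* back into the camera coordinate system, which is the quantity computed from the calibrated camera parameters in the remainder of this section.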
When three-dimensional objects such as pipelines are ignored, the underground utility tunnel has the shape of a rectangular box elongated in the longitudinal direction. This box-shaped structure, assumed to be an empty space, can later be overlaid with a standard grid to represent the regions where anomalous signals occur in the digital twin space.
Figure 6 is a conceptual drawing of the standard grid superimposed on the underground utility tunnel.
4.2. Construction and Surveying of Environment for Image Mapping (World Coordinate System Setting)
Coordinate systems in a 3D space can be divided into two major categories. A camera coordinate system is based on the camera that acquired the image, whereas a world coordinate system is based on an arbitrary point in the real world. The camera coordinate system has the center of the lens as the origin, the horizontal and vertical directions of the image are the X and Y axes, respectively, and the direction to the principal point is the Z axis. It is automatically determined according to the camera position and direction. In contrast, the world coordinate system can be arbitrarily set as X, Y, and Z axes as well as the origin, and it is commonly set inside the space where the camera image was acquired.
Figure 7 shows the world coordinate system set in the test environment. The floor was set as the X-Y plane using a flat point on the ground. Number plates with black numbers on a white background, whose positions were measured, were placed on the floor to facilitate the image coordinate calculation.
Figure 8 shows the measured layout of the floor-plane world coordinate system and the CCTV camera. The height of the camera was 1 m 30 cm, and the distance between the camera and the first row of number plates was 5 m 60 cm. A total of nine plates were placed at intervals in a 3 × 3 arrangement covering an area 5 m 60 cm wide and 21 m 40 cm long.
4.3. Camera Calibration
To derive the conversion formula between the camera and world coordinate systems, several internal parameters of the camera must be known in advance: the focal lengths (f_x, f_y) and the principal point (c_x, c_y), which describe the lens and image sensor, and the radial (k_1, k_2) and tangential (p_1, p_2) distortion coefficients. Obtaining these parameters is called camera calibration.
Camera calibration is performed by geometrically interpreting images of a 3D or 2D calibration object that has many prominent feature points. The most commonly used calibration object is a chessboard in which black and white squares alternate. The GML Camera Calibration Toolbox provided by the Graphics and Media Lab produces calibration results by applying OpenCV, which implements Zhang's method [24], to multi-view images of the same chessboard.
Figure 9 shows a chessboard image used for the internal calibration of the SNO-6084R CCTV camera. The chessboard squares are 3 cm × 3 cm, and a 9 × 6 grid of corner points can be detected.
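As an illustrative sketch of this calibration step with OpenCV, using the 9 × 6 corner grid and 3 cm squares described above (file names are placeholders, not the study's data):

```python
# Sketch: Zhang-style calibration from multi-view chessboard images with OpenCV.
import glob
import cv2
import numpy as np

PATTERN = (9, 6)        # inner-corner grid reported for the chessboard
SQUARE_M = 0.03         # 3 cm square size

# 3D object points of the chessboard corners in the board's own plane (Z = 0).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_M

obj_points, img_points = [], []
for path in glob.glob("chessboard_*.png"):          # placeholder image names
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        continue
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)

# K holds fx, fy, cx, cy; dist holds k1, k2, p1, p2 (and k3).
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("camera matrix:\n", K, "\ndistortion:", dist.ravel())
```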
Table 3 lists the internal factors and distortion coefficients of the SNO-6084R CCTV camera obtained by internal calibration using the chessboard images. The internal factors (the focal lengths f_x, f_y and the principal point c_x, c_y) are measured in pixels, and the distortion coefficients (k_1, k_2, p_1, p_2) are dimensionless constants.
4.4. Removing Camera Distortion
Due to the structural characteristics of cameras, distortion occurs when capturing images. The main distortion types are radial distortion, related to the lens, and tangential distortion, related to the image sensor. Radial distortion is a phenomenon in which light rays passing farther from the center of the lens are bent more strongly, appearing mainly as barrel distortion. Tangential distortion arises during the camera manufacturing process, when the lens and image sensor planes are not perfectly parallel but slightly inclined. Other types of distortion also occur, but because radial and tangential distortions are comparatively large, distortion correction mainly targets these two. To obtain an accurate back projection from image coordinates to 3D coordinates, this distortion correction must be performed first. Both radial and tangential distortions can be corrected using the coordinates obtained by projecting the image coordinates onto the normalized coordinate system.
Equation (1) is the lens model that converts undistorted coordinates (x_u, y_u) on the normalized coordinate system into coordinates (x_d, y_d) with radial distortion applied:

x_d = x_u (1 + k_1 r^2 + k_2 r^4),  y_d = y_u (1 + k_1 r^2 + k_2 r^4),  r^2 = x_u^2 + y_u^2.   (1)

Here, k_1 and k_2 are the radial distortion coefficients obtained in the camera calibration step, and r is the radius from the optical center in the normalized coordinate system. If the distortion is severe, a higher-order term k_3 r^6 may be added, following the Taylor-series form of the model.
Equation (2) is the corresponding tangential distortion model with tangential distortion coefficients p_1 and p_2:

x_d = x_u + 2 p_1 x_u y_u + p_2 (r^2 + 2 x_u^2),  y_d = y_u + p_1 (r^2 + 2 y_u^2) + 2 p_2 x_u y_u.   (2)
Distortion is usually proportional to the camera's field of view. The angle of view of the SNO-6084R CCTV camera is approximately 105°, which is narrower than that of a wide-angle camera (120° or more) but wider than that of a typical web camera (approximately 80°), so its distortion falls between the two. The right-hand picture in Figure 10 is the distortion-corrected image obtained using Equations (1) and (2). To the naked eye, the distortion is small enough that the difference before and after correction is difficult to distinguish. However, because a difference of only a few pixels can considerably change the coordinate-conversion result, the distortion-correction step must still be performed.
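A minimal sketch of this correction step with OpenCV, assuming the camera matrix K and distortion coefficients from Table 3 are available as arrays (the numeric values below are placeholders):

```python
# Sketch: remove lens distortion using calibrated intrinsics (placeholder values).
import cv2
import numpy as np

K = np.array([[1000.0, 0.0, 960.0],     # fx, 0, cx   (placeholder pixels)
              [0.0, 1000.0, 540.0],     # 0, fy, cy
              [0.0, 0.0, 1.0]])
dist = np.array([-0.30, 0.10, 0.001, 0.001, 0.0])   # k1, k2, p1, p2, k3 (placeholders)

img = cv2.imread("tunnel_frame.png")
# Compensates the radial/tangential distortion described by Equations (1) and (2).
undistorted = cv2.undistort(img, K, dist)
cv2.imwrite("tunnel_frame_undistorted.png", undistorted)
```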
4.5. Transformation Matrix between Camera and World Coordinate Systems
Once the internal factors and distortion coefficients are obtained through camera calibration, the transformation matrix between the camera and world coordinate systems can be determined. The transformation is composed of a rotation matrix R and a translation vector t. The rotation matrix R represents successive rotations by the angles ψ, φ, and θ around the x, y, and z axes, respectively, in three dimensions. The translation vector expresses the displacement of the coordinate origin along the x, y, and z axes; that is, it can be viewed as the offset between the origins of the two coordinate systems. When the transformation is known, a point P_w = (X_w, Y_w, Z_w) in the world coordinate system can be converted into a point P_c = (X_c, Y_c, Z_c) in the camera coordinate system through Equation (3):

P_c = R P_w + t.   (3)
Table 4 presents the conversion matrix between the camera and world coordinate systems obtained using the solvePnP function of the OpenCV library.
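The following sketch shows how such an extrinsic transformation might be computed with OpenCV's solvePnP, assuming the measured floor coordinates of the number-plate corners and their pixel coordinates are available (the arrays below are placeholders, not the values of Table 4):

```python
# Sketch: estimate the rotation matrix R and translation vector t (Equation (3))
# from known world points on the floor and their image coordinates.
import cv2
import numpy as np

# Placeholder correspondences: world coordinates (metres, Z = 0 on the floor)
# and the matching undistorted pixel coordinates.
world_pts = np.array([[0.0, 5.6, 0.0],
                      [2.8, 5.6, 0.0],
                      [5.6, 5.6, 0.0],
                      [0.0, 13.5, 0.0]], dtype=np.float64)
image_pts = np.array([[512.0, 690.0],
                      [960.0, 688.0],
                      [1410.0, 691.0],
                      [958.0, 520.0]], dtype=np.float64)

K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])                    # placeholder intrinsics
dist = np.zeros(5)                                 # points assumed already undistorted

ok, rvec, tvec = cv2.solvePnP(world_pts, image_pts, K, dist)
R, _ = cv2.Rodrigues(rvec)                          # rotation vector -> 3x3 matrix
print("R =\n", R, "\nt =", tvec.ravel())
```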
4.6. Implementation of the Camera Image and 3D Model Space Mapping Automation SW Module
If the focal length (the distance between the camera lens and the image sensor) is known, the projection relationship between the camera sensor coordinates and world coordinates can be expressed in a homogeneous coordinate system, as shown in Equation (4) [25]:

s (x, y, f)^T = [R | t] (X, Y, Z, 1)^T,   (4)

where s is a scale factor. For R and t in Equation (4), the values of the external transformation matrix obtained in Table 4 are used. The focal length f is a physical distance in mm. In contrast, the focal length f_x (or f_y), an internal factor obtained in the camera calibration step, is a value in pixels, obtained by multiplying the actual focal length by the number of image sensor elements per mm. Therefore, if the actual focal length f is unknown, it can be recovered by multiplying f_x or f_y obtained through camera calibration by the length in mm per pixel of the image sensor. In this study, 1 was substituted for f in Equation (4), so that (x, y, 1) is a coordinate on the normalized coordinate system at a distance of 1 mm from the camera.
If both sides of Equation (4) are divided by the scale factor s and the terms are rearranged with respect to the unknowns X and Y, the equation can be re-expressed as the binary linear system of Equation (5):

r_11 X + r_12 Y + r_13 Z + t_1 = x (r_31 X + r_32 Y + r_33 Z + t_3),
r_21 X + r_22 Y + r_23 Z + t_2 = y (r_31 X + r_32 Y + r_33 Z + t_3),   (5)

where r_ij and t_i denote the elements of R and t. If the Z value is then set to 0 and the system is solved, the intersection point with the floor plane can finally be obtained.
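A compact sketch of this back-projection, under the same assumptions as above (an undistorted pixel converted to normalized coordinates with the calibrated K, the floor plane Z = 0, and R and t as produced by solvePnP):

```python
# Sketch: back-project an undistorted pixel onto the floor plane Z = 0 (Equation (5)).
import numpy as np

def pixel_to_floor(u, v, K, R, t):
    """Return world (X, Y, 0) where the viewing ray of pixel (u, v) meets the floor."""
    # Normalized image coordinates (f = 1): x = (u - cx)/fx, y = (v - cy)/fy.
    x = (u - K[0, 2]) / K[0, 0]
    y = (v - K[1, 2]) / K[1, 1]
    r, tt = R, t.ravel()
    # Rearranged projection equations with Z = 0: a 2x2 linear system in X and Y.
    A = np.array([[r[0, 0] - x * r[2, 0], r[0, 1] - x * r[2, 1]],
                  [r[1, 0] - y * r[2, 0], r[1, 1] - y * r[2, 1]]])
    b = np.array([x * tt[2] - tt[0],
                  y * tt[2] - tt[1]])
    X, Y = np.linalg.solve(A, b)
    return np.array([X, Y, 0.0])

# Example usage with the placeholder K, R, tvec from the previous sketches:
# floor_point = pixel_to_floor(960.0, 700.0, K, R, np.asarray(tvec))
# depth_along_camera_z = (R @ floor_point + np.asarray(tvec).ravel())[2]
```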
4.7. Test Results
To test the accuracy of the 2D → 3D mapping conversion formula, the image coordinates of the corner points of the number plates were projected onto the floor, and the resulting intersection world coordinates (X, Y, Z) were compared with the known world coordinates.
Figure 11 shows the 3D coordinate conversion result image for the nine corner points of the number plates used in the test.
Table 5 lists the errors calculated as the Euclidean distance sqrt((X − X′)^2 + (Y − Y′)^2 + (Z − Z′)^2) between the estimated and known coordinates. The average error of the nine points was about 105 cm.
5. Result and Discussion
A great deal of research has been conducted on depth-estimation methods, but little of it compares these methods by applying them to actual environments, especially underground utility tunnels. We therefore compared two methods of estimating depth with a single camera.
The DenseDepth transfer learning method produced an error of 1 m at a distance of 10 m and an error of 4 m at a distance of 20 m. In contrast, the coordinate system conversion method based on floor plane projection produced an error of 23 to 27 cm at a distance of 5 m 60 cm and an error of 30 to 112 cm at a distance of 21 m 40 cm, a significantly better accuracy than that of the DenseDepth transfer learning method. Therefore, the coordinate system conversion method based on floor plane projection can be applied to estimate the new position of an object that has moved, as observed through the single-camera CCTV already installed in an underground utility tunnel.
However, the results also show that the coordinate system conversion method can have a wide error range even at the same distance. Determining the reason for this will require measuring the distances of more points and conducting further research.
6. Conclusions
The purpose of this study was to compare methods of measuring the depth of an object from a single camera image, for use in locating real-world objects in digital twin models of underground utility tunnels. The DenseDepth transfer learning method and the coordinate system conversion method based on floor plane projection were selected and compared, and the coordinate system conversion method was found to be more suitable. The significance of this study is that it applied and tested depth-estimation technology that can use the single-camera CCTV already installed in an underground utility tunnel to link real-world objects to digital twin models. Follow-up studies, such as applying the coordinate system conversion method to actual underground utility tunnels and reviewing other methodologies, are still required.
Author Contributions
Conceptualization, S.P. and I.H.; methodology, C.H. and I.H.; investigation, J.L., S.P., and C.H.; data curation, S.P.; writing—original draft preparation, S.P.; writing—review and editing, S.P.; All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT, MOIS, MOLIT, MOTIE) (No. 2020-0-00061, Development of integrated platform technology for fire and disaster management in underground utility tunnel based on digital twin).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Kim, H.; Lee, M.; Jung, W.; Oh, S. Temperature monitoring techniques of power cable joints in underground utility tunnels using a fiber Bragg grating. ICT Express. 2022, 8, 626–632. [Google Scholar] [CrossRef]
- Faugeras, O. Three-Dimensional Computer Vision: A Geometric Viewpoint; The MIT Press: London, UK, 1993. [Google Scholar]
- Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
- Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2017, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Zhang, F.; Prisacariu, V.; Yang, R.; Torr, P.H.S. GA-Net: Guided Aggregation Net for End-to-End Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
- Kim, J.; Park, J.; Lee, J. Comparison Between Traditional and CNN Based Stereo Matching Algorithms. J. Inst. Control Robot. Syst. 2020, 26, 335–341. [Google Scholar] [CrossRef]
- Park, J.; Song, G.; Lee, J. Shape-indifferent stereo disparity based on disparity gradient estimation. Image Vis. Comput. 2017, 57, 102–113. [Google Scholar] [CrossRef]
- Zbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 1–32. [Google Scholar]
- Chang, J.R.; Chen, Y.S. Pyramid stereo matching network. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2018, 5410–5418. [Google Scholar]
- Zhuo, W.; Salzmann, M.; He, X.; Liu, M. Indoor scene structure analysis for single image depth estimation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Lee, D.C.; Hebert, M.; Kanade, T. Geometric reasoning for single image structure recovery. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
- Hoiem, D.; Efros, A.A.; Hebert, M. Geometric context from a single image. In Proceedings of the Tenth IEEE International Conference on Computer Vision, Beijing, China, 17–21 October 2005. [Google Scholar]
- Zhang, R.; Tsai, P.-S.; Cryer, J.E.; Shah, M. Shape-from-shading: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 690–706. [Google Scholar] [CrossRef]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 2014, 3, 2366–2374. [Google Scholar]
- Liu, F.; Shen, C.; Lin, G. Deep convolutional neural fields for depth estimation from a single image. CVPR 2015, 5162–5170. [Google Scholar]
- Kim, Y.; Jung, H.; Min, D.; Sohn, K. Deep Monocular Depth Estimation via Integration of Global and Local Predictions. IEEE Trans. Image Process. 2018, 27, 4131–4144. [Google Scholar] [CrossRef] [PubMed]
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. CVPR 2018, 2002–2011. [Google Scholar]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the Fourth International Conference on 3D vision (3DV), Stanford, CA, USA, 25–28 October 2016. [Google Scholar]
- Alhashim, I.; Wonka, P. High Quality Monocular Depth Estimation via Transfer Learning. arXiv 2018, arXiv:1812.11941. [Google Scholar]
- Hu, W.; Dong, W.; Liu, N.; Chen, Y. LUMDE: Light-Weight Unsupervised Monocular Depth Estimation via Knowledge Distillation. Appl. Sci. 2022, 12, 12593. [Google Scholar] [CrossRef]
- Bui, M.T.; Doskocil, R.; Krivanek, V.; Ha, T.H.; Bergeon, Y.; Kutilek, P. Indirect method usage of distance and error measurement by single optical cameras. Adv. Mil. Technol. 2018, 13, 209–221. [Google Scholar] [CrossRef]
- Zhang, Z.; Han, Y.; Zhou, Y.; Dai, M. A novel absolute localization estimation of a target with monocular vision. Optik 2013, 124, 1218–1223. [Google Scholar] [CrossRef]
- Divam Gupta. A Beginner’s Guide to Deep Learning based Semantic Segmentation using Keras. 2019. Available online: https://divamgupta.com/image-segmentation/2019/06/06/deep-learning-semantic-segmentation-keras.html (accessed on 8 December 2022).
- Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
- Heikkila, J.; Silvén, O. A Four-step Camera Calibration Procedure with Implicit Image Correction. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1997, 1106–1112. [Google Scholar]