Semantic Segmentation and 3D Reconstruction of Concrete Cracks
Abstract
1. Introduction
- Addressing data inadequacy challenges by producing a challenging and complex dataset for crack segmentation, including images with various resolutions and with a variety of crack shapes/sizes, then developing a semi-supervised data annotation approach and investigating the potential of GAN-based data augmentation.
- Investigating crack segmentation issues by benchmarking the performance of four state-of-the-art semantic segmentation approaches (SegNet, UNet, Attention Res-UNet, and structured prediction (SP)) for identifying cracks; validating the impact of various data augmentation approaches; testing the sensitivity of the models to image scale and resolution; and proposing the use of a loss function based on Intersection over Union (IoU) to reduce the impact of class imbalance on model performance.
- Proposing a new model for calibrating a commercial stereo camera (ZED by Stereolabs); and investigating single-image and stereo 3D reconstruction of segmented cracks based on planarity assumptions and stereo inference, respectively.
2. Methodology
2.1. Dataset Preparation and Augmentation
2.2. Semantic Segmentation Architectures
2.2.1. SegNet
2.2.2. UNet
2.2.3. Structured Prediction
2.3. Training Details
2.3.1. Transfer Learning
2.3.2. Early Stopping
2.3.3. Loss Functions
2.4. Statistical Significance
2.5. Calibration
Calibration Model
2.6. 3D Reconstruction
2.6.1. Planar Approach
2.6.2. Stereo Inference Approach
3. Experiments and Results
3.1. Synthetic Image/Label Generation by GANs
3.2. Semantic Segmentation
Qualitative Assessment
3.3. Calibration
3.4. 3D Reconstruction
3.4.1. Qualitative Evaluation
3.4.2. Quantitative Assessment
4. Discussion
- How can the problem of data inadequacy in training semantic segmentation networks to identify cracks be overcome? In addition to collecting a rich dataset containing various types of images, data augmentation techniques were used in this research. Our experiments showed that although GAN augmentation can be helpful in certain cases, the traditional augmentation techniques are more reliable, either boosting the performance of the semantic segmentation networks or, at worst, leaving it unchanged. The other approach that helped with data inadequacy was transferring knowledge from a different task by pre-training the network on a large dataset such as ImageNet. In future studies, domain adaptation should be considered.
- How can the class imbalance issue for crack segmentation be dealt with? The CCSS-DATA dataset is naturally imbalanced, as the number of crack pixels in each image is far lower than the number of background pixels, and adding images with no cracks intensifies the overall imbalance. Therefore, choosing the Jaccard loss function instead of the BCE loss ensured higher performance with the same data.
- What considerations should be made in order to develop a semantic segmentation approach that can detect cracks at different scales? In seeking to develop a scale-invariant model, adding smaller, zoomed-in image patches to the training dataset before resizing the training samples was helpful. This enabled the network to detect objects at multiple scales with the same amount of collected data.
- Do the semantic segmentation networks work well in outdoor environments with varied lighting conditions, ambiguous texture patterns, and high-resolution images covering a large field of view? The inference images of the collected dataset (CCSS-DATA) include many challenging scenes, such as shadows and background objects, and cover a large field of view. As shown in this study, the UNet and SegNet networks performed decently on these test scenes.
- Are the calibration parameters provided by the stereo camera’s manufacturer sufficient for reconstructing the cracks? If not, what is the best calibration model for the stereo camera? No, the calibration parameters provided by Stereolabs were not sufficient to accurately model the cracks. Therefore, modifications were applied to the calibration model to ensure that the distortion parameters and relative orientation (RO) parameters of the stereo camera were accurately estimated.
- What is the best crack reconstruction approach considering the semantic segmentation results and a well-calibrated stereo camera? With prior knowledge that the concrete surface is planar, the planar 3D modeling approach was the fastest algorithm. Otherwise, the matching approach was more accurate and did not require the concrete surface surrounding the crack to be planar.
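The effect of the Jaccard loss on imbalanced masks can be illustrated with a minimal sketch (plain Python on flat pixel lists; an illustration only, not the authors' implementation):

```python
# Soft Jaccard (IoU) loss for binary segmentation. pred holds predicted
# crack probabilities in [0, 1]; target holds 0/1 ground-truth labels.
def jaccard_loss(pred, target, eps=1e-7):
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t for p, t in zip(pred, target)) - intersection
    return 1.0 - (intersection + eps) / (union + eps)

# A perfect prediction drives the loss to 0 ...
print(jaccard_loss([1.0, 0.0, 0.0, 0.0], [1, 0, 0, 0]))  # → 0.0
# ... while predicting "all background" is maximally penalized even though
# only 1 of 4 pixels is a crack, which is what counters class imbalance.
print(round(jaccard_loss([0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0]), 3))  # → 1.0
```

Unlike BCE, which averages over pixels and can be kept low by a trivial all-background prediction, the Jaccard loss is computed only over the union of predicted and true foreground, so the rare crack class dominates it.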
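The patch-based scale augmentation described above can be sketched as follows (a hypothetical helper, not the paper's code): zoomed-in crops are harvested before the samples are down-sized, so the same cracks appear at more than one apparent scale in training.

```python
def extract_patches(image, patch_size, stride):
    """Crop square patches from a 2D image given as a list of pixel rows."""
    h, w = len(image), len(image[0])
    return [
        [row[x:x + patch_size] for row in image[y:y + patch_size]]
        for y in range(0, h - patch_size + 1, stride)
        for x in range(0, w - patch_size + 1, stride)
    ]

# A 4x4 image split into four non-overlapping 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = extract_patches(image, 2, 2)
print(len(patches), patches[0])  # → 4 [[0, 1], [4, 5]]
```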
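For the planar reconstruction itself, the core geometric step is intersecting each back-projected pixel ray with the fitted concrete plane. The sketch below assumes an ideal, distortion-free pinhole camera with hypothetical intrinsics; it illustrates the ray–plane intersection, not the authors' implementation.

```python
def intersect_ray_with_plane(u, v, fx, fy, cx, cy, n, d):
    """Return the 3D point (camera frame) where the ray through pixel
    (u, v) meets the plane n . X = d (d in the same units as the scene)."""
    # Pinhole back-projection: ray direction through the camera centre.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Solve n . (t * ray) = d for the scale factor t along the ray.
    t = d / (n[0] * ray[0] + n[1] * ray[1] + n[2] * ray[2])
    return tuple(t * r for r in ray)

# A fronto-parallel plane 1000 mm in front of the camera: the principal
# point back-projects straight ahead onto the plane.
print(intersect_ray_with_plane(320, 240, 1000.0, 1000.0, 320.0, 240.0,
                               (0.0, 0.0, 1.0), 1000.0))  # → (0.0, 0.0, 1000.0)
```

The stereo (matching) approach replaces the plane constraint with a second ray from the other camera, triangulating each crack point instead, which is why it still works on non-planar surfaces.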
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
CNN | Convolutional Neural Network
GAN | Generative Adversarial Network
CCSS-DATA | Concrete Cracks Semantic Segmentation Dataset
DCGAN | Deep Convolutional Generative Adversarial Network
ProGAN | Progressive Growing of Generative Adversarial Network
EOP | Exterior Orientation Parameter
IOP | Interior Orientation Parameter
UNet | A type of semantic segmentation neural network
SegNet | A type of semantic segmentation neural network
ResNet | Residual Network
IoU | Intersection over Union
F1 | F1 Score, an evaluation metric
P | Precision
R | Recall
BCE | Binary Cross-Entropy
3D | Three-Dimensional
StD | Standard Deviation
RMS | Root-Mean-Square
RGB | Red-Green-Blue
DSLR | Digital Single-Lens Reflex
SfM | Structure from Motion
Appendix A
Appendix A.1
Appendix A.2
Left Image in Stereo Pair | Marks | Ground Truth Distance (mm) | Matching Error (mm) | Planar Error (mm)
---|---|---|---|---
(image) | 5–6, 7–8 | 8.25, 9.45 | −1.26, −0.68 | −4.79, −2.31
(image) | 1–2, 3–5, 6–9 | 2.8, 92.6, 99.6 | 0.64, −0.9, 0.61 | 0.56, −0.24, 0.5
(image) | 2–3, 4–5, 6–8, 9–10, 11–12 | 20.2, 80.5, 94.3, 24.1, 94 | 0.51, 1.2, 1.32, 1.18, −0.53 | 0.53, 1.4, 1.31, 0.55, −0.77
(image) | 1–2, 5–6, 7–8 | 12.4, 117, 17.7 | 1.24, 0.05, 0.88 | 1.15, 0.33, 0.82
(image) | 1–2, 3–4, 5–6, 7–8, 9–10 | 5, 7.3, 87.2, 17.2, 92 | 0.72, 1.33, 0.72, −0.57, 1.11 | 1.1, 0.56, 0.31, −0.35, 0.04
(image) | 1–2, 3–4, 5–6 | 137.3, 9.1, 149.4 | −0.53, 0.92, −0.08 | −0.6, 0.88, −0.71
(image) | 1–2, 5–7 | 114.9, 123.9 | 0.35, 0.53 | 0.32, 0.59
(image) | 1–2, 4–5, 8–10, 9–10 | 10.8, 143.3, 137.8, 12.2 | 0.77, 0.73, 0.63, 0.29 | 0.8, 0.86, 0.9, 0.2
(image) | 3–4, 5–6 | 109.45, 110.6 | 1.17, 0.85 | −0.38, 1.29
(image) | 1–2, 5–6, 3–4 | 141.1, 95.5, 143 | 0.07, 1.07, 1.39 | 0.37, 0.54, 2.1
RMS | | | 0.86 | 1.23
StD (abs) | | | 0.3798 | 0.8755
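The RMS and StD(abs) summary rows can be reproduced from the matching-error column (values transcribed from the table above; StD is taken as the sample standard deviation of the absolute errors):

```python
import math

# Matching errors (mm) transcribed from the table above.
errors = [-1.26, -0.68, 0.64, -0.9, 0.61, 0.51, 1.2, 1.32, 1.18, -0.53,
          1.24, 0.05, 0.88, 0.72, 1.33, 0.72, -0.57, 1.11, -0.53, 0.92,
          -0.08, 0.35, 0.53, 0.77, 0.73, 0.63, 0.29, 1.17, 0.85, 0.07,
          1.07, 1.39]

# Root-mean-square of the signed errors.
rms = math.sqrt(sum(e * e for e in errors) / len(errors))
# Sample standard deviation (n - 1 denominator) of the absolute errors.
abs_errors = [abs(e) for e in errors]
mean_abs = sum(abs_errors) / len(abs_errors)
std_abs = math.sqrt(sum((a - mean_abs) ** 2 for a in abs_errors)
                    / (len(abs_errors) - 1))

print(round(rms, 2), round(std_abs, 4))  # → 0.86 0.3804
```

The StD computed here (0.3804) differs from the table's 0.3798 only in the fourth decimal, presumably because the published errors are rounded to two decimal places.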
Appendix A.3
Left Image in Stereo Pair | Left Segmented Crack | 3D Reconstructed Crack |
---|---|---|
Operation | Kernel Size | Stride | Feature Maps | Batch Norm. | Activation Func. | Pool Size
---|---|---|---|---|---|---
Conv | | | 64 | True | ReLU | N/A
Max pool | N/A | | N/A | N/A | N/A | 3
Residual block | | | | True | ReLU | N/A
Identity block | | | | True | ReLU | N/A
Residual block | | | | True | ReLU | N/A
Identity block | | | | True | ReLU | N/A
Residual block | | | | True | ReLU | N/A
Identity block | | | | True | ReLU | N/A
Residual block | | | | True | ReLU | N/A
Identity block | | | | True | ReLU | N/A
Average pool | N/A | | N/A | N/A | N/A | 7
Operation | Kernel Size | Stride | Feature Maps | Batch Norm. | Activation Func. | Up-Sample Size
---|---|---|---|---|---|---
Conv + upsample | | | 512 | True | None | 2
Conv + upsample | | | 256 | True | None | 2
Conv + upsample | | | 128 | True | None | 2
Conv | | | 64 | True | None | N/A
Conv | | | 1 | True | None | N/A
Conv | | | 1 | True | Sigmoid | N/A
Operation | Kernel Size | Stride | Feature Maps | Batch Norm. | Activation Func. | Pool Size
---|---|---|---|---|---|---
Conv * | | | 64 | True | ReLU | N/A
Max pool | N/A | | N/A | N/A | N/A | 3
Residual block | | | | True | ReLU | N/A
Identity block * | | | | True | ReLU | N/A
Residual block | | | | True | ReLU | N/A
Identity block * | | | | True | ReLU | N/A
Residual block | | | | True | ReLU | N/A
Identity block * | | | | True | ReLU | N/A
Residual block | | | | True | ReLU | N/A
Identity block | | | | True | ReLU | N/A
Operation | Kernel Size | Stride | Feature Maps | Batch Norm. | Activation Func. | Up-Sample Size
---|---|---|---|---|---|---
Upsample + * conv | | | 256 | True | ReLU | 2
Conv | | | 256 | True | ReLU | N/A
Upsample + * conv | | | 128 | True | ReLU | 2
Conv | | | 128 | True | ReLU | N/A
Upsample + * conv | | | 64 | True | ReLU | 2
Conv | | | 64 | True | ReLU | N/A
Upsample + * conv | | | 32 | True | ReLU | 2
Conv | | | 32 | True | ReLU | N/A
Upsample + conv | | | 16 | True | ReLU | 2
Conv | | | 16 | True | ReLU | N/A
Conv | | | 1 | True | None | N/A
Parameter Name | Value |
---|---|
Optimizer | Adam |
Learning rate | |
Batch size | 4 |
Epochs | Early stopping (val loss) |
Encoder weight initialization | ImageNet
Decoder weight initialization | Glorot uniform
Bias initialization | 0 |
Loss function | BCE/Jaccard |
Parameter Name | Value |
---|---|
Optimizer | Adam |
Learning rate | 0.001 |
Batch size | 256 |
Epochs | Early stopping (val loss) |
Weight initialization | Glorot uniform |
Bias initialization | 0 |
Loss function | BCE |
SegNet—BCE Loss—Training/Inference on Down-Sized Images

Scenarios | P (%) | R (%) | F1 (%) | IoU (%)
---|---|---|---|---
Baseline | 94.48 | 77.04 | 82.89 | 50.72 |
Traditional augmentation | 94.85 | 81.20 | 85.83 | 53.61 |
GAN augmentation | 93.33 | 79.68 | 84.21 | 52.97 |
GAN + Traditional augmentation | 92.92 | 83.32 | 86.39 | 54.99 |
UNet—BCE Loss—Training/Inference on Down-Sized Images

Scenarios | P (%) | R (%) | F1 (%) | IoU (%)
---|---|---|---|---
Baseline | 95.29 | 84.71 | 88.41 | 60.60 |
Traditional augmentation | 95.47 | 86.40 | 89.58 | 61.84 |
GAN augmentation | 95.98 | 84.71 | 88.76 | 60.86 |
GAN + Traditional augmentation | 95.31 | 86.34 | 89.51 | 61.38 |
Experiments | P (%) | R (%) | F1 (%) | IoU (%)
---|---|---|---|---
1. GAN + Traditional augmentation, Trained with non-patch, BCE loss, Tested on non-patch | 92.92 | 83.32 | 86.39 | 54.99
2. Traditional augmentation, Trained with non-patch, Jaccard loss, Tested on non-patch | 93.78 | 83.32 | 86.65 | 56.41
3. GAN + Traditional augmentation, Trained with non-patch, BCE loss, Tested on patches | 90.04 | 66.13 | 73.09 | 34.79
4. GAN augmentation, Trained with non-patch, Jaccard loss, Tested on patches | 86.94 | 62.94 | 69.11 | 32.29
5. Traditional augmentation, Trained with both patches and non-patch, BCE loss, Tested on patches | 92.00 | 78.80 | 82.84 | 48.46
6. Traditional augmentation, Trained with both patches and non-patch, Jaccard loss, Tested on patches | 91.81 | 79.36 | 83.13 | 50.91
7. Traditional augmentation, Trained with both patches and non-patch, BCE loss, Tested on non-patch | 94.94 | 79.70 | 84.47 | 53.08
8. Traditional augmentation, Trained with both patches and non-patch, Jaccard loss, Tested on non-patch | 95.00 | 81.24 | 85.45 | 56.17
Experiments | P (%) | R (%) | F1 (%) | IoU (%)
---|---|---|---|---
1. Traditional augmentation, Trained with non-patch, BCE loss, Tested on non-patch | 95.47 | 86.40 | 89.58 | 61.84
2. Traditional augmentation, Trained with non-patch, Jaccard loss, Tested on non-patch | 95.49 | 84.59 | 88.24 | 63.13
3. GAN + Traditional augmentation, Trained with non-patch, BCE loss, Tested on patches | 92.11 | 78.78 | 82.82 | 45.32
4. GAN + Traditional augmentation, Trained with non-patch, Jaccard loss, Tested on patches | 92.07 | 69.84 | 76.31 | 42.03
5. GAN * + Traditional augmentation, Trained with both patches and non-patch, BCE loss, Tested on patches | 94.68 | 87.41 | 89.57 | 58.01
6. Traditional augmentation, Trained with both patches and non-patch, Jaccard loss, Tested on patches | 95.29 | 86.09 | 88.92 | 59.19
7. GAN * + Traditional augmentation, Trained with both patches and non-patch, BCE loss, Tested on non-patch | 96.45 | 84.66 | 88.24 | 61.50
8. Traditional augmentation, Trained with both patches and non-patch, Jaccard loss, Tested on non-patch | 96.34 | 85.17 | 88.45 | 63.32
Performance Measures (mm) | Planar | Matching |
---|---|---|
RMS Reconstruction Error | 1.23 | 0.86 |
StD of Absolute Reconstruction Error | 0.8755 | 0.3798 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Shokri, P.; Shahbazi, M.; Nielsen, J. Semantic Segmentation and 3D Reconstruction of Concrete Cracks. Remote Sens. 2022, 14, 5793. https://doi.org/10.3390/rs14225793