1. Introduction
For historical buildings and objects subject to various forms of deterioration, cracks represent the most critical type of damage [1,2,3]. However, the task of inspecting the structural condition of historical surfaces is carried out manually in most cases, which is a costly and subjective process. Thus, developing automated and efficient structural damage identification techniques has attracted the attention of many researchers.
Crack detection and analysis are of great significance, since cracks are a key indicator of a building’s safety and durability [4]. Most structures, such as bridges, roads, and buildings, become susceptible to cracking and failure due to periodic loading, weathering, or stress accumulation [5]. Consequently, continuous structural health monitoring is a very laborious task for humans, whereas it is well suited to automated systems based on computer vision. Thus, the demand for automatic crack detection systems has rapidly increased, despite the challenges they face due to complex real environments, varying illumination conditions, and the irregular shapes of cracks [6].
Pixel-level crack detection in historical building images is a challenging task because of many sources of uncertainty, such as shadows, ornaments, carvings, separators, crack-like artifacts, and wood patterns [6]. In image-based crack detection, the user provides photographs as input and receives the detected cracks as output, without the need for any manual intervention. The task of semantic segmentation is to assign a class label to each pixel of an image. Performing this task manually is time-consuming, error-prone, and resource-intensive, as some cracks may be only one pixel wide. Pixel-level crack detection can be used to perform quantitative analysis of crack width and length in order to assess crack severity. In practice, such assessment is used to diagnose the integrity of civil infrastructure, such as bridges, pavements, and buildings, using image sensors. Historical buildings exhibit various surface structures, dust, ornaments, carvings, wood patterns, separators, corrosion on walls, bird droppings, fungi, detachment, color degradation, and other weathering damages that hinder the recognition of the actual abnormality, namely cracks. As shown in Figure 1, surface images include corrosion and shadows, as well as different appearances and textures. In Figure 1a,b, corrosion and wood patterns are shown, respectively. On the other hand, Figure 1c,d,h contain different texture patterns, while carvings, separators, and bird droppings are shown in Figure 1e,f,g, respectively.
This research focuses on utilizing deep learned features for the pixel-level crack detection/segmentation problem. Cracks usually appear on structures as the first visible sign of structural damage. Over the last few decades, in response to the crack segmentation problem, various traditional computer vision-based segmentation approaches, including thresholding, the modified Tubularity Flow Field (TuFF) algorithm, filtering, tensor voting, super-pixel methods, morphological approaches, clustering, and skeletonization techniques, have been introduced in the literature to perform pixel-level crack detection in images of engineering structures. Moreover, several deep learning-based crack segmentation methods, such as DeepCrack, SegNet, SDDNet, and U-Net, have been developed to efficiently detect surface cracks [7,8,9,10,11]. The significance of this paper lies in investigating the impact of pixel-level deep crack segmentation through adopting several variants of the U-Net deep learning model for crack detection on historical surfaces.
The main contributions of this paper are summarized as follows:
Developing an automated pixel-level detection approach through assessing various U-Net deep learning architectures for handling the problem of deep crack segmentation on historical surfaces.
Investigating two loss functions, namely Dice and cross-entropy (CE), in addition to a third hybrid loss function, for training and enhancing the performance of the proposed approach.
Constructing an expert-annotated primary dataset of crack images on historical surfaces, collected over two years from a historical location in Historic Cairo, Egypt.
Applying a contrast stretching method for handling the impacts of different environmental conditions on images of historical surfaces.
Building an extra semi-supervised pixel-level map generation module for annotating historical surface images, to avoid the cost of pixel-by-pixel manual annotation.
The remainder of this paper is organized as follows.
Section 2 introduces the state-of-the-art literature related to the pixel-level crack detection problem.
Section 3 describes the different phases of the proposed pixel-level crack detection approach, namely the data acquisition, data preparation, pixel-level map generation, and pixel-level crack segmentation phases.
Section 4 presents details of the crack image datasets and discusses the obtained experimental results.
Section 5 discusses the research conclusions and addresses recommendations for future research.
2. State-of-the-Art Studies for Crack Segmentation
This section reviews the most relevant literature addressing pixel-level crack detection techniques.
In [11], Dais et al. proposed automatic crack classification and segmentation methods based on a Convolutional Neural Network (CNN) and a transfer learning scheme for masonry surfaces. The performance of different pre-trained deep learning models was investigated for the crack detection task. Then, both Feature Pyramid Networks (FPN) and U-Net, with the best-performing model as the encoder, were used for pixel-level crack detection. Meanwhile, in [7], Y. Liu et al. proposed an end-to-end pixel-wise crack segmentation method based on a deep hierarchical CNN capable of learning and effectively aggregating multi-scale and multi-level features from the lower convolutional layers to the higher ones. Moreover, a loss function was specially designed to alleviate the problem of imbalanced data distribution. Another pixel-level, end-to-end, trainable deep CNN for road crack detection, able to handle strong non-uniformity, complex topology, and strong noise in crack images, was proposed in [12] by Song et al., using a multi-scale dilated convolution module to obtain richer crack texture information. Furthermore, Dung in [13] proposed an end-to-end deep fully convolutional network (FCN)-based crack detection method for semantic segmentation of concrete crack images.
Moreover, in [14], Zhang et al. presented a novel crack segmentation method based on a context-aware deep convolutional network to segment cracks effectively in structural infrastructure under different conditions. Meanwhile, in [15], Kang et al. proposed a hybrid crack detection, localization, and measurement method based on a Faster R-CNN, a modified Tubularity Flow Field (TuFF) algorithm, and a modified Distance Transform Method (DTM) for pixel-level crack segmentation.
From another perspective, J. Liu et al. in [16] proposed a U-Net-based crack image segmentation network for pavement cracks with a one-cycle training schedule for speeding up the convergence. The proposed segmentation network replaced the encoder part of the U-Net neural network with a pre-trained ResNet-34 neural network.
Ghorbanzadeh et al. in [17] proposed a framework for landslide detection that integrated the ResU-Net deep learning model with rule-based object-based image analysis (OBIA) to improve the detection rate of the ResU-Net model. When evaluated on the Sentinel-2 imagery dataset, the integrated framework improved the F1 score by 8% and 22% over the F1 scores achieved by the standalone ResU-Net and OBIA approaches, respectively.
In [18], Jia et al. proposed an integrated pixel-level crack detection and quantification approach based on DeepLabv3+ to automatically detect and quantify cracks in asphalt pavement images for operation and maintenance. In addition, they adjusted the weights of the different classes to address the problem of class imbalance and used skeletonization and fast parallel thinning (FPT) techniques to measure the length, width, area, and ratio of cracks. In [10], Chen et al. presented an encoder–decoder model based on a modified SegNet architecture, which handled arbitrary-sized images for pixel-level crack detection on concrete pavement, bridge decks, and asphalt pavement, aiming to improve performance by merging the features of low-resolution input samples.
Li et al. in [19] proposed a pixel-level crack detection method based on a Deep Local Pattern Predictor (DLPP) for concrete bridge images, aiming to handle the limitations resulting from noise and clutter in the environment. Finally, in [20], Guzman-Torres et al. proposed an open-source crack detection platform based on an improved VGG-16 transfer learning model, capable of detecting multi-scale cracks in concrete structures. Moreover, the impact of various DL architectures, regularization techniques, network depths, and transfer learning methods was examined. The proposed approach achieved an accuracy of 99.5% and an F1 score of 100%.
In general, based on the surveyed related literature, several aspects related to crack detection have been addressed. However, there is a very limited number of studies addressing the problem of pixel-level crack detection in the field of historical heritage. Consequently, the approach proposed in this paper investigates the performance of adopting pixel-level deep crack segmentation models on historical surfaces using variants of the U-Net deep learning model in order to address segmentation problems with limited amounts of data.
Table 1 summarizes the surveyed state-of-the-art studies on deep learning-based crack segmentation.
3. The Proposed Deep Crack Segmentation Approach
As stated in the previous sections, this paper proposes an end-to-end U²-Net-based pixel-level crack segmentation approach aiming to generate a pixel-level crack map. It consists of four phases, as shown in Figure 2, namely: (1) the data acquisition phase, which is responsible for collecting surface crack images of historical buildings; (2) the data preparation phase, which is responsible for preparing and preprocessing the images for training; (3) the pixel-level map generation phase, which is responsible for annotating the primary dataset; and (4) the pixel-level crack segmentation phase, which investigates three different U-Net architectures, namely Deep ResU-Net, ResU-Net++, and U²-Net.
3.1. Data Acquisition Phase
In this phase, real data of surface cracks are collected from an ancient building suffering from damage problems. One of these problems is the presence of cracks, defined here as damage distinguishable by the human eye. A primary dataset of 40 raw images of the building surfaces was captured using a digital Canon camera (EOS REBEL T3i) over two years (2018 and 2019), before the completion of the mosque’s restoration and rehabilitation project in 2020. The image collection location was the mosque (Masjed) of the Amir Altunbugha Al-Maridani, located in Sekat Al Werdani, “Bab-Al-Wazir” street, in the El-Darb El-Ahmar district of Historic Cairo, Egypt, at coordinates 30.03974 N, 31.25922 E. The mosque was built during the era of the Mamluk Sultanate of Cairo, Egypt, in 1339–40 CE. It is distinguished by its octagonal minaret and large dome and is considered one of the most distinctively decorated historical buildings.
3.2. Data Preparation Phase
In this phase, the image data are prepared and preprocessed for the next pixel-level crack segmentation phase. Samples of cracks in the primary dataset are shown in Figure 3. The data preparation phase comprises multiple steps:
- (1) Image bank generation: raw images are divided into smaller sub-images of fixed pixel resolution;
- (2) Filtering: only sub-images with cracks are considered, while intact ones are ignored;
- (3) Augmentation: several spatial transformations are applied systematically, as follows [21] (a code sketch is given after this list):
- 1. Flipping images vertically;
- 2. Flipping images horizontally;
- 3. Flipping images vertically, then horizontally;
- 4. Rotating images by 90°, then by −90°, individually;
- 5. Combining the output images of the previous steps with the original images to establish a new dataset.
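A minimal sketch of the augmentation step is given below, assuming the sub-images are NumPy arrays; the function name and return structure are illustrative only.

```python
import numpy as np

def augment(sub_image: np.ndarray) -> list:
    """Return the original sub-image together with its spatial variants listed above."""
    vertical = np.flipud(sub_image)                      # 1. vertical flip
    horizontal = np.fliplr(sub_image)                    # 2. horizontal flip
    both = np.fliplr(np.flipud(sub_image))               # 3. vertical then horizontal flip
    rot_pos = np.rot90(sub_image, k=1)                   # 4. rotation by 90 degrees
    rot_neg = np.rot90(sub_image, k=-1)                  # 4. rotation by -90 degrees
    return [sub_image, vertical, horizontal, both, rot_pos, rot_neg]  # 5. combined set
```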
3.3. Semi-Supervised Pixel-Level Map Generation
This phase aims to build up the training dataset for crack segmentation. Pixel-level annotation is extremely expensive and labor-intensive, especially for critical domains such as structural damage, which require expert dedication. This module only requires a bounding box around the object; GrabCut [22] then works iteratively to generate the mask instead of requiring pixel-by-pixel marking. This saves time, effort, and cost, as shown in Figure 4.
GrabCut is an iterative segmentation method that extends the GraphCut method. Gaussian mixture models (GMMs) are used for color images in place of the histogram model used for monochrome images. The GMMs are utilized to learn the color distributions of the background and the foreground by assigning each pixel a probability of belonging to a group of other pixels [22,23].
Given a color image I, let us consider the array z = (z_1, ..., z_N) of N pixels, where each z_n is a value in RGB color space. The segmentation of the image is presented as an array of opacity values α = (α_1, ..., α_N), one per pixel, with α_n ∈ {0, 1}, where 0 and 1 denote background and foreground, respectively. Two GMMs for the background and the foreground are taken to be full-covariance Gaussian mixtures with K components each. In the optimization framework, an extra vector k = (k_1, ..., k_N) is introduced, with k_n ∈ {1, ..., K}, assigning each pixel a unique GMM component, taken either from the foreground or the background model according to α_n, in order to deal with the GMMs more easily [22,23]. The Gibbs energy function E defined by GrabCut for segmentation, whose minimum value should agree with a good segmentation, is computed as follows:
E(α, k, θ, z) = U(α, k, θ, z) + V(α, z).
The data term U evaluates the fit of the opacity distribution α to the data z, given the histogram model θ, and is defined to be
U(α, k, θ, z) = Σ_n D(α_n, k_n, θ, z_n),
where D(α_n, k_n, θ, z_n) = −log p(z_n | α_n, k_n, θ) − log π(α_n, k_n), p(·) is a Gaussian probability distribution, and π(·) are mixture weighting coefficients, so that (up to a constant)
D(α_n, k_n, θ, z_n) = −log π(α_n, k_n) + (1/2) log det Σ(α_n, k_n) + (1/2) [z_n − μ(α_n, k_n)]ᵀ Σ(α_n, k_n)⁻¹ [z_n − μ(α_n, k_n)].
Therefore, the parameters of the model are
θ = { π(α, k), μ(α, k), Σ(α, k); α = 0, 1; k = 1, ..., K },
where π, μ, and Σ are the weights, means, and covariances of the 2K Gaussian components for both background and foreground distributions. Meanwhile, the smoothness term V is computed as follows [22,23]:
V(α, z) = γ Σ_{(m,n)∈C} [α_n ≠ α_m] exp(−β ‖z_m − z_n‖²),
where C denotes the set of pairs of neighboring pixels.
The original GrabCut segmentation works as shown in Algorithm 1 [22,23].
Algorithm 1 GrabCut pixel-level map generation
- 1: Obtain a bounding box b around the region of interest.
- 2: Initialize the trimap T by supplying only the background region T_B. The foreground region is set to T_F = ∅; the unknown region T_U is the complement of the background.
- 3: Set α_n = 0 for n ∈ T_B and α_n = 1 for n ∈ T_U as the initial segmentation.
- 4: Initialize the GMMs for the background and foreground from the sets α_n = 0 and α_n = 1, respectively.
- 5: Assign GMM components to pixels: for each n in T_U, k_n := arg min_{k_n} D_n(α_n, k_n, θ, z_n).
- 6: Learn the GMM parameters from the data z: θ := arg min_θ U(α, k, θ, z).
- 7: Estimate the segmentation using min-cut: min over {α_n : n ∈ T_U} and k of E(α, k, θ, z).
- 8: Repeat steps 5, 6, and 7 until convergence.
- 9: Apply the border matting method.
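The same bounding-box-driven procedure can be reproduced with OpenCV’s built-in GrabCut implementation, as sketched below; the file name and box coordinates in the usage comment are placeholders, not values used in this work.

```python
import cv2
import numpy as np

def grabcut_mask(image_bgr: np.ndarray, box: tuple, iterations: int = 5) -> np.ndarray:
    """Return a binary foreground mask for a bounding box (x, y, w, h) using iterative GrabCut."""
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    bgd_model = np.zeros((1, 65), dtype=np.float64)  # background GMM parameters (internal buffer)
    fgd_model = np.zeros((1, 65), dtype=np.float64)  # foreground GMM parameters (internal buffer)
    cv2.grabCut(image_bgr, mask, box, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)
    # Sure and probable foreground labels become 1; everything else becomes 0.
    binary = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    return binary.astype(np.uint8)

# Example usage with a hypothetical crack patch and box:
# img = cv2.imread("crack_patch.png")
# crack_map = grabcut_mask(img, box=(10, 10, 200, 120))
```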
3.4. Segmentation-Based Variant U-Net Models
U-Net is an encoder–decoder convolutional network that utilizes skip connections for preserving features at multi-resolution, used to solve end-to-end semantic segmentation tasks. Skip connections are a critical component of conventional deep neural networks (DNNs) such as DenseNet, ResNet, ResNeXt, and WideResNet. Skip connections build a short-cut from shallow layers to deep layers by directly connecting the input of a convolutional block/the residual module to its output. Their task throughout the network is to speed up the learning process, preserve low-level features, and avoid the problem of vanishing gradients in deep models [24,25,26].
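As a concrete illustration of an encoder–decoder with a concatenation skip connection, a toy single-level U-Net is sketched below in PyTorch; it is purely illustrative and is not one of the architectures evaluated in this paper.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """One-level U-Net: the encoder feature map is concatenated into the decoder (skip connection)."""
    def __init__(self, in_ch: int = 3, base: int = 16):
        super().__init__()
        self.enc = conv_block(in_ch, base)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base, base * 2)
        self.up = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, 1, kernel_size=1)  # single-channel crack probability map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e = self.enc(x)                               # encoder features (skip source)
        b = self.bottleneck(self.down(e))             # lower-resolution features
        d = self.dec(torch.cat([self.up(b), e], 1))   # skip connection: concatenate encoder features
        return torch.sigmoid(self.head(d))
```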
In general, U-Net comprises two parts, namely an encoder and a decoder. The encoder takes an image as input and extracts features at multiple scales and abstraction levels, yielding a multi-level, multi-resolution feature representation. It is a simple down-sampling path consisting of stacked convolutional blocks with a max-pooling operator used for dimensionality reduction, and it can be replaced by a deeper network such as VGG or ResNet. Meanwhile, the decoder controls the reconstruction of the probability segmentation maps; its task is to take the feature representation and classify all pixels at the original image resolution in parallel. U-Net has the advantage of performing well on segmentation problems with limited amounts of data [24,25,26]. In this phase, three different U-Net-based models along with two different loss functions, namely Dice and cross-entropy (CE), are utilized for pixel-level crack detection, as shown in Equations (9) and (10). Moreover, benchmark datasets covering various crack types and severities are utilized for training the proposed models.
where p is a predicted map and y is its corresponding ground truth for class j, N and C are the number of pixels and the number of classes (excluding the background), respectively, and ε is a smoothing constant used to avoid division by zero.
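For concreteness, the sketch below shows one common formulation of the Dice and CE losses for a binary crack probability map in PyTorch, together with a simple additive hybrid; the exact equations and the hybrid weighting used in this paper are not reproduced here, so this is an assumption-laden illustration only.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss on a binary probability map; eps is the smoothing constant."""
    pred, target = pred.flatten(), target.flatten()
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def ce_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on the predicted probability map (values in [0, 1])."""
    return torch.nn.functional.binary_cross_entropy(pred, target)

def hybrid_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """One plausible hybrid: the unweighted sum of the Dice and CE terms (an assumption)."""
    return dice_loss(pred, target) + ce_loss(pred, target)
```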
Moreover, contrast stretching is used as a preprocessing step at testing time to handle the problems arising from different environmental conditions. In addition, the historical building’s concrete has various surface structures, dust, ornaments, carvings, wood patterns, separators, corrosion on walls, bird droppings, fungi, detachment, color degradation, and other weathering damages that hinder the recognition of the actual abnormality, mainly cracks.
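A minimal percentile-based contrast-stretching sketch is given below; the paper does not state the exact stretch limits, so the 2nd/98th percentile bounds are assumptions.

```python
import numpy as np

def contrast_stretch(image: np.ndarray, low_pct: float = 2.0, high_pct: float = 98.0) -> np.ndarray:
    """Linearly rescale intensities so the chosen percentiles map to the full 0-255 range."""
    lo, hi = np.percentile(image, (low_pct, high_pct))
    stretched = np.clip((image.astype(np.float32) - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    return (stretched * 255).astype(np.uint8)
```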
3.4.1. Deep Residual U-Net (ResU-Net)
Deep ResU-Net is a U-Net-based segmentation neural network that integrates the strengths of U-Net and residual neural networks, as shown in Figure 5. This integration yields two benefits: (1) ease of network training due to the residual units; (2) information propagation without degradation, due to the skip connections both inside a residual unit and between the low and high levels of the network. As a result, it is possible to design a neural network with far fewer parameters while achieving similar or even better performance on semantic segmentation tasks [25,26,27].
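For illustration, a sketch of a residual unit of the kind described above (two convolution blocks plus an identity/shortcut mapping) is given below in PyTorch; the pre-activation ordering and the 1×1 shortcut convolution are assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two BN-ReLU-Conv blocks with a 1x1 shortcut, as used in ResU-Net-style encoders/decoders."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + self.shortcut(x)  # residual addition fuses input and transformed features
```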
3.4.2. ResU-Net++
The ResU-Net++ architecture is a Deep ResU-Net-based semantic segmentation deep neural network that combines the strengths of residual blocks, Atrous Spatial Pyramid Pooling (ASPP), squeeze-and-excitation blocks, and attention blocks, as shown in Figure 6. As stated in [28], attention maps improve image classification by highlighting relevant information and suppressing misleading information such as the background. The attention mechanism is widely employed in Natural Language Processing (NLP) and in semantic segmentation tasks such as pixel-wise prediction. It attends to a subset of its input feature map, determining which parts of the network require more attention. The advantages of the attention mechanism are that it reduces the computational cost of encoding information, is simple, and improves the results. ASPP is a module able to precisely capture multi-scale information and add more global contextual information for more robust and accurate classification. The ASPP module consists of several parallel atrous convolutions with different dilation rates, each probing its input feature map at a specific effective field-of-view to extract information precisely at multiple scales. Moreover, global average pooling adds further global contextual information [26,27].
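A compact ASPP sketch in PyTorch is shown below, illustrating the parallel atrous branches and the image-level pooling branch; the dilation rates (6, 12, 18) are common defaults and are assumptions here, not the rates used in ResU-Net++.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel atrous convolutions at several dilation rates plus image-level pooling."""
    def __init__(self, in_ch: int, out_ch: int, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1)] +
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r) for r in rates]
        )
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]          # multi-rate atrous branches
        pooled = F.interpolate(self.pool(x), size=(h, w), mode="bilinear", align_corners=False)
        feats.append(pooled)                                     # global context branch
        return self.project(torch.cat(feats, dim=1))             # fuse multi-scale context
```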
Figure 5. Architecture of Deep Residual U-Net (ResU-Net).
Moreover, the squeeze-and-excitation (SE) block is an architectural unit constructed to enhance the network’s representational power by enabling dynamic recalibration of the channel-wise features. It learns to use global information to automatically determine the importance of each feature channel, then selectively emphasizes useful features and suppresses unproductive ones. Meanwhile, the residual block comprises multiple combinations of convolutional layers, batch normalization, ReLU activation, and identity skip connections; it aims to ease the training process and improve the network’s representational ability [29].
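A minimal squeeze-and-excitation block in PyTorch is sketched below to illustrate the channel recalibration described above; the reduction ratio of 16 is a conventional choice and not necessarily the one used in ResU-Net++.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation: global average pooling followed by a two-layer gating MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # per-channel weights in [0, 1]
        return x * w                                      # recalibrate each feature channel
```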
3.4.3. U²-Net
U²-Net is a semantic segmentation neural network designed for salient object detection (SOD). It is a nested U-structure with exactly two levels, as shown in Figure 7. A novel ReSidual U-block (RSU) was inspired by the structure proposed in [29] to collect intra-stage multi-scale features. The structure of the RSU block is shown in Figure 8, where L represents the number of layers in the encoder, C_in and C_out are the numbers of input and output channels, respectively, and M is the number of channels in the internal layers of the block.
Briefly, the RSU block comprises three main components:
- 1. An input convolution layer, responsible for transforming the input feature map into an intermediate one;
- 2. A U-Net-like encoder–decoder of height L, responsible for learning how to extract and encode multi-scale contextual information using the intermediate feature map as input;
- 3. A residual connection, responsible for fusing the local features and the multi-scale features through a summation operator.
Thus, U²-Net has the advantage of being able to capture more contextual information at multiple scales and to increase the depth of the whole architecture without notably increasing the computational cost, thanks to the RSU block [29].
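To make the three components concrete, a simplified RSU-style block is sketched below in PyTorch; the layer counts, dilation settings, and channel sizes are illustrative assumptions and do not reproduce the exact RSU design of [29].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch: int, out_ch: int, dilation: int = 1) -> nn.Sequential:
    """3x3 convolution + batch norm + ReLU, the basic layer inside an RSU-style block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class SimpleRSU(nn.Module):
    """Simplified ReSidual U-block: input conv, a small U-shaped encoder-decoder of a given height,
    and a residual summation of local and multi-scale features."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, height: int = 4):
        super().__init__()
        self.height = height
        self.conv_in = conv_bn_relu(in_ch, out_ch)                      # local feature map
        self.enc = nn.ModuleList([conv_bn_relu(out_ch, mid_ch)] +
                                 [conv_bn_relu(mid_ch, mid_ch) for _ in range(height - 1)])
        self.bottom = conv_bn_relu(mid_ch, mid_ch, dilation=2)          # dilated bottom layer
        self.dec = nn.ModuleList([conv_bn_relu(mid_ch * 2, mid_ch) for _ in range(height - 1)] +
                                 [conv_bn_relu(mid_ch * 2, out_ch)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.conv_in(x)
        feats, h = [], local
        for i, layer in enumerate(self.enc):                            # encoder path
            h = layer(h)
            feats.append(h)
            if i < self.height - 1:
                h = F.max_pool2d(h, 2)
        h = self.bottom(h)
        for i, layer in enumerate(self.dec):                            # decoder path with skips
            skip = feats[-(i + 1)]
            if h.shape[-2:] != skip.shape[-2:]:
                h = F.interpolate(h, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            h = layer(torch.cat([h, skip], dim=1))
        return h + local                                                # residual fusion of features
```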
3.4.4. Utilized U-Net-Based Models
Deep ResU-Net model: In this paper, a nine-level architecture of deep ResU-Net is utilized along with two different loss functions for pixel-by-pixel crack detection. All levels are built with residual blocks comprising two convolution blocks and an identity mapping connecting both the input and output of the block.
ResU-Net++ model: The original architecture of ResU-Net++ is utilized along with two different loss functions for pixel-by-pixel crack detection with the filter numbers [16, 32, 64, 128, 256]. The filter number was selected based on experiments.
U²-Net model: The original architecture of U²-Net is utilized along with two different loss functions for pixel-by-pixel crack detection.
4. Experimental Results
In this study, simulation experiments were performed on a Kaggle kernel with an NVIDIA TESLA P100 GPU and 16 GB of memory. The proposed approach was implemented with TensorFlow and PyTorch in a Python environment on the Linux platform. To evaluate the proposed models, six performance metrics, namely accuracy, precision, recall, Dice coincidence index (Sørensen similarity coefficient), Jaccard coefficient, and IoU, were calculated according to Equations (11) to (16), respectively [30].
The TP, FP, TN, and FN terms are defined as follows:
True Positive (TP): the pixel is a crack and is classified as a crack;
False Positive (FP): the pixel is intact and is classified as a crack;
True Negative (TN): the pixel is intact and is classified as intact;
False Negative (FN): the pixel is a crack and is classified as intact.
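Based on these definitions, the metrics can be computed from binary prediction and ground-truth maps as in the sketch below; these are the standard pixel-level formulations and may differ in minor details from Equations (11) to (16).

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> dict:
    """Compute pixel-level metrics from binary prediction and ground-truth crack maps."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # crack predicted as crack
    fp = np.logical_and(pred, ~gt).sum()     # intact predicted as crack
    tn = np.logical_and(~pred, ~gt).sum()    # intact predicted as intact
    fn = np.logical_and(~pred, gt).sum()     # crack predicted as intact
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),    # the Jaccard coefficient coincides with IoU for binary maps
    }
```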
4.1. Dataset Description
This subsection describes the different characteristics of the utilized datasets in this study, as follows:
- 1. The open crack detection dataset [7] is a benchmark dataset consisting of a total of 537 images with manual annotation maps. It is divided into 300 training images and 237 testing images.
- 2. The CrackForest dataset [31] is another benchmark crack detection dataset, consisting of a total of 118 images.
- 3. The primary dataset consists of a total of 263 crack images of historical surfaces with ornaments, carvings, wood patterns, separators, and corrosion on walls.
All models were trained on a mixed dataset, consisting of the augmented CrackForest [31] dataset and the open crack [7] training dataset, using rotation and flipping, with a total of 2508 crack images.
Moreover, the proposed approach was subsequently tested on a mixed dataset of two different datasets (the open crack detection testing dataset and the primary dataset of the historical building) using 500 crack images.
4.2. Results and Discussion
All models were trained for 50 epochs, with a batch size of 16 for the Deep ResU-Net and ResU-Net++ models and 12 for the U²-Net model. Figure 9 shows samples of the training dataset. Moreover, samples of the GrabCut results are shown in Figure 10.
As shown in Table 2, which presents the performance metrics of the Deep ResU-Net model, the best model is the one trained using cross-entropy (CE) as a loss function and tested on images enhanced using contrast stretching. It is noticed that using contrast stretching as a preprocessing step in the testing phase enhances the performance by increasing the mean IoU (mIoU) from 67.069% to 74.959% for the model that uses the CE loss. In general, the performance of the Deep ResU-Net model is improved when tested on images preprocessed by contrast stretching.
Table 3 shows the performance metrics of the ResU-Net++ model. From Table 3, it is noticed that the performance of the ResU-Net++ model is improved when tested on images preprocessed by contrast stretching. Table 4 and Table 5 show the performance metrics of the ResU-Net++ model on each testing dataset individually, with and without contrast stretching.
Table 6 shows the performance metrics of the U²-Net model. As noticed from Table 6, the best model is the one trained using the Dice coefficient as a loss function, achieving an mIoU and Dice score of 80.922% and 75.523%, respectively. Using contrast stretching as a preprocessing step in the testing phase has only a slight effect on the performance and does not enhance it.
At this point, the U²-Net architecture utilized for the crack segmentation task is highlighted, since it obtained the best results, as concluded from Table 2, Table 3 and Table 6. Table 7 shows the performance metrics of the U²-Net model on each testing dataset individually.
It is concluded from Table 7 that the proposed crack segmentation approach based on the U²-Net model outperforms the DeepCrack [7] approach without any preprocessing or post-processing, achieving an mIoU of 83.87%. Moreover, the proposed approach performs well on the primary dataset, even though it was excluded from training. Figure 11 shows some sample results of U²-Net on the primary dataset.
4.3. Comparative Analysis
As shown in Figure 12a, Figure 13a,c,d and Figure 14d,e, ResU-Net++ without contrast stretching has difficulty in dealing with blurred images, where it cannot detect all or most of the crack pixels.
On the other hand, it is noticed that using contrast stretching as a preprocessing step during the testing phase helps improve the performance of ResU-Net++ when dealing with blurred images, increasing the Dice score from 37.98% to 61.42%. Although contrast stretching enabled ResU-Net++ to detect a higher percentage of cracks on historical surfaces, it also caused shadows to be mistaken for cracks. Conclusively, ResU-Net++, both with and without contrast stretching, lacks the ability to handle images with shadows.
The main advantage of using the U²-Net model for crack segmentation is its capability of detecting tiny cracks and unlabeled cracks, as in Figure 15b,c. Also, the U²-Net model outperforms the others in detecting cracks on edges, as in Figure 15b, and cracks in blurred images, as shown in Figure 12a and Figure 13a,c,d. Moreover, the model is capable of detecting cracks in images with varying lighting conditions that result in shadows, as in Figure 12d.
Conversely, the main limitation of crack segmentation based on the U²-Net model lies in dealing with deep circular patterns with shadows, as shown in Figure 12b, and patterns with multi-level depth, as in Figure 15b. This is due to the salient-object-detection nature of the U²-Net model: when moving between two or more different depths, a part of the boundary between the depths is mistakenly marked as a crack.
5. Conclusions and Future Work
This paper presents a study highlighting the effectiveness of implementing pixel-level deep crack segmentation models on historical surfaces. The significance of the proposed approach is revealed through adopting variants of the U-Net deep learning model for handling segmentation problems with limited amounts of data.
Thus, in this paper, the proposed automated approach examined three U-Net-based deep learning models, namely Deep ResU-Net, ResU-Net++, and U²-Net, for pixel-level crack segmentation. Decisively, it was observed that the U²-Net semantic segmentation model outperformed the other tested U-Net models thanks to its ability to capture more contextual information at multiple scales and to increase the whole architecture’s depth at a reasonable computational cost.
Moreover, in order to tackle the challenge of limited historical surface data in the literature, a primary dataset is generated, for testing the proposed approach, containing crack images of historical surfaces with various crack types, sizes, and severity on several complex backgrounds of ornaments, carvings, wood patterns, separators, and corrosion. To the best of the authors’ knowledge, this research is the first to utilize the U-Net deep learning model for pixel-level deep crack segmentation on images of historical surfaces.
It is observed that the performance of the proposed approach using the U²-Net model deteriorates, with the Dice score, mIoU, and Jaccard measures declining from 80.52% to 71.09%, from 83.78% to 78.38%, and from 67.392% to 55.147%, respectively, when tested on the historical surfaces dataset compared to the open crack dataset [7]. The U²-Net model is known for detecting salient objects, which in some cases results in boundaries being mistakenly detected as cracks. However, despite this observation, it still surpasses the performance reported in the literature, considering the previously stated challenges of crack detection on historical surfaces. Furthermore, several state-of-the-art variant U-Net models were examined for their efficacy in classifying crack images from historical surfaces at the pixel level, with the highest obtained Dice score being 71.09%. In particular, for historical surfaces, when ResU-Net++ is used with contrast stretching, the Dice score increases from 37.98% to 61.42%, which demonstrates the beneficial effect of using contrast stretching as a preprocessing step during the testing phase.
The most significant findings of the proposed approach surpass those of comparable approaches in the literature for crack segmentation on concrete or asphalt surfaces. However, several challenges remain to be addressed in future research, such as running more experiments on other types of surfaces. Furthermore, although the proposed deep segmentation approach achieved promising results for crack defects on historical surfaces, generating more annotated images covering other defect types, such as corrosion, bird droppings, and shadows, should be considered in future work in order to enrich the currently available dataset. Moreover, additional experiments considering images under low lighting and/or other environmental conditions should reveal the sensitivity of the proposed approach to image quality.