1. Introduction
The majority of masonry structures are categorized as “Registered Structures” and represent a significant part of our “Cultural Heritage”, thereby necessitating their preservation and appreciation [
1]. Unfortunately, numerous historical buildings in our nation are in a state of disrepair. Earthquakes, ground-related issues, fires, and degradation resulting from environmental factors have notably impacted both the structural integrity and aesthetics of these constructions. The occurrence of cracks in many historical buildings is often attributable to irregularities and disconnections within their foundations, potentially leading to partial or complete collapse. Rather than resorting to “demolition and reconstruction” approaches, there is a pressing requirement for “repair and maintenance” strategies when dealing with historical buildings. Presently, conventional techniques such as visual inspections and manual surveys are employed to assess the existing masonry structures for signs of damage, including cracking and fragmentation. However, these methods are labor-intensive and prone to individual errors [
2,
3]. Previous attempts to repair cracks and the structural systems of these buildings have proven ineffective due to a lack of investigation into the underlying causes of such damages. Prior to strengthening a damaged historical building, a comprehensive evaluation of its ground properties, load-bearing system, utilized materials, and historical background is imperative. These investigations should be conducted meticulously to determine the appropriate interventions and strengthening measures. Neglecting the essential investigative procedures may result in unintended damage during preservation efforts. Hence, a conscious effort must be made to safeguard and fortify historical structures before it becomes too late.
The visual inspection process is commonly employed to examine and assess the present condition of historical buildings. However, this method entails challenges in terms of its laborious and time-consuming nature, necessitating inspectors with expertise and specialized knowledge to evaluate structural conditions based on visual observations. Furthermore, the frequency of inspections is limited due to the associated high labor costs, proneness to human error, and difficulties in accessing inspection sites. Recognizing the constraints inherent in manual inspections, the use of computer vision-based assessments has become increasingly prominent in the field of engineering [
4,
5,
6,
7,
8]. Notably, the application of computer vision techniques for crack detection has garnered substantial attention from researchers. Image-based crack detection emerges as an exemplary non-destructive evaluation approach, particularly valuable for historic structures subject to stringent regulations that limit even minor interventions mandated by conservation authorities.
In recent years, deep learning (DL), a branch of machine learning that is widely applied to image-based inspection, has emerged as a powerful method for analyzing structural damage. One prominent DL tool is the convolutional neural network (CNN), which has proven effective in this context. Unlike traditional machine learning approaches, DL learns features autonomously and detects damage without manual feature definition. In the crack detection phase, photographs serve as input, and the network outputs the identified cracks without manual intervention [
9,
10,
11]. Particularly, studies conducted on classifying damage types in historical monuments and structures, as well as determining the affected areas, have significantly elevated the relevance of deep learning methodologies in the present day [
12,
13,
14,
15,
16,
17].
Segmentation methods represent an advanced approach for examining cracks in structures. Within this domain, classic networks, convolutional neural networks (CNN), and fully convolutional networks (FCN) are noteworthy examples of deep learning-based segmentation methods. To achieve satisfactory accuracy, CNN and FCN architectures require a large, carefully labeled (annotated) training dataset [
11,
18,
19,
20,
21,
22,
23,
24,
25]. Nonetheless, these architectures exhibit a high potential for detecting intricate features. Another technique called DL-based transfer learning can be employed, entailing the pre-training of a CNN model using a larger dataset, subsequently enhancing performance when training with smaller datasets. Transfer learning is applicable to any CNN and FCN (fully convolutional networks) architecture [
26]. Deep learning-based automated crack detection systems have been developed for masonry structures [
27,
28]. These systems utilize deep convolutional neural networks and image-processing techniques to detect cracks on concrete and masonry surfaces [
29,
30]. The models are trained on labeled and augmented image datasets, achieving high accuracy in crack classification. The proposed methods have been benchmarked on historical masonry structures, such as the Khaju Bridge in Iran, and have demonstrated accurate identification of both major and minor crack patterns. These automated crack detection systems provide a key step towards automated damage inspection and health evaluation for infrastructure, enabling reliable retrofitting interventions and conservation strategies. Cardellicchio et al. [
31] focused on combining visual inspections with automated recognition of bridge defects using deep learning techniques. Their aim is to develop non-invasive investigation protocols for effective defect control and interpretation of results for bridge inspections. Cardellicchio et al. [
32] studied using YOLOv5, a single-stage object detection model, for automatic defect detection on existing bridges. A database of typical defects was gathered and labeled by domain experts, and YOLOv5 was trained, tested, and validated. The results showed good effectiveness and the accuracy of the proposed methodology, demonstrating the potential of artificial intelligence for automatic defect detection on bridges.
Recovering compressed images for automatic crack segmentation can be achieved using generative models. Generative adversarial networks (GANs) have been used in several papers to address this problem. One approach replaces the sparsity regularization with a generative model that effectively captures a low-dimensional representation of targeted images [
33]. Another paper proposes a road crack segmentation method based on GANs, where the generator generates fake crack images similar to real ones, and the discriminator distinguishes between real and fake crack images [
34]. Furthermore, a GAN-based neural network called CrackSegAN is proposed for automatic pavement crack segmentation, achieving high F1 scores on different datasets [
35]. Finally, an automatic concrete infrastructure crack semantic segmentation method is proposed using deep learning, which shows optimal performance for crack image segmentation [
36].
U-Net is a deep fully convolutional network (FCN) specifically designed for biomedical image segmentation and performs well when only a limited amount of training data is available [
37]. Leveraging its proficiency in detecting fine edges, U-Net has emerged as a benchmark for image segmentation across various domains and has found extensive utility in structural examinations [
38]. Notably, U-Net has been successfully employed for segmenting brick or stone elements and for detecting defects within masonry structures. Moreover, the U-Net architecture is a suitable choice for the automatic documentation of masonry structures through digital images and for structural inspections, fulfilling the necessary algorithmic requirements [
26,
39].
In this research, a novel crack detection system is proposed for the segmentation of cracks found in historical masonry structures. The system utilizes a pre-trained convolutional neural network (CNN) model for deep learning. The model is developed by integrating Grad-CAM visualization [
40] and K-Mean clustering methods [
41] with pre-trained CNN models, resulting in the CAM-K-SEG segmentation model. A dataset comprising photographs captured from numerous historical buildings is utilized to train the model. The performance of the proposed method was gauged through a comparative analysis with the U-Net segmentation model.
2. Methodology
Figure 1 depicts the detailed structure of the proposed CAM-K-SEG model [
42], which combines the class activation map (CAM) method and U-Net model for crack segmentation. Initially, transfer learning was employed to perform classification analysis on image data categorized as cracked or non-cracked using pre-trained models. The CAM technique was utilized to identify the crack class within the dataset and highlight distinctive image regions. This enabled a deeper understanding of the areas crucial for image classification with the employed deep learning model. Several pre-trained CNN models, including VGG-16 and VGG-19 [
43], Inception-V3 [
44], Xception [
45], and ResNet50 [
46], alongside a custom CNN model, were utilized in this study, and their performance metrics were compared. The best-performing model was selected for subsequent analyses. Using this model, a deep learning-based segmentation model, referred to as the CAM-K-SEG model, was developed to detect crack areas in historical masonry images. The model was obtained by combining the Grad-CAM visualization and K-Mean clustering methods with pre-trained CNN models. For the segmented crack areas, bounding boxes were generated and compared with ground truth bounding boxes, and the IoU (Intersection-over-Union) metric values were computed. Additionally, the U-Net method was employed to detect cracks in the walls of historical masonry buildings. This approach involved manual labeling for crack area detection and semantic segmentation tasks on crack damage images. The dataset was partitioned into sets for training and testing, ensuring that the test images were not employed during the training phase. After training, predictions were compared with ground truth masks, and IoU values were obtained. The obtained IoU performance metrics between U-Net and CAM-K-SEG were subsequently compared.
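The bounding-box comparison described above can be illustrated with a minimal sketch. It assumes axis-aligned boxes stored as `(x_min, y_min, x_max, y_max)` tuples; the function name `box_iou` and this coordinate convention are illustrative choices, not taken from the study.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so that disjoint boxes give no overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground truth boxes, in pixels.
predicted = (0, 0, 10, 10)
ground_truth = (5, 5, 15, 15)
iou = box_iou(predicted, ground_truth)
```

An IoU of 1 means the predicted and ground truth boxes coincide, while 0 means they do not overlap at all.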
In addressing the methodologies employed in this study, it is pertinent to elucidate the distinct yet complementary roles of the CAM-K-SEG and U-Net models. While these models inherently serve different functions, their integration within the research framework is strategically aligned with the overarching objective of enhancing crack detection in masonry structure walls. The U-Net model, renowned for its acute sensitivity and precision in segmenting crack areas, excels in providing detailed pixel-level analysis. This granular approach is indispensable for a thorough assessment of crack severity and extent. Conversely, the CAM-K-SEG model, a novel development in the study, is designed to identify the presence and locations of cracks. Utilizing a rectangular area methodology to compute Intersection-over-Union (IoU) metrics, the CAM-K-SEG model adeptly pinpoints potential areas of structural concern, offering a rapid, macro-level perspective. This dual-model approach, encompassing both the U-Net and CAM-K-SEG models, thus furnishes a comprehensive toolkit for crack detection. It harmonizes the detailed segmentation capabilities of the U-Net model with the broader, location-focused analysis of the CAM-K-SEG model, thereby ensuring a holistic assessment of structural integrity in historical masonry structures.
This study’s primary theoretical contribution lies in the development and validation of the CAM-K-SEG model, a novel approach that integrates class activation mapping with deep learning for the precise detection of cracks in historical masonry structures. This integration represents a significant advancement in the application of deep learning techniques to structural health monitoring, particularly in the nuanced field of heritage conservation.
Methodologically, the study distinguishes itself in several ways:
Innovative Model Architecture: The CAM-K-SEG model’s architecture, underpinned by the ResNet50 framework, is specifically tailored to enhance crack detection accuracy and efficiency, demonstrating a novel application in the realm of structural analysis.
Segmentation and Localization: A pivotal aspect of our methodological contribution is the innovative integration of K-Mean clustering with classification techniques using pre-trained models. This novel approach represents a significant methodological advancement in the field of structural analysis, specifically tailored for the identification and delineation of crack areas in complex masonry textures.
K-Mean Clustering for Enhanced Segmentation: The use of K-Mean clustering in our model facilitates the effective segmentation of crack areas. This technique allows for the precise grouping of similar pixels, aiding in the accurate identification of crack patterns amidst the diverse textures of masonry structures.
Classification with Pre-trained Models: The application of classification methodologies using pre-trained models, such as ResNet50, offers a robust framework for the localization and classification of identified crack areas. This integration enhances the model’s ability to accurately detect and categorize cracks, leveraging the strengths of advanced deep learning architectures.
Dataset Curation and Processing: The creation of a specialized dataset, which encompasses a diverse range of crack patterns under varying environmental conditions, and its meticulous preprocessing and annotation represent a methodological advancement in dataset preparation for structural analysis tasks.
Performance Benchmarking: The comparative analysis of the model against established models like U-Net in terms of segmentation capabilities provides a novel perspective on the efficacy of different deep learning approaches in the context of crack detection.
Practical Application and Scalability: The approach demonstrates practicality and scalability, emphasizing less labor-intensive methods and potential adaptability to various types of masonry and environmental conditions, thus broadening the scope of application in structural health monitoring.
2.1. Dataset
Dataset Description: This study employed a specialized dataset titled “High-Resolution Crack Imagery for Structural Analysis of Historical Masonry in Trabzon Province”, meticulously assembled to support the detection and analysis of crack patterns on historical masonry walls. This section elaborates on the dataset’s source, features, processing, and labeling.
Source and Composition of the Dataset: A collection of 502 high-resolution images was curated from various historical masonry buildings across Trabzon Province, Turkey, capturing a diverse array of crack instances. These images incorporate variability in seasonal weather and lighting conditions to ensure robustness and generalizability. Each photograph was taken using a 12-megapixel smartphone camera, providing the necessary detail for subsequent analysis.
Features and Processing: The dataset showcases a vast spectrum of crack patterns, essential for the comprehensive structural analysis aimed at preserving historical structures. To maintain consistency across the dataset, original images of varying sizes were resized to a uniform dimension of 128 × 128 pixels.
Labeling and Annotation: A meticulous manual labeling process was undertaken using Paint.net version 4.2.16, where each pixel was categorized as crack or non-crack, yielding a fully annotated set of images. For segmentation purposes, U-Net methodology was employed, with crack areas masked to generate a corresponding set of labeled images. Data augmentation techniques, including rotation, closure, and translation, were applied to enrich the dataset and bolster the model’s predictive localization performance.
Ground Truth and Partitioning: For the purpose of crack damage detection, LabelImg version 1.8.3 was utilized to mark the images with ground truth bounding boxes, facilitating the identification of crack regions. The dataset was methodically divided into training, validation, and testing sets in a 7:2:1 ratio, providing a structured framework for training and evaluating the pre-trained models.
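As a rough illustration of the 7:2:1 partitioning, the sketch below shuffles image indices and cuts them into three disjoint subsets. The function name `split_indices`, the fixed seed, and the rounding rule (any remainder goes to the test set) are assumptions made for this example; the study does not state how the split was implemented.

```python
import random


def split_indices(n_items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle item indices and split them into train/validation/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_train = int(ratios[0] * n_items)
    n_val = int(ratios[1] * n_items)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 502 images, as in the dataset described above.
train_idx, val_idx, test_idx = split_indices(502)
```

Because the three lists are slices of a single shuffled list, no image can appear in more than one subset.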
Evaluation Metrics: To assess the model’s performance, a subset of 10 images was selected for in-depth analysis. The Intersection-over-Union (IoU) metric was employed to evaluate the degree of accuracy in predicted segmentation against the manual annotations, thus validating the model’s efficacy in localizing and identifying crack areas.
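For pixel-level masks, the IoU can be sketched as follows. This is a minimal NumPy example, assuming binary masks in which nonzero pixels mark cracks; the name `mask_iou` and the convention of returning 1.0 when both masks are empty are illustrative assumptions.

```python
import numpy as np


def mask_iou(pred_mask, true_mask):
    """IoU between two binary masks of identical shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)

    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    intersection = np.logical_and(pred, true).sum()
    return float(intersection / union)


# Toy 4 x 4 masks: two crack rows each, overlapping in one row.
pred = np.zeros((4, 4), dtype=int)
pred[0:2, :] = 1
true = np.zeros((4, 4), dtype=int)
true[1:3, :] = 1
iou = mask_iou(pred, true)
```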
This study utilizes a specialized dataset comprising a select number of high-resolution images, with each meticulously focused on crack areas within masonry walls. This dataset is characterized by its diversity in crack patterns and conditions, specifically chosen to enhance the accuracy of the deep learning model in crack detection. Concentrating exclusively on the cracks ensured clarity and detail in the features crucial for structural analysis, thereby eliminating background distractions. This targeted approach is instrumental in developing a reliable tool for identifying and assessing structural integrity in historical masonry structures, with potential future expansions to include broader structural contexts.
Figure 2 illustrates examples of (a) a historical masonry wall and (b) a crack dataset.
2.2. Models and Computational Sources
This research aimed to compare the performances of several convolutional neural network (CNN) models, namely the custom CNN model, VGG-16, VGG-19, Inception-V3, Xception, and ResNet50. The custom CNN model was specifically designed for this study by sequentially stacking layers to form a neural network architecture. The main goal of this model is to diminish the dimensionality of the input images and to extract significant patterns pertinent to each class. To achieve this, the custom CNN model employed five convolutional blocks, each with an increasing number of filters compared to the previous block, facilitating finer-scale feature extraction. Within each block, a ‘Conv2D’ convolutional layer was employed, applying 16 filters of size 3-by-3 pixels to different regions of the input image. This process generated 16 feature maps, which represented the spatial distribution of specific features in the image. Furthermore, within each block of the model, a ‘MaxPooling2D’ layer was employed to downsample the feature maps. This layer consolidates 2-by-2-pixel grids into a single pixel, retaining the highest activation value from each grid. Consequently, the custom CNN model showcased its proficiency in capturing and representing critical features for the image analysis task.
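The downsampling step performed by a 2-by-2 max pooling layer can be sketched in a few lines of NumPy. This is only an illustration of the operation on a single feature map, not the Keras layer used in the study; the function name `max_pool_2x2` is hypothetical.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Reduce each non-overlapping 2 x 2 grid to its highest activation."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]  # drop an odd last row/column
    # Group pixels into (h2, 2, w2, 2) blocks and take the maximum per block.
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))


feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(feature_map)  # shape (2, 2)
```

Each pooling step halves the height and width of the feature map, which is how the network progressively reduces the dimensionality of its input.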
The final convolutional block of the custom CNN model produces 512 feature maps, each with a size of 5 by 5 pixels. To generate predictions, these feature maps undergo dimensionality reduction through global average pooling (GAP). The GAP operation calculates the average activation value within each feature map, transforming the maps into a one-dimensional tensor. Applying the GAP layer completes the feature extraction phase of the model. The purpose of this layer is to ensure that the activation values across different layers are normalized and maintain a consistent scale. This normalization helps mitigate potential instability during the training process, which can arise from the propagation of excessively large values throughout the network. Following the feature extraction, a dense layer was added to make predictions for the two classes, namely crack and non-crack. The Softmax activation function was employed in this layer. To optimize the custom CNN model and its hyperparameters for the task at hand, the Adam optimizer was utilized with a batch size of 32. The hyperparameter optimization process involved conducting 30 objective function evaluations guided by empirical evidence [45,47]. The architecture of the custom CNN model is visually depicted in Figure 3.
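The classification head described above (GAP followed by a dense Softmax layer for two classes) can be sketched in NumPy. The feature maps and dense weights here are random placeholders, and all names are hypothetical; the sketch only shows the shapes and operations involved, not the trained model.

```python
import numpy as np


def global_average_pooling(feature_maps):
    """Average each feature map: (height, width, channels) -> (channels,)."""
    return feature_maps.mean(axis=(0, 1))


def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
feature_maps = rng.random((5, 5, 512))           # 512 maps of 5 x 5 pixels
weights = rng.normal(scale=0.01, size=(512, 2))  # placeholder dense weights
bias = np.zeros(2)

pooled = global_average_pooling(feature_maps)    # one value per feature map
probs = softmax(pooled @ weights + bias)         # [P(crack), P(non-crack)]
```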
The pre-trained convolutional neural network (CNN) models in our study were initialized using weights from ImageNet. These models were then adapted by removing their fully connected layers and incorporating new layers, including global average pooling (GAP), a dropout layer with a ratio of 0.5, and a dense layer activated by a Softmax function. Additionally, the convolutional layer in these models uses zero-padding and 3 × 3 kernels and is designed with 1024 feature maps.
Figure 4 in our manuscript illustrates these modified CNN architectures, highlighting their tailored configurations for our specific application.
In the selection of networks for the deep learning model, a strategic approach is employed based on several key criteria. These included the networks’ proven performance in image recognition and feature extraction, their pre-training on extensive datasets like ImageNet, adaptability to transfer learning, computational efficiency, and previous success in similar applications such as structural health monitoring. Networks like ResNet50, VGG-16, VGG-19, Inception-V3, and Xception were chosen due to their robust capabilities in accurately identifying and analyzing diverse masonry textures and conditions. The decision was further reinforced by empirical evaluations on preliminary datasets, ensuring the networks’ practical effectiveness in detecting and segmenting cracks in historical masonry structures.
Throughout the transfer learning and custom CNN training phases, a range of performance metrics, including accuracy, area under the curve (AUC), sensitivity, specificity, and the F measure, were used to evaluate the performance of the models. To expedite the computational processes, the CUDA/CUDNN libraries and the Keras API with a TensorFlow backend were employed, leveraging GPU acceleration. The training and validation of these models were carried out on a Windows 11 PC, which was equipped with 32 GB of RAM and an NVIDIA Quadro RTX 4000 GPU.
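Most of the listed metrics follow directly from the binary confusion matrix. The sketch below uses the standard textbook definitions and hypothetical counts; it also includes precision and the Matthews correlation coefficient, which are reported in the Results. The AUC is omitted because it requires the predicted scores rather than the counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (crack = positive class)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    pr_sum = precision + sensitivity
    f_measure = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```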
2.3. Transfer Learning
The use of pre-trained models as initial parameters for a different task is called transfer learning [36]. This method is frequently used in deep learning problems because it both saves time and yields high accuracy. Collecting data and designing complex models for each new image processing problem is very difficult; with transfer learning, it is possible to achieve higher performance with less data.
Transfer learning takes models that were pre-trained to solve a different problem and uses their weights as the starting point for the target problem, providing faster solutions with higher performance. Solving problems with deep learning methods normally requires a large amount of data in order to avoid overfitting. With transfer learning, pre-trained weights are used instead of training the network from random initial values [48,49]. In this way, convolutional neural networks can be trained effectively with less data.
In this study, the custom CNN model and then the transfer learning model were used for classification on the same dataset. The custom CNN achieved a success rate of 80% (±2), whereas the transfer learning model reached 95% (±1). These results indicate that the transfer learning approach is beneficial for this task.
Figure 5 depicts a transfer learning architecture.
In the proposed study, the weights of the pre-trained models are retained by setting the lower layers to non-trainable. This approach preserves the features learned during their original training. A global average pooling layer is appended to the final convolutional layer, calculating the mean of each feature map and reducing it to a single value. The resulting feature vector is then processed through a series of layers (dense layers, dropout layers, and another dense layer) before entering the Softmax layer, which produces the class probabilities.
The loss function used in this multi-class classification task is categorical cross-entropy, which was chosen for its effectiveness in such scenarios. The training process spans over 30 epochs with a batch size of 32, employing the Adam optimizer, which is known for its efficiency in handling large datasets and complex architectures, with a learning rate set at $10^{-4}$. This particular learning rate was chosen for its balance between speed and stability in convergence.
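To make the loss concrete, categorical cross-entropy for one-hot labels can be written as a short NumPy function. This is a generic sketch of the standard definition, not the framework implementation used in the study; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    # Only the log-probability of the true class contributes per sample.
    per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    return float(per_sample.mean())


# Two samples: columns are [crack, non-crack]; values are illustrative.
labels = [[1, 0], [0, 1]]
predictions = [[0.9, 0.1], [0.2, 0.8]]
loss = categorical_cross_entropy(labels, predictions)
```

The loss approaches zero as the probability assigned to the true class approaches one, and grows quickly for confident wrong predictions.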
Through the careful evaluation of various pre-trained models, it was found that the ResNet50 model stands out in terms of performance. This model’s superior results demonstrate its robustness and suitability for the proposed specific application in identifying and classifying features within the dataset.
ResNet50
ResNet50 [46] consists of 50 layers and has approximately 23 million parameters. It is a widely used architecture for image recognition and object classification. Like the other ResNet architectures, it aims to alleviate the degradation problem that hampers the training of very deep networks. It works on the principle that the input of a block is added, through a shortcut connection, to the output of the convolutional layers that follow, so that these layers only need to learn a “residual”. The ResNet family is built from such residual blocks, and ResNet50 uses the bottleneck variant of the block.
Figure 6 depicts the ResNet architecture.
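Equation (1) expresses the output of the residual building block, where x is the block input, F(x) is the residual mapping learned by the stacked layers, and H(x) is the desired underlying mapping:
$$H(x) = F(x) + x \tag{1}$$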
Since the network contains so many layers, learning the mapping H(x) directly becomes increasingly difficult. For this reason, a skip connection (shortcut) is used: the input x is passed directly to the output of the block, and the stacked layers learn only F(x) = H(x) − x. As a result, F(x) is referred to as the “residual”, as shown in Figure 7.
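The skip connection can be illustrated with a deliberately simplified residual block. In this sketch, dense (matrix) layers stand in for the convolutions, and a final ReLU is applied after the addition as in the original ResNet design; the weights and names are placeholders rather than the ResNet50 implementation.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Simplified residual block: output = ReLU(F(x) + x)."""
    f = relu(x @ w1) @ w2  # residual branch F(x): two stacked layers
    return relu(f + x)     # skip connection adds the input back


rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0])
w1 = rng.normal(size=(3, 3))
w2 = rng.normal(size=(3, 3))
out = residual_block(x, w1, w2)

# If the residual branch outputs zero, the block passes its input
# through (apart from the final ReLU).
identity_out = residual_block(x, w1, np.zeros((3, 3)))
```

This is why very deep residual networks remain trainable: a block can fall back to the identity mapping simply by driving F(x) toward zero.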
2.4. Segmentation Methods
2.4.1. CAM-K-SEG Method
Grad-CAM is a technique that combines class activation maps and saliency maps to highlight the importance of pixels with respect to a particular class label. It leverages the information present in the last convolutional layers to rank the significance of each pixel in relation to the target class. To achieve this, the gradient scores of each class are mapped onto the image features, and the associated gradient weights are computed using global average pooling, as described in Equation (2), where $y^c$ is the score of class c:
$$\alpha_m^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^m} \tag{2}$$
In Equation (2), the weight matrix, $\alpha_m^c$, holds one weight for each feature map m and class label c and is responsible for capturing the relevance of feature map m for class c. The activation map $A^m$ consists of Z pixels, and each of these pixels is assigned a gradient score $\partial y^c / \partial A_{ij}^m$. After its computation, the weight matrix is multiplied by the output of the last convolutional layer. This product is then activated using the Rectified Linear Unit (ReLU) activation function [50], as expressed in Equation (3). This approach, consistent with the one employed in the earlier implementation of the ResNet-50 model [46], leverages the ReLU function’s ability to introduce non-linearity, enhancing the model’s learning capabilities.
$$L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left(\sum_{m}\alpha_m^c A^m\right) \tag{3}$$
In Equation (3), $L_{\text{Grad-CAM}}^c$ denotes the Grad-CAM localization map for class c, with width u and height v. This map yields a heatmap of size $u \times v$ that accentuates the pixels most relevant for the segmentation of the crack class. Integral to the Grad-CAM algorithm is the use of the ReLU function, which assigns positive weights to pixels identified as significant for the target class while nullifying the influence of less important pixels by assigning them zero weights.
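The Grad-CAM computation can be sketched in NumPy once the activations of the last convolutional layer and the gradients of the class score with respect to them are available. Obtaining those two arrays from a trained network is framework-specific and is not shown; the function name `grad_cam` and the final rescaling to [0, 1] (a common step for display) are assumptions of this example.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM heatmap from last-layer activations and class-score gradients.

    Both inputs have shape (u, v, M): spatial size u x v and M feature maps.
    """
    # Global average pooling of the gradients: one weight per feature map.
    weights = gradients.mean(axis=(0, 1))          # shape (M,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    cam = np.maximum(activations @ weights, 0.0)   # shape (u, v)
    if cam.max() > 0:
        cam = cam / cam.max()                      # rescale for display only
    return cam


# Toy example with two 2 x 2 feature maps.
activations = np.zeros((2, 2, 2))
activations[:, :, 0] = [[1.0, 2.0], [3.0, 4.0]]
activations[:, :, 1] = 1.0
gradients = np.zeros((2, 2, 2))
gradients[:, :, 0] = 2.0  # only the first feature map supports the class
heatmap = grad_cam(activations, gradients)
```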
In the proposed research, we developed and implemented a segmentation framework named CAM-K-SEG, designed to automatically generate labeled masks for classified object images. This framework combines the
Grad-CAM visualization technique and the K-Mean clustering algorithm with pre-trained classification models, facilitating the precise prediction of surface segmentation within various object classes. As detailed in
Figure 8, the methodology for constructing the CAM-K-SEG model involved an initial application of the
Grad-CAM method for the localization of target objects. This approach leveraged the outstanding classification performance of the ResNet50 pre-trained model, utilizing its advanced capabilities to enhance the effectiveness and accuracy of our segmentation process. Subsequently, the K-Mean clustering algorithm was applied to determine the optimal K value using the elbow method and identify the number of effective colors. Thresholding was then utilized to convert the heatmap image into a binary image based on a specific pixel intensity threshold, facilitating the extraction of dominant foreground and background objects. This approach resulted in the development of an algorithm that achieved optimal performance in object segmentation. The predicted crack areas were denoted by red rectangular boxes (predicted bounding boxes), while the actual crack areas were represented by green rectangular boxes (ground truth bounding boxes), which were added to the images. By comparing the coordinates obtained from these two bounding boxes, IoU metrics were computed for each image. The framework we have proposed is composed of three primary components: image classification for segmenting crack areas, class activation mapping (CAM) visualization, and K-Mean color clustering applied to the masked crack regions.
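The step from heatmap to predicted bounding box can be approximated with the sketch below. It is a simplified stand-in for the pipeline described above: K is fixed at 2 instead of being chosen with the elbow method, clustering is run on scalar heatmap intensities rather than colors, and the cluster with the highest centroid is taken as the crack region. All function names are hypothetical.

```python
import numpy as np


def assign_labels(values, centroids):
    """Index of the nearest centroid for every value."""
    return np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)


def kmeans_1d(values, k=2, n_iter=50):
    """Lloyd's algorithm on scalar intensities."""
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        labels = assign_labels(values, centroids)
        updated = centroids.copy()
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                updated[j] = values[labels == j].mean()
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign_labels(values, centroids), centroids


def heatmap_to_box(heatmap, k=2):
    """Binary foreground mask and its box (x_min, y_min, x_max, y_max)."""
    values = np.asarray(heatmap, dtype=float).ravel()
    labels, centroids = kmeans_1d(values, k)
    # Assumption: the brightest cluster corresponds to the crack region.
    mask = (labels == centroids.argmax()).reshape(np.shape(heatmap))
    rows, cols = np.nonzero(mask)
    box = (int(cols.min()), int(rows.min()),
           int(cols.max()) + 1, int(rows.max()) + 1)  # max edges are exclusive
    return mask, box


# Toy heatmap with one bright rectangular region.
heatmap = np.zeros((8, 8))
heatmap[2:5, 3:7] = 1.0
mask, box = heatmap_to_box(heatmap)
```

The resulting box can then be compared with the ground truth box through the IoU metric.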
2.4.2. U-Net Segmentation Model
U-Net, as conceptualized by Ronneberger et al. [37], is a convolutional neural network (CNN) architecture tailored for image segmentation. Characterized by its distinctive U-shape, depicted in Figure 9, the architecture features a symmetrical design with skip connections linking the downsampling and upsampling paths. U-Net is renowned for its various advantages, including reduced training time, simplicity, and a smaller parameter count compared to other networks, making it highly effective even with limited training data.
The downsampling path on the left side of the U-shape comprises four blocks. Each block contains two 3 × 3 convolution layers, each followed by an activation function (including batch normalization) and a 2 × 2 max pooling layer. With each pooling step, the number of feature maps doubles, allowing the network to extract and process increasingly complex features from the input image. This path is crucial for capturing contextual information necessary for accurate segmentation.
Conversely, the upsampling path also consists of four blocks. It involves deconvolution layers that merge with feature maps from the downsampling path, additional 3 × 3 convolution layers with activation functions (including batch normalization), and a final 1 × 1 convolution operation to produce the segmented image with the required number of channels.
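The way spatial size and channel count evolve along the two paths can be traced with a small helper. This is only a bookkeeping sketch: the 128-pixel input matches the resized images described in the dataset section, but the starting width of 64 filters (the value in the original U-Net) and the use of size-preserving convolutions are assumptions, as the study does not state them here.

```python
def unet_shape_trace(input_size=128, base_filters=64, depth=4):
    """Return (spatial size, channels) for each encoder block, the bottleneck,
    and each decoder block of a U-Net with size-preserving convolutions."""
    encoder = []
    size, filters = input_size, base_filters
    for _ in range(depth):
        encoder.append((size, filters))  # after the two 3 x 3 convolutions
        size //= 2                       # 2 x 2 max pooling halves the size
        filters *= 2                     # number of feature maps doubles
    bottleneck = (size, filters)
    # The decoder mirrors the encoder; each block receives the matching
    # encoder feature maps through a skip connection.
    decoder = list(reversed(encoder))
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shape_trace()
```

The symmetry of the two lists reflects the U-shape: every decoder block has an encoder block with the same spatial size whose feature maps it can merge.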
The U-Net architecture’s strengths lie in its ability to handle limited training data, recognize and fuse multi-scale features, maintain simplicity, and deliver high-quality pixel-level segmentation results. Since its introduction, U-Net has become highly popular in medical imaging and has seen various advancements and modifications, incorporating new methodologies and integrating different imaging techniques. Its application has extended beyond medical imaging, demonstrating its versatility and adaptability to new challenges and diverse image datasets.
3. Results
3.1. Performance Metrics Evaluation for CNN Models
Optimal hyperparameter values selected for both custom and pre-trained CNN models are detailed in
Table 1.
Table 2 presents the performance metrics achieved by the various models in identifying crack classes on a test set of images labeled ‘crack’ and ‘no-crack’. The pre-trained ResNet50 model outperforms the other models on every metric in Table 2 except precision, namely accuracy, recall, F measure, and Matthews correlation coefficient. Consequently, the Grad-CAM visualization technique, integrated with the ResNet50 model, was chosen for the development of the CAM-K-SEG model. The receiver operating characteristic (ROC) curve and precision–recall graphs, displayed in Figure 10, further support the ranking given in Table 2.
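For reference, the metrics named above can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the standard definitions, with the F measure taken as the F1 score; the function name `binary_metrics` and the example counts are hypothetical and are not values from Table 2.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    'crack' is treated as the positive class and 'no-crack' as the negative.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # Matthews correlation coefficient: ranges from -1 to 1 and stays
    # informative when the two classes have different sizes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
example = binary_metrics(tp=45, fp=5, fn=5, tn=45)

if __name__ == "__main__":
    for name, value in example.items():
        print(f"{name}: {value:.3f}")
```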
3.2. Visual Localization Evaluation
CAM-K-SEG Model
The “conv5_block3_3_conv” layer of the ResNet50 model, essential for feature extraction, has exhibited outstanding performance in distinguishing between crack and no-crack image classes. To assess the precision of Grad-CAM in pinpointing crack areas, the ResNet50 model, which demonstrated the highest efficacy in crack detection, was applied to a test set specifically curated for crack classification.
Figure 11 illustrates the localization outcomes achieved using the Grad-CAM technique with the ResNet50 model’s final convolutional feature extraction layer. The heatmaps concentrate on the regions containing cracks, indicating that this visualization method localizes crack areas accurately.
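The Grad-CAM computation behind such heatmaps can be sketched in a few lines. The sketch follows the general Grad-CAM formulation (channel weights obtained by averaging the gradients spatially, followed by a weighted sum of the feature maps and a ReLU) and is not the implementation used in this study. In practice the feature maps and gradients would be taken from the chosen convolutional layer by the deep learning framework; here they are small hypothetical nested lists, and the function name `grad_cam` is an assumption.

```python
def grad_cam(feature_maps, gradients):
    """Compute a normalized Grad-CAM heatmap.

    feature_maps: list of K maps, each an H x W nested list (layer activations).
    gradients:    list of K maps of the same shape (gradient of the class
                  score with respect to each activation).
    """
    # Channel weights: global average of each gradient map.
    weights = []
    for grad in gradients:
        values = [g for row in grad for g in row]
        weights.append(sum(values) / len(values))

    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    heatmap = [[0.0] * width for _ in range(height)]

    # Weighted sum of the feature maps.
    for weight, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]

    # ReLU keeps only regions with a positive influence on the class score.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]

    # Scale to [0, 1] so the map can be rendered as a colour overlay.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


if __name__ == "__main__":
    maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    print(grad_cam(maps, grads))
```

Because of the ReLU, regions whose channels carry negative weights are suppressed entirely, which is consistent with the observation that Grad-CAM emphasizes only the most strongly weighted regions.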
Figure 12 illustrates the sequential steps of the CAM-K-SEG method. The process begins with the original input image (Figure 12a), from which a heatmap is generated using the Grad-CAM visualization technique based on the ResNet50 model (Figure 12b). The original image is then overlaid with the heatmap to form a composite image (Figure 12c), and K-means clustering is applied to this composite to isolate the most effective colors (Figure 12d). Next, the image is converted to grayscale and filtered, which includes eliminating small, irrelevant objects (Figure 12e), and then converted into a binary format (Figure 12f). Finally, the image is segmented (Figure 12g), and this segmented image is the outcome of the CAM-K-SEG process.
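The later stages of this pipeline (binarization, removal of small objects, and extraction of a bounding box around what remains) can be sketched as follows. This is a simplified illustration rather than the study's implementation: the K-means step is omitted, the threshold and minimum object size are hypothetical parameters, small objects are removed after binarization for simplicity, and all function names are assumptions.

```python
from collections import deque


def binarize(gray, threshold):
    """Convert a grayscale image (nested list, 0-255) to a 0/1 mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in gray]


def connected_components(mask):
    """Return 4-connected foreground components as lists of (row, col)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


def remove_small_objects(mask, min_size):
    """Keep only components with at least min_size pixels."""
    cleaned = [[0] * len(mask[0]) for _ in mask]
    for component in connected_components(mask):
        if len(component) >= min_size:
            for y, x in component:
                cleaned[y][x] = 1
    return cleaned


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) in inclusive pixel indices, or None."""
    coords = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    gray = [
        [0, 0, 0, 0, 200],
        [0, 180, 0, 0, 0],
        [0, 190, 0, 0, 0],
        [0, 210, 220, 0, 0],
    ]
    mask = remove_small_objects(binarize(gray, threshold=128), min_size=2)
    print(bounding_box(mask))
```

In the toy image, the isolated bright pixel in the top-right corner is discarded as a small object, so the bounding box encloses only the elongated, crack-like component.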
3.3. Comparisons
The present study employed two distinct approaches for the detection and localization of cracks that develop over time in historical masonry structures due to external factors. The first method, CAM-K-SEG, encloses the crack regions identified by the model in bounding boxes, which are then compared with the ground truth bounding boxes. This comparison enables the computation of the Intersection-over-Union (IoU) metric values using Equation (4). The second method, the U-Net segmentation model, was used for the precise segmentation of crack areas. To facilitate this, the actual crack regions were masked, resulting in the creation of a new training dataset for the model. Both approaches were evaluated with the IoU metric, which is commonly used for assessing segmentation algorithms. The IoU metric measures the similarity between two masks, namely the predicted region and the ground truth region, and is calculated as the area of their intersection divided by the area of their union.
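As an illustration of the metric, IoU can be computed for both representations used in this study, bounding boxes and pixel masks. The sketch below uses the standard definition (intersection area divided by union area); the function names and the corner-coordinate box convention `(x_min, y_min, x_max, y_max)` are assumptions made for the example.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    Coordinates are treated as corner positions, so the width of a box is
    x_max - x_min.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized nested lists."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    return intersection / union if union else 0.0


if __name__ == "__main__":
    # Two 10 x 10 boxes that overlap over half their width.
    print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
    # Two small masks sharing one of three foreground pixels.
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```

A box-based score and a mask-based score measure overlap on different representations of the crack, so values obtained in the two ways are not directly interchangeable.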
Figure 13 shows the analyses performed on five different cracked wall samples. In each sample, the green bounding box marks the actual crack area, while the red bounding box marks the estimated crack area. For every sample processed with CAM-K-SEG, the segmented areas and the bounding boxes surrounding them contain real crack areas and fall within the ground truth bounding boxes, yet they do not reach the desired level of IoU scores. In some samples (Figure 13b), not all cracks are detected correctly because the Grad-CAM method estimates only the most heavily weighted regions.
The U-Net architecture was implemented in its original form and trained on the provided dataset for 100 epochs with a batch size of 16, where one epoch corresponds to one full pass over the training data. The resulting loss and accuracy values were 0.036 and 0.95, respectively. The loss and accuracy curves for the U-Net architecture at a resolution of 128 × 128 are shown in Figure 14. The training accuracy is represented by the orange curves, while the training loss is depicted by the blue curves.
Figure 15 shows the crack areas and IoU metric values obtained with the U-Net model. The same crack images were used to allow comparison with the CAM-K-SEG model. The IoU values obtained for these samples are low. Because the crack areas are very small, even a small wrongly segmented area makes a large difference to the overlap between the estimated and actual areas, which results in low IoU values. Based on the IoU metric values, the CAM-K-SEG model performs better than the U-Net model. It should be noted, however, that the IoU values being compared were determined by different methods, with bounding boxes for CAM-K-SEG and segmentation masks for U-Net.
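The sensitivity of IoU to small errors on thin cracks can be made concrete with a short calculation. The pixel counts below are hypothetical and chosen only to illustrate the effect; the function name is likewise an assumption. The sketch considers the simple case in which the prediction covers every true crack pixel and adds a fixed number of wrongly segmented pixels.

```python
def iou_with_extra_pixels(true_pixels, extra_false_positives):
    """IoU when the prediction contains all true pixels plus some wrong ones.

    The intersection equals the true area, and the union is the true area
    plus the wrongly segmented pixels.
    """
    intersection = true_pixels
    union = true_pixels + extra_false_positives
    return intersection / union


# The same 50 wrongly segmented pixels on two targets of different size.
thin_crack = iou_with_extra_pixels(true_pixels=50, extra_false_positives=50)
large_region = iou_with_extra_pixels(true_pixels=5000, extra_false_positives=50)

if __name__ == "__main__":
    print(f"thin crack:   IoU = {thin_crack:.3f}")
    print(f"large region: IoU = {large_region:.3f}")
```

An identical absolute error halves the IoU of the thin crack while leaving the IoU of the large region almost unchanged, which is why low IoU values on fine cracks do not necessarily indicate a visually poor segmentation.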
3.4. Discussion
In the comparative analysis, the performance of the CAM-K-SEG model was compared with that of U-Net to verify the effectiveness of the proposed model. The comparison shows that, while the U-Net model performs well on segmentation tasks, the CAM-K-SEG model exhibits comparable, if not superior, capabilities in segmenting crack areas within masonry structures and superior performance in detecting crack locations.
Therefore, the CAM-K-SEG model is not only validated by this comparison but is also shown to be a solution well suited to the subtle task of detecting cracks in walls. Its development was driven by the need for a structural analysis model that can not only segment but also effectively localize and classify crack areas.
In this study, the CAM-K-SEG model, leveraging the ResNet50 architecture, has shown a superior ability in object identification and localization of cracks, particularly when compared to traditional methods that largely depend on manual inspection and basic image processing. The model’s average IoU value of 0.70 is a testament to its effectiveness in accurately locating crack areas, a significant achievement given the complex textures and patterns often found in historical masonry structures. However, it is important to note that while the CAM-K-SEG model excels in identifying the location of cracks, it may not always precisely define the exact boundary regions of these cracks.
In contrast, the U-Net model, renowned for its image segmentation capabilities, demonstrates a high success rate in segmenting crack areas, achieving an average IoU value of 0.43. This performance is particularly notable when compared to other deep learning models used in similar applications. The U-Net model’s proficiency in segmenting small and intricate crack patterns effectively makes it an invaluable tool for detailed analysis of structural integrity.
Both the CAM-K-SEG and U-Net models stand out for their practicality, requiring less labor and proving to be cost-effective during the estimation and localization phases of crack areas. This represents a significant advancement over existing methods, which often necessitate extensive manual labor and struggle with the complex nature of historical masonry structures.
Additionally, the approach of utilizing cellphone images for crack detection introduces an element of accessibility and ease of use, which is not commonly found in many advanced methods. This adaptability makes the method more suitable for regular monitoring and assessment of historical structures, a critical aspect in their preservation and maintenance.
However, the initial training of the CAM-K-SEG model focused primarily on more pronounced cracks, which left a gap in performance on subtler crack patterns. To address this limitation, future studies should incorporate a dataset with finer cracks and modify the CAM-K-SEG model to enhance its sensitivity and accuracy in detecting these less obvious cracks. Expanding the dataset and refining the model in this way is intended to improve the model’s applicability in real-world scenarios, particularly in the preservation of historical structures, where early detection of minor cracks is crucial. This planned enhancement would address the model’s current limitations and contribute to structural health monitoring by providing a more comprehensive and reliable tool for crack detection.
In conclusion, this comparative study with state-of-the-art methods underscores the effectiveness of the proposed models. The CAM-K-SEG model excels in identifying and localizing crack areas, while the U-Net model is highly effective in the detailed segmentation of cracks. Both models offer significant improvements over traditional methods, providing more accurate, efficient, and cost-effective solutions for crack detection in historical masonry structures.
4. Conclusions
This research presents a comprehensive analysis of crack detection in historical masonry structures using deep learning methods. This study introduces the CAM-K-SEG model, which leverages the class activation mapping technique to detect cracks induced by environmental effects with a high degree of success. The model’s strength lies in its rapid training capabilities and minimal labor requirements for operation, significantly streamlining the process of crack detection and damage mapping for engineers.
The CAM-K-SEG model, underpinned by the ResNet50 architecture, has demonstrated exemplary performance in the experiments. It not only detects but also localizes the crack areas with remarkable accuracy, making it a potent tool for structural analysis. The model’s effectiveness is rooted in its ability to facilitate rapid, accurate damage assessment, which is crucial for the maintenance and preservation of heritage structures.
Concurrently, the U-Net model was deployed to assess its segmentation prowess on the same task. While the U-Net showed remarkable success in object segmentation, indicating its suitability for high-performance segmentation jobs, the findings suggest that the CAM-K-SEG model is more adept at object identification and localization. This distinction underscores the CAM-K-SEG model’s utility in scenarios where precise crack localization is paramount.
This study’s results indicate that the CAM-K-SEG model, using ResNet50, exhibits the best performance metrics among the models tested, specifically in the segmentation of crack areas based on pixels. Meanwhile, the U-Net model’s high performance in segmentation tasks is noted, reinforcing the model’s applicability in scenarios that demand detailed segmentation over simple localization.
In synthesizing these findings, it is asserted that the CAM-K-SEG model is a significant contribution to the field of crack detection in masonry structures. It is a method that not only facilitates rapid and accurate crack area detection but also embodies the potential for future applications across a broader spectrum of structural health monitoring tasks.
This research introduces a transformative approach to detecting cracks in historical masonry structures, leveraging the CAM-K-SEG deep learning model. This method significantly surpasses traditional inspection techniques in accuracy and efficiency, offering a cost-effective, accessible solution for structural health monitoring. By utilizing cellphone images, it enables easy and widespread application, even in challenging locations. Crucially, it facilitates early detection of structural issues, essential for the preventive maintenance and preservation of heritage buildings. The quantitative data provided by the proposed model also support informed, data-driven decision-making in repair and maintenance strategies, ensuring effective resource allocation and extending the lifespan of culturally significant structures.
Future work will focus on enhancing the model’s precision in detecting finer cracks, broadening its scope to a wider range of structures, and incorporating advanced data sources like drone imagery and 3D scanning. We also aim to explore semi-supervised learning to reduce manual labeling efforts, develop real-time crack detection applications, and enhance the interpretability of the model’s decisions. These efforts are geared towards refining our approach, making it more scalable and universally applicable in the field of structural health monitoring and the preservation of global cultural heritage.