2. Deep Learning Model Selected: Mask R-CNN
Mask Region-based Convolutional Neural Network (Mask R-CNN) is the DL model used in this work. This recent DL model, developed by Facebook AI Research [62], is a simple, flexible, and general framework for object detection and segmentation tasks in images. In addition, this DL model outperformed the existing DL models in instance segmentation, bounding-box object detection, and person keypoint detection using the Microsoft COCO dataset [65] as input dataset. This strong performance is what led to the selection of Mask R-CNN as the DL model in this work.
It should be noted that Mask R-CNN both classifies and localizes each object by drawing a bounding box in the image (object detection), and classifies each pixel of the image into a fixed set of categories, without differentiating object instances, by an image mask (semantic segmentation). In this way, on the one hand, it is possible to detect the positions of both surface and subsurface defects (object detection) and, on the other hand, to delimit their areas (semantic segmentation) from a thermal image in an automated way. Moreover, it is also possible to distinguish defects classified in the same category from one another by combining the two previous functions of Mask R-CNN (object detection + semantic segmentation = instance segmentation).
As for the architecture, Mask R-CNN is built on top of Faster R-CNN [66], another DL model used in some of the works described in the Introduction section. Mask R-CNN consists of two stages, the first of which is identical to the first stage of Faster R-CNN. This first stage uses another DL model (known as the backbone DL model), which extracts a feature map from each input image (i.e., from each element of the input dataset). Then, boxes with multiple scales and aspect ratios (denoted as ‘anchor’ boxes) are applied to the feature map, serving as references to simultaneously predict object bounding boxes and object scores. Readers should refer to [66] for a better understanding of the term ‘anchor’ box. The prediction process is performed by a Region Proposal Network (RPN). The RPN is a kind of Fully Convolutional Network (FCN), which, unlike traditional DL models, does not contain Fully Connected Layers (FCLs) in its architecture [67]. To better understand and further detail the first stage of the Mask R-CNN architecture before explaining the second stage, the convolutional, activation, pooling, and upsampling layers are briefly presented below.
A convolutional layer groups the weights connecting two consecutive hidden layers into a tensor (the convolutional kernel) of height × width × depth dimensions. Each depth layer of a convolutional kernel assigns a different importance to each part of each element of the input dataset, or to each part of each component of the hidden layer placed before the convolutional kernel in question. A convolutional kernel has a small height and width (1 × 1 pixel, 3 × 3 pixels, 5 × 5 pixels, and so on), and each of its depth layers slides over either the element of the input dataset or the component, performing an element-wise multiplication between its weights and the covered data. The central original value of the covered data is then replaced by the sum of the results of the corresponding multiplications (an operation known as convolution). Typically, the stride of the convolutional kernel is (1,1) for the height and width movement; in short, the stride parameter dictates how big the steps are when sliding the convolutional kernel over the element of the input dataset or the component [68]. Furthermore, extra rows and columns (typically with values equal to zero) are sometimes added to the edges of the element or component so that the convolutional kernel can slide over the most external data (known as padding) [69]. As for the depth of the convolutional kernel, the number of versions (sets of features) of an element or component that pass to the next convolutional layer depends on the depth value assigned to the kernel of that layer. In other words, a hidden layer has as many components as the depth value of the associated convolutional kernel: each receptor component receives the convolution results of the depth layer of the convolutional kernel related to it (i.e., the weights of a component are the same for each connection with the components of the previous hidden layer), and all the convolution results obtained are then added in each one. The idea is that the different depth layers of a convolutional kernel extract different sets of features from an element of the input dataset, and that these sets of features help to achieve the assigned task.
Figure 3 shows an example of a convolutional layer, adapting illustrations from reference [70].
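To make the sliding-kernel operation concrete, the following minimal sketch (illustrative only, not code from the original work; the 8 × 8 input and the kernel values are assumptions) applies one depth layer of a 3 × 3 convolutional kernel with stride (1,1) and zero padding:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=1):
    """Slide a 2-D kernel over a 2-D image (single depth layer),
    summing the element-wise products at each position."""
    image = np.pad(image, padding)  # zero padding on all edges
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

image = np.random.rand(8, 8)      # illustrative 8 x 8 input
kernel = np.array([[-1, 0, 1],    # illustrative 3 x 3 kernel (edge detector)
                   [-2, 0, 2],
                   [-1, 0, 1]])
features = conv2d(image, kernel)
print(features.shape)             # (8, 8)
```

With a padding of 1 and a stride of (1,1), the output keeps the 8 × 8 dimensions of the input, matching the behavior described above.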
Moreover, there is usually an activation function layer and a pooling or upsampling layer between two consecutive convolutional layers of a DL model. An activation function layer transforms the range of values of the set of features of a component into a range that makes the DL model work better. The most commonly used activation function layer is the Rectified Linear Unit (ReLU), which converts all negative values to 0 [71]. The pooling layer downsamples each element of the input dataset or component [72], while the upsampling layer does the opposite [73]. A pooling layer simplifies the information contained in the components after the application of the activation layer, summarizing the sets of features and, consequently, reducing the size of each component before moving to the next convolutional layer. Two common pooling layers are the average and max pooling layers, which respectively average the values and select the maximum of the values contained in a kernel of certain dimensions (usually 2 × 2) that slides over each set of features of each component. Meanwhile, an upsampling layer works by repeating the rows and columns of the set of features of each component with some weighting (such as bilinear interpolation) before moving to the next convolutional layer. The pooling layer downsamples the detected features and helps the DL model learning process to become approximately invariant to small translations of the input dataset, while the upsampling layer is necessary to upsample the features in order to generate an output with the same dimensions as the element of the input dataset. The upsampling layer is therefore a key layer for DL models that perform segmentation, such as Mask R-CNN.
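As an illustration of these two layers, the following minimal NumPy sketch (with an assumed 8 × 8 set of features) applies a ReLU activation followed by a 2 × 2 max pooling:

```python
import numpy as np

def relu(x):
    """Set all negative values to 0, leaving positive values unchanged."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Slide a size x size window with stride = size and keep the maximum,
    halving the height and width of the set of features when size = 2."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

features = np.random.randn(8, 8)   # illustrative 8 x 8 set of features
pooled = max_pool(relu(features))  # activation followed by pooling
print(pooled.shape)                # (4, 4)
```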
Figure 4 shows an example of an activation function layer and a pooling layer, adapting illustrations from references [70,74], and Figure 5 shows an example of an upsampling layer, using the bilinear interpolation method to upsample, adapting illustrations from reference [75].
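The bilinear upsampling of Figure 5 can be sketched as follows (an illustrative NumPy version assuming a doubling factor of 2; it is not the exact implementation used in the original work):

```python
import numpy as np

def bilinear_upsample(x, factor=2):
    """Upsample a 2-D set of features by bilinear interpolation:
    each output value is a weighted average of the 4 nearest input values."""
    in_h, in_w = x.shape
    out_h, out_w = in_h * factor, in_w * factor
    # Coordinates of each output pixel expressed in input-pixel units
    rows = np.linspace(0, in_h - 1, out_h)
    cols = np.linspace(0, in_w - 1, out_w)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, in_h - 1)
    c1 = np.minimum(c0 + 1, in_w - 1)
    wr = (rows - r0)[:, None]  # vertical interpolation weights
    wc = (cols - c0)[None, :]  # horizontal interpolation weights
    top = x[r0][:, c0] * (1 - wc) + x[r0][:, c1] * wc
    bottom = x[r1][:, c0] * (1 - wc) + x[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr

features = np.arange(16, dtype=float).reshape(4, 4)  # illustrative 4 x 4 input
print(bilinear_upsample(features).shape)             # (8, 8)
```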
Returning to FCLs, there is usually more than one FCL at the end of a DL model. In fact, the FCLs actually constitute the output layer of a DL model, where each set of features obtained after applying the pooling or upsampling layer to the last convolutional + activation layer of the DL model is an input to the FCL. The reason is that these sets of features are based on the sets of features extracted by the previous convolutional layers of the DL model. It should be noted that each set of features obtained from the last pooling/upsampling layer is flattened into a 1-D vector, of length equal to the height × width of the corresponding set of features, before applying the FCL. Generally, the first FCL takes each flattened vector, element-wise multiplies its values with the corresponding weights assigned to that vector, and adds the corresponding bias values to the result; the process is repeated with all the flattened vectors. Then, an activation function (typically ReLU) is applied to each output of the first FCL before moving to the second FCL. In classification tasks, the last FCL produces a list of class scores, and the number of outputs of this final FCL must be equal to the number of different classes defined according to the classification task [76]. Then, a softmax activation function takes the list of class scores obtained as inputs and converts them into probabilities that sum to one, where each output of the softmax activation function is interpreted as the probability of membership of a specific class. The class with the highest probability is the class assigned to the corresponding element of the input dataset [77].
Figure 6 shows an example of FCLs and a softmax activation function application, adapting illustrations from reference [78].
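The following sketch condenses the flatten → FCL → ReLU → FCL → softmax pipeline described above (the layer sizes, the three classes, and the random weights are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    """Convert class scores into probabilities that sum to one."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

features = rng.standard_normal((7, 7))    # illustrative final set of features
x = features.flatten()                    # 1-D vector of length 7 x 7 = 49

w1, b1 = rng.standard_normal((49, 32)), np.zeros(32)  # first FCL (49 -> 32)
w2, b2 = rng.standard_normal((32, 3)), np.zeros(3)    # last FCL (32 -> 3 classes)

hidden = np.maximum(x @ w1 + b1, 0)       # multiply-sum + bias, then ReLU
class_scores = hidden @ w2 + b2           # one score per class
probs = softmax(class_scores)
print(probs, probs.argmax())              # class with the highest probability
```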
Therefore, the difference between FCNs and FCLs lies in the way they operate: the former use convolutions followed by an activation function layer, a pooling layer, and/or an upsampling layer, while the latter use a multiplication and summation followed by an activation function for each input. Thus, the RPN obtains lists of class scores as FCLs do, but by changing the depth parameter of the convolutional kernels and by applying pooling layers.
Further detailing the architecture of the RPN, each ‘anchor’ box is first mapped to a lower-dimensional feature by a convolutional layer formed by a 3 × 3 × 256 convolutional kernel followed by a ReLU layer. Then, the output feature map is fed into two sibling convolutional layers: (i) one formed by a 1 × 1 × 2 convolutional kernel followed by a linear activation function, and (ii) the other formed by a 1 × 1 × 4 convolutional kernel followed by a linear activation function [79]. The first sibling convolutional layer predicts an object score for the ‘anchor’ box, and the second predicts an object bounding box for the ‘anchor’ box. The object score is the probability of an ‘anchor’ box representing an object or not (i.e., background) after applying the softmax activation function as the last step, while the object bounding box is a refinement of the ‘anchor’ box to better fit the object after applying a regression method as the last step. In other words, instead of the softmax activation function, a regression method is applied after the second sibling convolutional layer, specifically to the 4 × 1 feature map obtained from each ‘anchor’ box. These four features represent the percentage change in the position (x,y) of the centroid, in the height, and in the width of the corresponding ‘anchor’ box [66]. It should be noted that if several bounding boxes overlap too much on the same object, the one with the highest object score (i.e., the bounding box with the highest probability of representing an object) is the only one not discarded after the RPN, by applying a technique known as Non-Maximum Suppression (NMS) [80].
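A minimal NMS sketch is given below (illustrative; the (x1, y1, x2, y2) box format and the IoU threshold of 0.7 are assumptions of the example):

```python
import numpy as np

def iou(box, boxes):
    """Intersection over Union between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Keep the highest-scoring box and discard boxes overlapping it too much,
    repeating until no boxes remain."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [0.5, 0.5, 10.5, 10.5], [20, 20, 30, 30]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # boxes 0 and 2 kept; box 1 overlaps box 0 too much
```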
Focusing on the backbone DL model, a Residual Network of 100 convolutional layers and one FCL (ResNet101), combined with a Feature Pyramid Network (FPN), is the model selected. ResNet101 [81] is one of the most widely used DL models as a feature extractor (without taking into account the FCL), in which the first convolutional layers detect low-level features (e.g., edges and corners if the element of the input dataset is an image), and subsequent layers successively detect higher-level features (e.g., car, person, and sky if the element of the input dataset is an image). ResNet101 works with elements of the input dataset of 1024 × 1024 × 3 dimensions, and its convolutional layers are divided into five stages: (i) first stage, a convolutional layer formed with a 7 × 7 × 64 convolutional kernel followed by a ReLU and a max pooling layer; (ii) second stage, three blocks, each having three convolutional layers formed with a 1 × 1 × 64, 3 × 3 × 64, and 1 × 1 × 256 convolutional kernel, respectively, all followed by a ReLU layer; (iii) third stage, four blocks, each having three convolutional layers formed with a 1 × 1 × 128, 3 × 3 × 128, and 1 × 1 × 512 convolutional kernel, respectively, all followed by a ReLU layer; (iv) fourth stage, 23 blocks, each having three convolutional layers formed with a 1 × 1 × 256, 3 × 3 × 256, and 1 × 1 × 1024 convolutional kernel, respectively, all followed by a ReLU layer; and (v) fifth stage, three blocks, each having three convolutional layers formed with a 1 × 1 × 512, 3 × 3 × 512, and 1 × 1 × 2048 convolutional kernel, respectively, all followed by a ReLU layer. It should be noted that the resulting feature map obtained after the last convolutional layer of the fifth stage has 32 × 32 × 2048 dimensions. Thus, the feature map size is reduced by half and the depth of the feature map is doubled at each stage of ResNet101, thanks to the max pooling layer and the strides equal to 2 in some convolutional layers [79].
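As an illustration of one of these blocks, the following minimal PyTorch sketch implements a residual bottleneck block with the second-stage dimensions (batch normalization and other implementation details of ResNet101 are omitted for brevity, so this is a simplified illustration rather than the exact architecture):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """One ResNet block: 1x1 -> 3x3 -> 1x1 convolutions plus a skip connection."""
    def __init__(self, in_ch=64, mid_ch=64, out_ch=256, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        # Project the input when its shape differs from the block output
        self.shortcut = (nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride,
                                   bias=False)
                         if (in_ch != out_ch or stride != 1) else nn.Identity())

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return self.relu(out + self.shortcut(x))  # residual addition, then ReLU

block = Bottleneck()
print(block(torch.randn(1, 64, 256, 256)).shape)  # torch.Size([1, 256, 256, 256])
```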
As for the FPN [82], this DL model is introduced as an extension of ResNet101 to better represent the features of the objects of an image at multiple scales; in this way, the FPN improves the feature extraction of ResNet101. For that, the FPN is a top-down architecture with lateral connections, forming two different pyramids. The first pyramid takes the feature map obtained at the output of each stage of ResNet101 (except the output of stage 1). Then, the depth of each feature map is reduced to 256 by a 1 × 1 × 256 convolutional kernel, obtaining high-level features. Subsequently, the second pyramid takes the reduced feature map of the output of the last stage of ResNet101 and upsamples it in order to add it element-wise to the reduced feature map of the output of the previous stage. This result is then upsampled again in order to add it element-wise to the reduced feature map of the output of the third stage, and so on. All the outputs of the second pyramid are then subjected to a 3 × 3 × 256 convolutional kernel to create the feature maps (four in total) used by the RPN, representing low-level features. It should be noted that, in addition to applying ‘anchor’ boxes to these four feature maps, they are also applied to a fifth feature map obtained from a max pooling layer (reducing the dimensions by half) applied to the smallest feature map of the FPN.
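The top-down pathway with lateral connections can be sketched as follows (a PyTorch illustration; the names c2–c5 and p2–p6 follow the common FPN convention and, together with the nearest-neighbor upsampling, are assumptions of the example rather than details from the original text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Lateral 1x1 convolutions + upsample-and-add + final 3x3 convolutions."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # Reduce every stage output to a depth of 256 (first pyramid)
        laterals = [l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Upsample and add element-wise, from the deepest map downwards
        for i in range(3, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], scale_factor=2, mode="nearest")
        p2, p3, p4, p5 = (s(x) for s, x in zip(self.smooth, laterals))
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # fifth map for the RPN
        return p2, p3, p4, p5, p6

fpn = FPNTopDown()
feats = [torch.randn(1, c, s, s) for c, s in
         zip((256, 512, 1024, 2048), (256, 128, 64, 32))]
print([p.shape[-1] for p in fpn(*feats)])  # [256, 128, 64, 32, 16]
```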
The second stage of Mask R-CNN is different from the second stage of Faster R-CNN. In addition to applying the bounding box recognition branch to the outputs of the RPN predicted with a positive object score (RPN outputs known as candidate object bounding boxes), Mask R-CNN adds a parallel mask prediction branch. The bounding box recognition branch predicts, for each candidate object bounding box, the final object class (here the type of object is predicted, as opposed to the RPN, which only differentiates between object and background) and the final object bounding box (with a better fit compared with the RPN outputs). Meanwhile, the mask prediction branch outputs a binary mask for each candidate object bounding box.
It should be noted that a mapping operation is applied to the candidate object bounding boxes before the bounding box recognition and mask prediction branches are applied. A mapping operation is required in order to map the positive object scores and the object bounding boxes obtained with the RPN onto the corresponding feature maps used as input to the RPN. Otherwise, it would not be possible to apply the bounding box recognition and mask prediction branches with only the object scores and the values of the centroid position, width, and height of the object bounding boxes. The standard mapping operation is the Region of Interest Pool (RoIPool), which rounds down the coordinates of the four corners of each object bounding box. The boundaries of the ‘anchor’ boxes fit well into the feature maps used as input to the RPN. However, due to the refinement process, the boundary of an object bounding box may divide the values of the corresponding feature map when mapping, instead of bounding them as the ‘anchor’ boxes do. The rounding-down process (known as quantization) solves this problem and works well in object detection tasks (e.g., it is used in Faster R-CNN) [66]. However, it is not the ideal solution for semantic segmentation, because the per-value spatial correspondence of the object bounding box obtained after the RPN application is not faithfully preserved by the rounding-down process. Therefore, RoIPool is replaced by Region of Interest Align (RoIAlign) in Mask R-CNN. The main difference between RoIPool and RoIAlign is that RoIAlign does not apply the rounding-down process. Instead, RoIAlign uses bilinear interpolation to compute the exact values of the feature maps in the corresponding object bounding boxes. Specifically, bilinear interpolation is applied at four regularly sampled locations within each of the 49 equally divided parts of each object bounding box (each object bounding box is divided into 7 × 7 parts), and the different results are aggregated using a max or average pooling layer. In this work, bilinear interpolation is only computed at a single point located in the center of each divided part of the object bounding box, which is nearly as effective as using four regular sample points.
Figure 7 shows a RoIAlign example from [83].
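The single-point variant used in this work can be sketched as follows (an illustrative NumPy version for one box on one feature-map channel; the (x1, y1, x2, y2) box format is an assumption of the example):

```python
import numpy as np

def bilinear_at(fmap, y, x):
    """Exact feature value at a continuous (y, x) position, no rounding."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - wy) * (1 - wx) + fmap[y0, x1] * (1 - wy) * wx +
            fmap[y1, x0] * wy * (1 - wx) + fmap[y1, x1] * wy * wx)

def roi_align_single(fmap, box, grid=7):
    """Divide the box into grid x grid parts and sample the center of each part."""
    x1, y1, x2, y2 = box  # box corners in feature-map coordinates (floats)
    bin_h, bin_w = (y2 - y1) / grid, (x2 - x1) / grid
    out = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            cy = y1 + (i + 0.5) * bin_h  # center of the (i, j) part
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear_at(fmap, cy, cx)
    return out

fmap = np.random.rand(32, 32)  # one channel of a feature map
print(roi_align_single(fmap, (3.2, 4.7, 17.9, 21.3)).shape)  # (7, 7)
```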
Focusing on the architecture of the bounding box recognition and mask prediction branches, the first branch consists of two FCLs with 1024 outputs each. Since the depth of all the outputs after the FPN application is equal to 256, and each object bounding box is divided into 7 × 7 parts during the RoIAlign process, all the candidate object bounding boxes have the same dimensions (7 × 7 × 256). It should be noted that a ReLU layer is applied after each FCL, and an additional third FCL is applied, which is actually two FCLs in parallel predicting the object classes and the final object bounding boxes, respectively. In this way, the number of outputs of the first parallel FCL (FCL3₁) is equal to the number of object classes according to the assigned task, and that of the other FCL (FCL3₂) is equal to the multiplication of 4 (position of the centroid (x,y), width, and height of an object bounding box) by the number of object classes. Then, a softmax activation function is applied to FCL3₁, and a regression method is applied to FCL3₂, to obtain the final object class and the final object bounding box of each candidate object bounding box, respectively [79].
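In code, this branch can be sketched as follows (a PyTorch illustration; the two object classes, defect and background, and the layer names are assumptions of the example):

```python
import torch
import torch.nn as nn

class BoxRecognitionHead(nn.Module):
    """Two 1024-output FCLs, then two parallel FCLs for class scores and boxes."""
    def __init__(self, num_classes=2):  # e.g., defect + background (assumed)
        super().__init__()
        self.fc1 = nn.Linear(7 * 7 * 256, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.relu = nn.ReLU(inplace=True)
        self.fc3_class = nn.Linear(1024, num_classes)    # FCL3-1
        self.fc3_box = nn.Linear(1024, 4 * num_classes)  # FCL3-2

    def forward(self, roi):             # roi: (N, 256, 7, 7) from RoIAlign
        x = roi.flatten(start_dim=1)    # (N, 7*7*256)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        class_probs = self.fc3_class(x).softmax(dim=1)  # final object class
        box_deltas = self.fc3_box(x)    # regression targets per class
        return class_probs, box_deltas

head = BoxRecognitionHead()
probs, deltas = head(torch.randn(8, 256, 7, 7))
print(probs.shape, deltas.shape)  # torch.Size([8, 2]) torch.Size([8, 8])
```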
As for the mask prediction branch, an FCN is applied to each final object bounding box output by the bounding box recognition branch. The RoIAlign process is again applied so that the final object bounding boxes contain the corresponding feature maps used as input to the RPN; in this case, each final object bounding box is divided into 14 × 14 parts during the RoIAlign process. Then, the first convolutional layer of the FCN takes as input a 14 × 14 × 256 final object bounding box, using a 3 × 3 × 256 convolutional kernel followed by a ReLU layer. The same convolutional kernel and activation function layer are used from the second to the fourth convolutional layer. Subsequently, a transpose convolutional layer is applied, which performs an inverse convolution operation [62]. In this last layer, a 2 × 2 × 256 convolutional kernel with a stride equal to 2 and a ReLU layer are used. With that stride, the transpose convolutional layer doubles the dimensions of a final object bounding box (acting as a type of upsampling layer), instead of halving them as a convolutional layer would. Finally, a convolutional layer with a 1 × 1 × (number of object classes according to the assigned task) convolutional kernel is applied, followed by a sigmoid layer, which is another type of activation function [79], obtaining a binary mask for each class; the binary mask corresponding to the final object class coming from the bounding box recognition branch is selected as the final binary mask [62].
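The corresponding sketch of the mask prediction branch (same caveats as the previous sketch: PyTorch, assumed number of classes, illustrative names):

```python
import torch
import torch.nn as nn

class MaskPredictionHead(nn.Module):
    """Four 3x3 convolutions, a 2x2 stride-2 transpose convolution (doubling
    the spatial size), and a final 1x1 convolution with a sigmoid per class."""
    def __init__(self, num_classes=2):  # assumed, as in the box head sketch
        super().__init__()
        self.convs = nn.Sequential(*[
            layer for _ in range(4)
            for layer in (nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
        ])
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)
        self.mask = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, roi):                 # roi: (N, 256, 14, 14)
        x = self.convs(roi)                 # (N, 256, 14, 14)
        x = self.relu(self.deconv(x))       # (N, 256, 28, 28)
        return torch.sigmoid(self.mask(x))  # one 28 x 28 mask per class

head = MaskPredictionHead()
print(head(torch.randn(8, 256, 14, 14)).shape)  # torch.Size([8, 2, 28, 28])
```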
Figure 8 represents the architecture of Mask R-CNN in its simplified version.
Moreover, the loss function of Mask R-CNN consists of five different terms:
RPN_class_loss: The performance of the RPN in separating objects from the background.
RPN_bounding_box_loss: The performance of the RPN in localizing the objects.
MRCNN_class_loss: The performance of Mask R-CNN in classifying each object class.
MRCNN_bounding_box_loss: The performance of Mask R-CNN in localizing the objects.
MRCNN_mask_loss: The performance of Mask R-CNN in segmenting the objects.
The smaller the loss values of these five terms, the better the performance of Mask R-CNN, as indicated in the equation below:
Total_loss = RPN_class_loss + RPN_bounding_box_loss + MRCNN_class_loss + MRCNN_bounding_box_loss + MRCNN_mask_loss
Readers should note that Mask R-CNN follows a supervised learning strategy. Hence, the labelling of each defect area of the marqueteries in the thermal images is performed with the VGG Image Annotator (VIA) software [84] to obtain the ground truth of the object classes, object bounding boxes, and object masks, and thus to compute the total loss value at each epoch of the Mask R-CNN learning process.
5. Conclusions
This work introduces DL into the thermographic monitoring of cultural heritage for the automatic detection of defect positions and the automatic segmentation of defect areas, regardless of the defect type and depth. For that, two different types of marquetry have been used as heritage elements, the first with one surface defect (simulating the resin pocket effect) and three subsurface defects (simulating the honeycombing effect), and the second with five surface defects (representing different missing tesserae) and two subsurface defects (simulating detachments). As for the monitoring, two different experiments have been applied to each marquetry, one heating the marqueteries with a heat pulse (pulsed thermography) and the other with a sustained heat step (step-heating thermography). The thermal images belonging to the transient cooling period have been selected as the thermal images of interest in each experiment, because the thermal footprints of the defects are more present there than in the thermal images belonging to the heating period and the stationary cooling period.
As an added value, the latest state-of-the-art DL model for object detection and segmentation tasks in images has been selected in this work: Mask R-CNN. This DL model makes it possible to detect the positions of the different defects by bounding boxes (object detection) and to segment the areas of the defects by binary masks (semantic segmentation), with a certain probability that they are really defects and not background. By combining the bounding boxes and the binary masks obtained, the differentiation between the different defects as different instances is also achieved (instance segmentation). Moreover, in addition to the typical optimization methods used to improve the performance of a DL model during its learning process (appropriate hyper-parameter values, data augmentation, and transfer learning), two automatic thermal image pre-processing algorithms based on thermal fundamentals have also been applied to the thermal image sequences used for the learning process (input dataset) of Mask R-CNN. Both pre-processing algorithms improve the contrast between defective and sound areas: the first by compensating for the non-uniform background heating and cooling, and the second by highlighting the segmented areas obtained in the outputs of the previous pre-processing algorithm that represent the total or partial area of the defects with the most outstanding thermal footprints. The purpose is to demonstrate how the performance of DL models applied to thermographic data can be improved by combining them with thermal fundamentals.
The results obtained from the learning process of Mask R-CNN were promising in two aspects:
In automating the interpretation of the acquired thermal images with a high percentage of success in the detection and segmentation of defects. With the state-of-the-art IRT data processing algorithms, the identification of the deepest defects of the marqueteries is not possible, neither with non-automatic algorithms (such as PCT, SPCT, and TSR) nor with self-developed automatic algorithms (without using DL).
In the reduction of: (i) learning time (epochs), (ii) object detection, semantic segmentation, and instance segmentation errors (loss), and (iii) learning instability (learning curve), when using the thermal images resulting from the application of the proposed thermal image pre-processing algorithms instead of the corresponding raw thermal images.
In summary, this work takes the first step in the use of DL models for the inspection of cultural heritage with thermographic data, using one of the best DL models currently available and even improving its performance with algorithms that exploit the thermal information contained in the thermal images. The robustness of the DL model trained in this paper will probably be acceptable when applied to other types of marqueteries and other heritage objects with the same and/or different defects, for three reasons: (i) two different experiments have been performed on each marquetry (pulsed thermography and step-heating thermography), (ii) both marqueteries have materials commonly used on decorative surfaces of heritage objects, and (iii) several types of defects with different sizes are located at different positions and depths. In any case, future research will continue with the joint application of DL and IRT data pre-processing algorithms in the thermographic monitoring of cultural heritage. In particular, future research will analyze the same and/or other defects in more types of marqueteries and other artistic objects in order to: (i) classify between different types of defects (and not only between defect and background), and (ii) train a DL model with a greater variety of defects and objects towards more robust learning. Finally, the automatic estimation of the defect depth is another point to be considered.