3.1. Datasets
The dataset employed for training and evaluating the CNNs in this study comprises two components. First, the publicly available image set RIM-ONE DL [17] is used, comprising 172 images of glaucomatous eyes and 313 images of healthy eyes. Additionally, eye fundus images provided by the medical specialists of our research team, affiliated with the Hospital Universitario de Canarias, are included. This supplementary set consists of 191 glaucoma images and 63 images of healthy eyes. These images, which are not publicly available, were acquired with a Topcon TRC-NW8 multifunctional non-mydriatic retinograph. Patients with a diagnosis of primary or secondary open-angle glaucoma and untreated intraocular pressure greater than 21 mmHg were included in the study. The diagnosis of glaucoma was based not only on the observation of the eye fundus images but also on the presence of reproducible defects in white-on-white perimetry and/or on morphological criteria derived from a spectral-domain optical coherence tomograph (SD-OCT Spectralis with the Glaucoma Premium and Posterior Pole modules). Both eyes were included if they met the inclusion criteria. Subjects with concomitant ocular pathology other than glaucoma, visual acuity lower than 20/40, a refractive error greater than five diopters of spherical equivalent or three diopters of astigmatism, or a rate of false positives, false negatives, or fixation errors equal to or greater than 25% in the visual field were excluded from the study. Patients with hypoplastic or oblique optic nerves were also excluded. Considering the two sets, a total of 739 images have been used: 363 of glaucomatous eyes and 376 of healthy eyes. All these images have been annotated by two experts and include manual disc and cup segmentations.
For a more thorough validation of the CNNs, we have used other publicly available databases, namely REFUGE, DRISHTI-GS1, and G1020. The REFUGE challenge database [18] is composed of 1200 retinal images, of which 10% (120 samples) correspond to glaucomatous subjects, including primary open-angle glaucoma and normal-tension glaucoma. The dataset also contains the ground-truth segmentation of the disc and cup. The DRISHTI-GS1 dataset [19] contains 101 fundus images with different resolutions and ground-truth labels for the optic disc and cup. The G1020 dataset [20] is a large dataset of retinal fundus images for glaucoma classification. It consists of 1020 high-resolution color fundus images and provides ground-truth annotations for glaucoma diagnosis; optic disc and optic cup segmentation; vertical CDR; neuroretinal rim size in the inferior, superior, nasal, and temporal quadrants; and the location of the optic disc bounding box. Importantly, the authors of the G1020 dataset acknowledge that it is a very difficult set to classify automatically, because it represents fundus imaging in routine clinical practice and does not impose strict inclusion criteria on the captured images.
3.2. Deep Learning Models
In this section, we explain how the CNN models used in this work were trained. To do so, we describe how the training and test sets were obtained from the available data, the training strategy, and the parameters selected for each CNN. Finally, we show the performance attained.
To train the CNNs for this study, the initial set of 739 images was randomly divided into training and test sets in an 80/20 proportion. The training set comprised 290 retinographies of glaucomatous eyes and 301 of healthy eyes, while the test set included 73 retinographies of glaucomatous eyes and 75 of healthy eyes. Additionally, the training set was further partitioned into five distinct training and validation subsets, following a 5-fold cross-validation scheme.
We selected four prominent convolutional neural network architectures, namely VGG19 [21], ResNet50 [22], InceptionV3 [23], and Xception [3], based on their widespread adoption in the domain of glaucoma diagnosis and in other medical fields. These architectures have also been evaluated in similar studies, such as the works analyzed in Section 2.
All the specified neural network architectures are available in the Keras module of the TensorFlow v2 package [24]. To tailor these models to our specific problem, we modified the top layer of each network. The adaptation involved introducing a DropOut layer, followed by a Flatten layer, and then a Dense layer with 128 neurons and ReLU activation. Subsequently, another DropOut layer was added, followed by a final Dense layer with two outputs using the SoftMax activation function.
For VGG19, the DropOut rate was set to 0.5, while for InceptionV3, ResNet50, and Xception, it was set to 0.2. Additionally, in these three networks, the BatchNormalization layers were kept in inference mode to prevent their non-trainable weights (the moving statistics) from being updated during the training phase.
Concerning the size of the Input layer, it was configured as 224 × 224 × 3 for ResNet50 and VGG19, and 299 × 299 × 3 for InceptionV3 and Xception.
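As a minimal sketch (our own illustration, not the authors' code), the adapted architecture can be built in Keras as follows; the build_model helper and its arguments are ours, while the layer sequence, DropOut rates, and input sizes follow the description above:

import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(dropout_rate=0.2, input_size=224):
    # Pre-trained base without its original classification head.
    # ResNet50 is shown here; VGG19 would take dropout_rate=0.5, and
    # InceptionV3/Xception would take input_size=299.
    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet",
        input_shape=(input_size, input_size, 3))
    base.trainable = False  # frozen during the first training phase

    inputs = layers.Input(shape=(input_size, input_size, 3))
    # Calling the base with training=False keeps its BatchNormalization
    # layers in inference mode, as described above.
    x = base(inputs, training=False)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    return models.Model(inputs, outputs)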
Our training strategy was the same for all the models. First, starting from the pre-trained ImageNet weights, the base model was frozen and we trained the new top layers for 200 epochs using an Adam optimizer, with a learning rate of 1 × 10⁻⁶ and categorical cross-entropy as the loss function. Second, we unfroze the base model, except for the BatchNormalization layers, and trained the entire model end-to-end for 250 epochs using the same optimizer, with a learning rate of 1 × 10⁻⁵, and the same loss function as before. In all cases, a batch size of 8 was used.
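Continuing the sketch above, the two training phases might look as follows; train_ds and val_ds are assumed to be tf.data pipelines that already batch the samples in groups of 8:

import tensorflow as tf

model = build_model()  # from the sketch above

# Phase 1: train only the new top layers on the frozen base.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=200)

# Phase 2: unfreeze the base and fine-tune end-to-end. BatchNormalization
# layers stay in inference mode because the base was called with
# training=False when the model was built.
model.layers[1].trainable = True  # the base model is the layer after Input
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=250)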
Regarding the pre-processing step, we applied the pre-processing function included in Keras for each network. To mitigate over-fitting, we employed data augmentation on the input samples. This involved random contrast adjustments, random brightness variations, random horizontal flipping, random rotations with nearest fill mode, random translations both horizontally and vertically with nearest fill mode, and random zooming, preserving the aspect ratio, with nearest fill mode.
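A sketch of such a pipeline using the Keras preprocessing layers is shown below; since the exact augmentation factors are not given above, the values used here are illustrative placeholders only:

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation pipeline; all factor values are placeholders.
augmentation = tf.keras.Sequential([
    layers.RandomContrast(0.2),
    layers.RandomBrightness(0.2),
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05, fill_mode="nearest"),
    layers.RandomTranslation(0.1, 0.1, fill_mode="nearest"),
    # With width_factor left as None, RandomZoom preserves the aspect ratio.
    layers.RandomZoom(0.1, fill_mode="nearest"),
])

# Typical usage: apply only to the training pipeline.
# train_ds = train_ds.map(lambda x, y: (augmentation(x, training=True), y))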
The final weights for each model were chosen from the epoch that maximized the validation accuracy on average among the five folds. This resulted in five different models per network architecture, totaling 20 models.
The trained CNN models were tested with the independent set mentioned previously, which consists of 75 samples from healthy subjects and 73 from glaucoma subjects. The results achieved per network architecture and fold are included in Table 1, highlighting those corresponding to the best performance per architecture in terms of balanced accuracy, which is the arithmetic mean of sensitivity and specificity [25]. Analogously, Table 2, Table 3 and Table 4 show the performance obtained by evaluating the trained models on the REFUGE, DRISHTI-GS1, and G1020 datasets, respectively. It is important to remark that these additional image sets have only been used for testing.
Although an exhaustive analysis of the performance of the trained CNN models is not the subject of this article, it can be seen that the models classify the images of all the considered sets reasonably well, except those of the G1020 dataset. The loss of performance on this set is striking and confirms what the authors of the dataset indicated in their publication, as commented in Section 2.
3.3. Saliency Methods
This paper evaluates a set of attribution/saliency methods for the problem of glaucoma diagnosis with CNNs trained on eye fundus images. These techniques aim to identify which features of the input image are most influential in the final prediction of the model. To do so, they generate attribution/saliency maps, in which the pixels of the image that seem most relevant to the decision are highlighted. In this study, we obtain the attribution for the model's inferred class, whether it is correct or not. Therefore, in our interpretation of an attribution map, the goal is to find the features that are relevant to each particular model in the construction of its decision.
In [26], a comprehensive review of the existing techniques for interpreting the internal behavior of machine learning systems, including CNNs, is presented. That work distinguishes two types of interpretability: global and local. Global interpretability is attained when the user is able to understand how the model works at a global level by inspecting its internal structures and parameters. Local interpretability, on the other hand, analyzes an individual prediction of the model and attempts to explain which input features led to that particular decision.
The attribution methods analyzed in this paper fall within the second category, as their objective is to provide an interpretation of the model's decision for a particular sample within the domain, in our case a specific fundus image. However, our research strategy also aims to achieve a global interpretation by studying the attribution results across a set of samples. This approach is intended to unveil insights into general aspects of the behavior of the trained models. For this purpose, nine different attribution techniques, outlined in Table 5, were examined. The selection of these saliency methods was driven by their widespread usage in analogous studies, as detailed in Section 2, where their effectiveness in elucidating model decisions and their contribution to the interpretability of deep learning models in medical image analysis have been extensively evaluated.
The Grad, GBack, SGrad, SGrad2, and VGrad methods rely on the gradient calculation to obtain a sensitivity measure, i.e., how much a variation in the input contributes to a variation in the output. On the other hand, the IGrad, Occl, GCam, and SCam methods analyze how much of the CNN output can be attributed to the contribution of a feature. Therefore, the correct interpretation of the attribution maps must take this fact into account. In what follows, we briefly describe the basics of each of these methods.
The Gradient (Grad) method [7,27] is based on the calculation of the gradient, using the backpropagation algorithm together with automatic differentiation, to estimate the sensitivity of the model to each input. It is a fast method with a simple interpretation, but it has some important drawbacks. Due to the nonlinearity of the models, there is a strong dependence between the sensitivity recorded for one feature and the values of the other features, which makes it a very noisy method. Moreover, when the model is saturated for a subset of its features, the gradient remains zero even for large variations of the inputs [28].
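As an illustration, a minimal TensorFlow sketch of the Grad method (our own code; the channel reduction at the end is an assumed design choice):

import tensorflow as tf

def gradient_saliency(model, image):
    # Gradient of the inferred class score with respect to the input pixels.
    x = tf.convert_to_tensor(image[None, ...])  # add a batch dimension
    with tf.GradientTape() as tape:
        tape.watch(x)
        preds = model(x, training=False)
        class_idx = tf.argmax(preds[0])  # the model's inferred class
        score = preds[0, class_idx]
    grads = tape.gradient(score, x)[0]
    # Collapse the color channels, e.g., by the maximum absolute value.
    return tf.reduce_max(tf.abs(grads), axis=-1).numpy()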
The Guided Backpropagation (GBack) method [29] is another gradient-based visualization technique that only allows the propagation of positive gradients through the ReLU activation function, setting the negative gradient values to zero. The elimination of these negative values reduces the noisy appearance of the attribution map.
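The core of the method can be sketched with a custom gradient in TensorFlow (our own sketch; substituting this function for every ReLU of a trained model is assumed to be handled elsewhere):

import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    def grad(dy):
        # Propagate a gradient only where the forward activation was
        # positive (standard ReLU) AND the incoming gradient is positive.
        return tf.cast(x > 0, dy.dtype) * tf.cast(dy > 0, dy.dtype) * dy
    return tf.nn.relu(x), grad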
The SmoothGrad (SGrad) method [30] attempts to reduce the noise of the saliency maps by perturbing the original image with additive noise before calculating the gradient. This is repeated N times and the resulting gradients are averaged, reducing the apparent noise in the attribution map while still emphasizing the important regions. Variants of SmoothGrad are SmoothGrad Squared (SGrad2), in which the gradient is squared before averaging, and VarGrad (VGrad) [31], where the attribution value is obtained from the variance of the gradients.
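Building on the gradient_saliency sketch above, the three variants could be approximated as follows; the number of samples and the noise level are illustrative, and here the averaging is applied to the per-sample saliency maps rather than to the raw gradients:

import numpy as np

def smoothed_saliency(model, image, n_samples=25, sigma=0.15, mode="mean"):
    maps = []
    for _ in range(n_samples):
        noise = np.random.normal(0.0, sigma, image.shape).astype(image.dtype)
        maps.append(gradient_saliency(model, image + noise))
    maps = np.stack(maps)
    if mode == "mean":          # SGrad: average the gradients
        return maps.mean(axis=0)
    if mode == "squared":       # SGrad2: square before averaging
        return (maps ** 2).mean(axis=0)
    return maps.var(axis=0)     # VGrad: variance of the gradients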
The Integrated Gradients method (IGrad) [32] integrates the gradient attribution map along a path between a baseline image and the input image, and the result is multiplied by the difference between the input image and the baseline. The baseline image can be a black image, an all-white image, or a random image. It is similar to the SmoothGrad method in that it works from a set of perturbed images; however, in SmoothGrad the images are perturbed by adding noise, whereas in IGrad the perturbation is a linear interpolation between the baseline and the original image.
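Following [32], the attribution assigned to pixel $i$ of the input image $I$, given a baseline $I'$, can be written as

$$\mathrm{IG}_i(I) = (I_i - I'_i) \int_0^1 \frac{\partial f\big(I' + \alpha (I - I')\big)}{\partial I_i}\, d\alpha,$$

which in practice is approximated by a Riemann sum of the gradients evaluated at a small number of interpolation steps between $I'$ and $I$.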
The Occlusion method (Occl) [33] attempts to discover which features of an input image are the most influential in the network's decision. It starts from the assumption that the contribution of a feature can be determined by measuring how the prediction changes when that feature is occluded. The key question is: which parts of the input image, if the model cannot see them, would change the final prediction the most? Its simplest implementation replaces a region of the input image with a uniform square of a given color, but noise or specific textures can also be used as replacements. The region is slid over the entire image, and the differences in the prediction values for the reference class are taken as the estimated saliencies.
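A minimal sketch of this sliding-window procedure (our own code; patch size, stride, and fill value are illustrative choices):

import numpy as np

def occlusion_map(model, image, target_class, patch=32, stride=16, fill=0.5):
    h, w, _ = image.shape
    base_score = float(model(image[None, ...], training=False)[0, target_class])
    saliency = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch, :] = fill
            score = float(model(occluded[None, ...], training=False)[0, target_class])
            # A large drop in the score marks an important region.
            saliency[top:top + patch, left:left + patch] += base_score - score
            counts[top:top + patch, left:left + patch] += 1
    return saliency / np.maximum(counts, 1)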
Both occlusion-based and gradient-backpropagation-based interpretation methods do not explicitly consider the intermediate layers of the network, which may contain important information for its interpretability. A third type of method investigates what happens in the hidden layers of the network to determine which features of the input are the most relevant. This is the case of GradCAM (GCam) [34] and ScoreCAM (SCam) [35], which generate saliency maps by combining the feature maps of intermediate layers. GCam uses the gradient information flowing into the last convolutional layer to assign importance values to each neuron for a given output decision. These importance values, or weights, are applied to compute the weighted sum of the feature maps generated by the neurons of this last convolutional layer, thus obtaining the saliency map. SCam applies an occlusion technique to the input image, using the feature activation maps of the last convolutional layer of the network as masks (after resizing these feature maps to the size of the original image). In this way, a score is obtained for each feature map, indicating its importance for the final predicted class. As in GCam, these scores are used as weights to compute the weighted sum of the feature maps and thus obtain the final saliency map. The authors of SCam claim that this method is better than GCam because, by not relying on gradients, it avoids irrelevant noise and generates cleaner and more meaningful explanations.
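For illustration, GCam can be sketched as follows (our own code; last_conv_name depends on the architecture, e.g. "conv5_block3_out" for the Keras ResNet50):

import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index):
    # Auxiliary model exposing the last convolutional feature maps.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...], training=False)
        score = preds[0, class_index]
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # per-channel weights
    cam = tf.einsum("bhwc,bc->bhw", conv_out, weights)  # weighted sum of maps
    return tf.nn.relu(cam)[0].numpy()                   # keep positive evidence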
Figure 2 and Figure 3 show examples of the saliency maps obtained with the different attribution methods considered. Figure 2 groups the methods based on gradient backpropagation, while Figure 3 shows the saliency maps of the remaining methods. The results of the CAM methods are shown at the resolution at which they were actually calculated, instead of being interpolated as is usually done, to avoid introducing distortions in the results.
3.4. Evaluation Methodology
In order to discover the most relevant information contained in the saliency maps, the evaluation methodology followed in this work is partially inspired by the one used in [12]. In that study, the authors calculate for each saliency map what they call the "localization utility", a measure of the coincidence between the areas of the original image marked as relevant in the saliency map and those examined by the medical specialists in their diagnosis. Therefore, for a saliency method to be considered useful, the maximum values of the saliency map must be located in the image area indicated by the experts in the ground truth.
The study performed in [12] was carried out with radiology datasets, in which the image areas of diagnostic interest are well determined and localized. However, for specialists diagnosing glaucoma, it is not so easy to precisely specify the most relevant regions in retinal images, due to the lack of a reliable diagnostic biomarker. The diagnosis of glaucoma is usually based on the joint analysis of the patient's clinical history and the results of various structural and functional tests. This is the main reason why, in this work, we have followed a different approach for the evaluation of the localization utility of the CNNs used.
As mentioned in the introduction, glaucoma is characterized by a progressive enlargement of the optic cup area, leading to a narrowing of the neuroretinal rim [2]. In glaucoma, the enlargement of the optic cup occurs in all directions, but some areas are generally affected earlier than others. Thus, in eyes with modest glaucomatous damage, rim loss is found mostly in the temporal inferior and temporal superior regions. In eyes with moderately advanced glaucomatous atrophy, the temporal region shows the most marked relative rim loss. In cases of highly advanced glaucoma, the least pronounced rim loss is typically observed in the nasal disc area, while the nasal inferior region tends to exhibit greater impairment than the nasal superior region. Examples of a healthy case and an advanced glaucomatous case can be seen in Figure 2a,m, respectively. In order to identify these zones of the image in a standard way, a set of sectors is defined on the optic disc [36], as shown in Figure 4. This sectoral identification allows us to relate the most relevant zones of the saliency maps to the areas of interest for medical specialists, which facilitates the calculation of the localization utility of the analyzed method. In addition, the sector analysis allows us to dispense with the pixel level and perform a higher-level study that is less noisy and more meaningful for the specialist. In [15], a discretization of the problem is also proposed, but through an arbitrary regular grid.
Formally, given an input image $I$ and a model $f$, a saliency or attribution method, denoted as $S_f$, can be defined as a function that assigns a relevance score $S_f(I, p)$ to each pixel $p$ of the image. Consequently, we transition from this conventional saliency map to a discretized counterpart, denoted as $\hat{S}_f$, which allocates relevance scores to the sectors associated with the image $I$. For a sector $s$, this is computed as follows:

$$\hat{S}_f(I, s) = \frac{\sum_{p} M_s(p)\, B(p)}{\sum_{p} B(p)} \qquad (1)$$

Here, $M_s$ represents the binary mask for the pixels encompassed within the sector being examined, and $B$ is the binary mask derived by applying a threshold to the saliency map. This threshold retains the pixels with values surpassing 75% of the maximum value, specifically:

$$B(p) = \begin{cases} 1 & \text{if } S_f(I, p) > 0.75 \cdot \max_{q} S_f(I, q) \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Therefore, the saliency score of each sector is the fraction of thresholded-mask pixels that fall within that sector. The choice of the threshold (0.75 of the maximum value of the saliency map) was made empirically. Neither the maximum value of the saliency map nor the average value per sector has been used as a threshold, because we found that these choices lead to noisy and unreliable results. In Figure 5, we exemplify the transformation of a saliency map into a discretized saliency map for the respective sectors. This conversion is performed following Equations (1) and (2).
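A direct implementation of this discretization might look as follows; sector_masks, a mapping from sector names to boolean masks of the same shape as the saliency map, is an assumed input:

import numpy as np

def sector_scores(saliency, sector_masks, threshold=0.75):
    # Equation (2): keep pixels above 75% of the map's maximum value.
    binary = saliency > threshold * saliency.max()
    total = binary.sum()
    # Equation (1): fraction of the surviving pixels inside each sector.
    return {name: float(np.logical_and(mask, binary).sum()) / max(total, 1)
            for name, mask in sector_masks.items()}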
In addition to discovering the most relevant sectors from the information contained in the saliency maps, we have carried out a complete comparative study between saliency maps in order to measure the degree of agreement between saliency methods for a specific model, between saliency methods for models with the same architecture, and between saliency methods for models with different architectures.
Spearman's rank correlation [37] served as the method for quantifying the level of agreement among the saliency maps derived from all the experiments conducted within this study. Furthermore, in order to have a baseline of what can be considered a reasonable correlation, following [38], we performed two additional tests for each of the 20 trained models: a model parameter randomization test and a label randomization test.
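The agreement between two saliency maps can be computed, for instance, with SciPy (a minimal sketch over the flattened pixel values):

from scipy.stats import spearmanr

def map_agreement(map_a, map_b):
    # Spearman's rank correlation between two saliency maps.
    rho, _ = spearmanr(map_a.ravel(), map_b.ravel())
    return rho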
The model parameter randomization test compares the output of a saliency method on a trained model with the output of the same method on an untrained network of the same architecture, with its parameters randomly initialized. In [38], it is pointed out that if the saliency method is truly useful and representative, it must depend on the learned parameters of the model, and therefore its output is expected to differ substantially between the two cases. A high correlation between the outputs may indicate that the saliency map is insensitive to the characteristics of the model, which is an undesirable outcome. This basic requirement can be used to rule out saliency methods.
In the label randomization test, the class labels are randomized. For this purpose, the models are retrained after changing the labels of 50% of the training samples. According to [38], a high correlation between the saliency maps of well- and poorly trained models is also undesirable and, if it occurs, may be an indication that the saliency method should be discarded.