1. Introduction
Diabetic retinopathy (DR) remains a leading cause of adult-onset blindness [1]. To prevent it, the American Diabetes Association and the American Academy of Ophthalmology recommend an eye fundus examination at least once a year for diabetic patients [2,3]. This means that, regardless of the type of diabetes mellitus (DM), all people with diabetes need to undergo regular annual retinal examinations for the early detection and treatment of DR.
Data from the European Union of Medical Specialists show that there are about 40,000 ophthalmologists in Europe. Furthermore, it is estimated that about 8.1% of the adult population in Europe has DM, and that by 2030 about 9.5% will have it [4]. This means that each ophthalmologist would currently have to examine 1500 diabetic patients per year and that the workload will increase in the years to come. We are therefore faced with a problem of numbers. It is estimated that only half of the diabetics in Europe are screened for DR, which may be partly related to the shortage of ophthalmologists [5].
Consequently, numerous health programs have been launched to screen as many diabetic patients as possible using different approaches, for example, telemedicine or the involvement of other health professionals such as family doctors (FDs). The literature suggests that these programs are a cost-effective intervention [6]. One such program is Retisalud, implemented in the Canary Islands, Spain. Started in 2002 and still active today, it is one of the longest-running of its kind in Europe [7]. Retisalud seeks to detect early signs of DR in patients with DM and without known ocular pathology. To do this, nurses and health technicians are trained to capture fundus images, which are then analyzed by FDs trained to interpret them for signs of DR. If the FD considers the retinography normal, the patient repeats the screening in 2 years, or in 1 year if any risk factor is present. If the FD considers the retinography pathological, or has any doubts, it is sent telematically to an ophthalmologist, who grades the image remotely and decides whether the patient should continue with the regular check-ups or be referred to the district ophthalmologist or to the hospital. This program has proved successful, and the proportion of cases of severe DR or worse has decreased over the years, from 14% of all cases in 2007 to 3% in 2015. In this context, a large number of fundus images classified according to the presence or absence of DR have been collected in routine clinical practice.
In addition to tele-ophthalmology programs, there is widespread enthusiasm for the application of artificial intelligence (AI) to automated DR screening. A number of AI technologies exist for this purpose: for example, IDx-DR [8], the EyeArt AI Eye Screening System (Eyenuk) [9], the Intelligent Retinal Imaging Systems (IRIS) [10], or the DeepMind prototype [11]. Broadly speaking, an AI technology for DR screening is software capable of automatically analyzing fundus images and classifying them according to the presence and degree of DR. By placing the screening process in the hands of a larger number of healthcare professionals, it is likely that the number of patients screened and, consequently, the early detection of DR will increase, holding promise for mass screening of the diabetic population. In addition, cost-effectiveness studies have been conducted showing that AI-based solutions for DR can lead to cost savings [12]. However, there is still a long way to go to achieve the widespread implementation of AI in ophthalmology. In the specific case of DR screening, the technology is relatively new, and most of the published works are based on studies with a small number of patients [13,14,15]. Therefore, despite the good results shown to date, the scientific evidence is not yet sufficient to overcome the reluctance to implement this type of technology or to analyze its possible drawbacks. Additionally, since AIs behave like a “black box”, there is a problem with the interpretability of the results; in other words, there is no justification as to why the AI arrives at a certain result. Interpretability of the AI classification is of particular interest in cases of mild retinopathy, as the signs at this level of severity are mostly very subtle and easy to miss, even for an expert. Given this uncertainty in interpretability, there are concerns about legal liability arising from incorrect AI analyses. In addition, regulatory frameworks for AI systems are still relatively immature, even though systems such as IDx-DR have already gained FDA approval [16]. An initial solution could be a system that shows the reasons behind the AI diagnosis, which could be used, at least at first, as a tool to assist the physician’s decision making.
In this work, we developed a state-of-the-art AI based on the Retisalud database. Its distinguishing feature is that it not only classifies an image as pathological or not but also detects and highlights the signs that allowed the DR to be identified. In addition, we retrospectively evaluated what would have happened if this AI for automatic DR detection had been available from the beginning of the project. In this way, we aim to strengthen the evidence on the use of an AI system for the detection of DR in a large number of patients in a routine, unaltered clinical setting.
3. Results
Table 1 shows the results according to the segments of the test set in Figure 2.
3.1. Segment A: Comparison with the Family Doctor
In the visits that were evaluated only by FDs and labeled as No DR (segment A, 149,996 cases), the AI agreed with the label in 90% of them. This 10% discrepancy merits further analysis. The 15,107 discrepant visits correspond to 11,824 different patients, of whom 6578 underwent at least one more screening at a later date. Of these, 2798 were eventually evaluated by an ophthalmologist, 1258 of them with a pathological diagnosis, which represents 7.9% of the total number of pathological patients detected by family doctors. In order to analyze whether these cases were early detections or a matter of AI specificity, random samples from the set of 1258 patients showing a discrepancy and with a definitive pathological diagnosis were re-evaluated by another ophthalmologist (re-evaluation A). The specific visit to re-evaluate was the first visit in which the AI detected a sign of DR (but the FDs classified it as healthy). Furthermore, a more general second re-evaluation (re-evaluation B) was performed by selecting random samples from the set of 11,824 patients with a discrepant classification between the FDs and the AI. Both re-evaluations were designed as a double-blind experiment. To guarantee a representative sample size, the number of patients to be re-evaluated was chosen following the formula:
n = \frac{N z^2 p (1 - p)}{m^2 (N - 1) + z^2 p (1 - p)}
where z is the quantile of the standard normal distribution for a confidence level of 95%, p is the success probability (i.e., the probability of having the correct evaluation for each segment), N is the size of each segment, and m is the margin of error. We assumed a margin of error of 5% and p = 0.5, thus assuming the worst case.
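As an illustration, the sketch below (in Python) computes this finite-population sample size for a 95% confidence level, taking z ≈ 1.96; the segment sizes in the usage comments are those mentioned in the text, and the exact re-evaluation counts reported in the study may differ slightly depending on how the formula was applied in practice.

```python
import math

def sample_size(N: int, p: float = 0.5, m: float = 0.05, z: float = 1.96) -> int:
    """Finite-population sample size for margin of error m, confidence
    quantile z, and assumed success probability p (p = 0.5 is the worst case)."""
    numerator = N * z ** 2 * p * (1 - p)
    denominator = m ** 2 * (N - 1) + z ** 2 * p * (1 - p)
    return math.ceil(numerator / denominator)

# Hypothetical usage with the candidate pools mentioned in the text:
# sample_size(1258)    # re-evaluation A pool
# sample_size(11824)   # re-evaluation B pool
```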
The re-evaluation analyzed each visit and classified it into one of three categories: No DR, DR, or non-diabetic pathology (NDP). Re-evaluation A comprised 222 samples: 25 (11%) of them were re-classified as No DR, 46 (21%) as NDP, and 151 (68%) as DR. Re-evaluation B was performed with 262 samples: 80 (30%) of them were re-classified as No DR, 88 (34%) as NDP, and 94 (36%) as DR. These results suggest that the use of an AI from the beginning of the program would have allowed earlier treatment for a greater number of patients.
3.2. Segments B and C: Comparison with the Family Doctor
Of the visits that were evaluated by FDs as DR or doubtful (66,819 visits, segments B and C), 48,195 (67%) were eventually evaluated as healthy by ophthalmologists. This means that the ophthalmologists received this number of extra healthy cases to grade (22% of the total patients screened by Retisalud), which goes against the objectives of the project. A more detailed understanding of the AI performance in this regard requires further analysis, as simply subtracting the data from segments B and C does not reflect the actual performance. For example, a healthy patient labeled as DR together with a DR patient labeled as healthy would leave the aggregate figures unchanged when, in fact, both would be failures of the AI. Therefore, the following section analyzes the results obtained by the AI with respect to the gold standard (the ophthalmologist), thus allowing reliable conclusions to be drawn.
3.3. Segments D, E, and F: Comparison with the Ophthalmologist (Gold Standard)
Following the reasoning in the previous section, to ascertain the performance of the AI, Figure 4 shows the agreement between the ophthalmologists’ criteria (taken as ground truth) and the AI for segments D to F. Had the AI been used, the number of extra visits sent to the ophthalmologist that received a healthy diagnosis would have been reduced by 35,104. However, there would have been 4928 underdiagnosed patients, 443 of them with an evaluation of >MiDR. It is notable that most of the error resides in the cases with the MiDR label. Thus, the discrepancy between “No DR”/“MiDR” cases is analyzed in detail in the following paragraphs.
Segments D to F in Table 1 can be used to roughly estimate the specificity and sensitivity of the AI. In this sense, the specificity value for segment D was 0.73. However, as only the most difficult cases or those with non-diabetic pathologies are referred to the ophthalmologist, this value represents the specificity only in this specific subset of the population and not in the complete dataset. In relation to this issue, it is important to note that, following the Retisalud screening protocol, as all cases must go through the FDs first, an ophthalmologist would only label a case as No DR if the patient had been referred by the FDs as doubtful or pathological. In other words, the 48,195 visits labeled as No DR by an ophthalmologist in segment D represent the most difficult cases, since they confused the FDs during the first evaluation. In addition, having specifically asked the ophthalmologists in the program about this issue, there could be cases in which the ophthalmologist preferred to issue a diagnosis of MiDR only so that the patient would have another check-up the following year instead of waiting two years as dictated by Retisalud.
Regarding the sensitivity, it was 0.79 for the total number of pathological visits (with any degree of DR). However, this value differs considerably when the subgroups are analyzed separately: for the visits with a diagnosis of MiDR alone, the sensitivity was 0.7, while for the cases of >MiDR, it was 0.95.
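For reference, the following sketch (Python, for illustration only) shows how these segment-level figures can be derived from the counts quoted in the text; the pathological totals for the MiDR and >MiDR subgroups are represented by placeholder variables, since only the misclassified counts are given here and the full totals come from Table 1.

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    # Proportion of visits labeled healthy by the ophthalmologist
    # that the AI also labeled as healthy.
    return true_negatives / (true_negatives + false_positives)

def sensitivity(true_positives: int, false_negatives: int) -> float:
    # Proportion of pathological visits that the AI flagged as pathological.
    return true_positives / (true_positives + false_negatives)

# Segment D: 48,195 visits labeled "No DR" by the ophthalmologist,
# of which the AI labeled 13,091 as DR.
print(round(specificity(48_195 - 13_091, 13_091), 2))   # -> 0.73

# Segments E and F would use the corresponding Table 1 totals, e.g.:
# sensitivity(total_midr - 4_485, 4_485)     # MiDR subgroup (reported as 0.7)
# sensitivity(total_gt_midr - 443, 443)      # >MiDR subgroup (reported as 0.95)
```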
Based on the above data, the AI errors occurred mainly in the “No DR”/“MiDR” decision. In order to go deeper into this aspect and find out whether this is due to an intrinsic difficulty of this type of case or to a real error of the AI, random visits from the misclassified visits of segments D and E of Table 1 were selected to be re-evaluated by another ophthalmologist following a double-blind process. The number of samples to be re-evaluated was chosen following the equation in Section 3.1. From the 13,091 cases misclassified by the AI in segment D (“DR” label by the AI and “No DR” label by the ophthalmologists), 266 were re-evaluated. Of these, 88 (33%) were re-evaluated as DR, 30 (11%) as NDP, and 148 (56%) as No DR. From the 4485 cases misclassified by the AI in segment E (“No DR” label by the AI and “MiDR” label by an ophthalmologist), 256 were re-evaluated. Of these, 177 (69%) were re-evaluated by a new ophthalmologist as DR, 46 (18%) as No DR, and 33 (13%) as NDP.
3.4. Results on Interpretability
The method explored in Section 2.4 was applied to a series of images to verify its performance. Some examples are shown in Figure 5.
This figure also shows an automatic detection of the areas of interest based on the activation maps (third column). It should be noted that, despite the visually positive results, there is a major drawback when analyzing activation maps automatically, which makes it very difficult to detect or segment all the signs of diabetic retinopathy. Specifically, a map of the “importance” of each pixel for the classification is obtained, but this “importance” value is not bounded to any specific range. Therefore, the value a pixel must reach to be considered a sign of DR depends on the whole range of values obtained for each individual image. This lack of a fixed reference is the main difficulty, as it makes it necessary to use an additional algorithm to segment or cluster the resulting activation image in order to automatically detect areas of interest. The algorithm used in Figure 5 is a contour detection algorithm applied to areas whose pixels have an “importance” value of at least 75% of the image range. It can be seen that, although it detects areas of interest in all images, it fails to capture all signs of retinopathy, despite the fact that some of them are qualitatively visible in the activation image.
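As an illustration of this post-processing step, a minimal sketch is shown below, assuming OpenCV and NumPy (the text does not specify the libraries used): it thresholds an activation map at 75% of its per-image range and extracts the contours of the resulting areas.

```python
import cv2
import numpy as np

def regions_of_interest(activation_map: np.ndarray, fraction: float = 0.75):
    """Keep pixels whose 'importance' reaches at least `fraction` of the
    per-image activation range and return the contours of those areas."""
    a_min, a_max = float(activation_map.min()), float(activation_map.max())
    threshold = a_min + fraction * (a_max - a_min)
    mask = (activation_map >= threshold).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours

# Hypothetical usage: overlay the detected contours on the fundus image.
# overlay = fundus_image.copy()
# cv2.drawContours(overlay, regions_of_interest(activation_map), -1, (0, 255, 0), 2)
```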
4. Discussion
Fundus images are particularly suitable for diagnosis using artificial intelligence because they have a certain homogeneity, a diagnosis can be made solely from them, and the signs of diabetic eye disease are structurally differentiable. This is why the development of this technology is being actively pursued to serve the growing population of diabetics.
What has struck us most about this work is that our earlier results, obtained on a much smaller validation sub-sample of the dataset, were markedly superior [23]. In that preliminary study, over a sample of about 30,000 images, all cases with severity greater than mild were detected, and the sensitivity and specificity values were above 90%. This indicates that we should be cautious when evaluating the performance of an AI for DR screening, as the results do not appear to extrapolate directly from small to large scale, even for a homogeneous population such as the one used in the present study.
With this in mind, we can say that the AI rivals diagnosis by FDs on some points. For example, the AI would have been capable of detecting 95% of the cases worse than mild DR while making 70% fewer errors than the FDs when classifying the healthy cases. Furthermore, as ascertained by the re-evaluations of discrepant cases, the AI detected DR signs in 1258 patients before they were detected by the FDs, representing 7.9% of all pathological cases found by FDs. From the ophthalmology specialist’s point of view, the success of this mass screening technology holds the promise of a significant decrease in the number of ophthalmologically healthy patients seen in hospitals and of earlier detection of those with pathological signs, which in turn should mean a reduction in waiting lists and workload.
However, the AI underdiagnosed some cases, as can be seen in segments E and F of Table 1 (it was not possible to know the exact number of patients underdiagnosed by the FDs for comparison purposes). To put these results in context, we can compare the performance of the AI developed for this work with that of other existing systems built for the same purpose. Nielsen et al. [24] analyzed eleven different studies, reporting sensitivities of 80.28% to 100.0%, specificities of 84.0% to 99.0%, and accuracies of 78.7% to 81%. Obtaining an exact and fair measure of the real specificity and sensitivity of our study is not possible due to its retrospective nature and the Retisalud protocol itself. Nevertheless, if we consider all the visits from segment A as correctly labeled, the AI yielded a specificity of 85.7% and a sensitivity of 79.1%. For the pathological visits (segments E and F), it achieved a sensitivity of 95.0% over the moderate or worse cases (segment F) and an overall accuracy of 85.2%. These results are consistent with those reported in the literature. Therefore, we believe that our results may be representative of what would happen with any other AI and that this study can anticipate some of the difficulties that will arise in the coming years during the implementation of these technologies. Particularly striking is that the algorithm would not have detected 443 patients with DR of a degree of severity greater than mild. It should be noted that half a million images were analyzed in this study, all of which were viewed by a physician; in a real implementation, where this would not necessarily be the case, such errors could easily be overlooked. Therefore, control strategies need to be designed to detect this small but important number of AI failures.
It is also worth noting that, as described in Section 3.3, we detected that some ophthalmologists changed their criteria because they felt that certain patients should be seen earlier than the protocol indicates. Naturally, this is an intrinsic part of medicine and can be beneficial to patients. However, it can be counterproductive in the context of AI development or follow-up, as these are ultimately mislabeled images. Therefore, when labeling while using an AI, tools should be provided for the correct labeling of these patients (e.g., specific labels for these cases).
Another important lesson is the discrepancy between healthy and mildly pathological patients. In view of the analyses shown, where a blind re-evaluation by a new ophthalmologist showed discrepancies with respect to the previous ophthalmological evaluation, the search for a definitive solution is not realistic. Indeed, one of the well-known problems of the Retisalud program is that too many patients who turn out to be healthy are still telematically sent to the ophthalmologist [7]. To better visualize the nature of these discrepancies, an example of a case of mild DR from our dataset is shown in Figure 6.
The only sign of diabetic retinopathy that justifies the classification of this sample as “mild DR” is the small microaneurysm on the left, marked in blue. Otherwise, this retinography would be considered non-pathological. Given the subtlety of the signs in these cases, an automatic classification system that correctly classifies this case as “mild DR” but provides no other information would be insufficient. For this reason, we believe that an AI that provides feedback on the key areas is of great importance for the success of AI-based screening systems.
Based on the results of this work and all the uncertainty surrounding the use of AI for medical diagnosis today, we wonder whether the current best solution might be the coexistence of evaluation by a non-specialist together with AI-based screening software. In this manner, when an image is not directly forwarded to the ophthalmologist, the AI could be used as a “hinting” system to help non-ophthalmologists in the assessment process. First, the most serious cases (>MiDR) are well identified by the AI, so they could be referred directly to the specialist without having to wait for an appointment with, for example, a family doctor. Second, having the AI criterion can give a non-ophthalmology specialist more confidence when classifying an image as healthy, especially if a system for signaling problematic areas is implemented.