Comparison of the Performance of Convolutional Neural Networks and Vision Transformer-Based Systems for Automated Glaucoma Detection with Eye Fundus Images
Abstract
1. Introduction
2. Materials and Methods
2.1. Description of the Datasets
- The public database Rim-ONE DL [43] is composed of 172 images of glaucoma and 313 of normal eyes;
- Fundus images were collected by the medical specialist on our research team at the Canary Islands University Hospital (Spain). This set comprises 191 images of glaucoma and 63 of normal eyes. These images, which are not public, were acquired with the Topcon TRC-NW8 multifunctional non-mydriatic retinograph. This study was conducted in accordance with the Declaration of Helsinki and approved by the Research Ethics Committee of the Canary Islands University Hospital (CHUC_2023_41, 27 April 2023). Confidentiality of personal data was guaranteed;
- The Drishti-GS1 database [45] consists of 101 images, of which 70 correspond to glaucomatous eyes and 31 to healthy eyes;
- The Papila dataset [46] consists of 421 images, of which 87 correspond to glaucomatous eyes and 334 to healthy eyes;
- The REFUGE challenge database [44] consists of 1200 retinal images, of which 10% (120 samples) correspond to glaucomatous subjects. A sketch of how these five sources might be indexed for the experiments is given after this list.
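The following Python snippet is a minimal sketch, not the authors' code, of how the five image sources could be gathered into a single labelled index prior to splitting. The directory layout, folder names, and file extension are assumptions made only for illustration.

```python
# Hypothetical sketch: building a labelled index of the five fundus-image sources.
# Folder layout <source>/<label>/<image>.png is an assumption, not the authors' setup.
from pathlib import Path
import pandas as pd

SOURCES = {
    "rim_one_dl": "data/RIM-ONE_DL",
    "chuc": "data/CHUC",          # private hospital set (not publicly available)
    "drishti_gs1": "data/Drishti-GS1",
    "papila": "data/Papila",
    "refuge": "data/REFUGE",
}

records = []
for source, root in SOURCES.items():
    for label in ("glaucoma", "normal"):
        for img in Path(root, label).glob("*.png"):
            records.append({"source": source, "label": label, "path": str(img)})

index = pd.DataFrame(records)
# Sanity check: per-source counts should match the figures given in Section 2.1.
print(index.groupby(["source", "label"]).size())
```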
2.2. Description of the Selected DL Architectures
- CNNs are designed to exploit local spatial correlations through the use of convolutional layers, which makes them effective at capturing local patterns and features in an image. In contrast, ViT-based systems employ an attention mechanism that allows them to capture global relationships between image patches, which makes these systems more suitable for handling long-range dependencies and capturing the global context of an image.
- CNNs require a larger number of parameters, which makes them computationally expensive. ViTs take advantage of attention mechanisms to capture the global context more efficiently with fewer parameters. This can be advantageous when computing resources or memory are limited.
- CNNs are intrinsically translation-invariant, thanks to the use of shared weights in the convolutional layers. This property is very useful for image classification. In contrast, ViTs do not have this inherent translation invariance due to their self-attention mechanism, although it can be incorporated through positional embeddings.
- Both CNNs and ViT-based systems have shown a high capacity for generalisation when trained on large datasets. However, ViT-based systems are more sensitive to the size of the training set, as they tend to require larger sets than CNNs.
- CNNs lack interpretability: it is difficult to understand how they reach their decisions, and they are often used as “black boxes”. In this respect, the attention mechanisms of ViT-based systems offer an advantage, as they make it possible to analyse which parts of an image are most relevant to a prediction. (A minimal fine-tuning sketch for both families is given after this list.)
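As a concrete illustration of how both families are adapted to the binary glaucoma/normal task, the sketch below loads an ImageNet-pretrained CNN (ResNet50) and the original ViT from TorchVision and replaces their classification heads with two-class outputs. This is only a minimal transfer-learning sketch, not the authors' exact training pipeline.

```python
# Minimal sketch (not the authors' training code): adapting a pretrained CNN and a
# pretrained ViT from torchvision to the two-class glaucoma/normal problem.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 2

# CNN branch: ResNet50 pretrained on ImageNet, classifier head replaced.
cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
cnn.fc = nn.Linear(cnn.fc.in_features, NUM_CLASSES)

# ViT branch: the original ViT-B/16, classification head replaced.
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
vit.heads.head = nn.Linear(vit.heads.head.in_features, NUM_CLASSES)

# Both models now accept 224x224 RGB fundus crops and output two logits.
dummy = torch.randn(1, 3, 224, 224)
print(cnn(dummy).shape, vit(dummy).shape)   # torch.Size([1, 2]) torch.Size([1, 2])
```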
2.2.1. Convolutional Neural Networks
- VGG19 contains 19 layers: 16 convolutional layers grouped into 5 blocks and 3 fully connected layers [18]. After each convolutional block, there is a pooling layer that reduces the spatial size of the feature maps, while successive blocks apply an increasing number of convolutional filters (Figure 2). The dimensions of the last three fully connected layers are 4096, 4096, and 1000, respectively, because VGG19 was designed to classify ImageNet images into 1000 categories;
- ResNet50 is a network that uses skip connections between layers to facilitate training and improve performance. It consists of 49 convolutional layers, two pooling layers, and a fully connected layer (Figure 3). The blocks that make up the network follow a bottleneck design that reduces the number of parameters and matrix multiplications [40];
- InceptionV3 is 48 layers deep, combining convolutional, pooling, and fully connected layers with concatenation filters (Figure 4). The network is organised into “spatial factorisation” modules, which apply convolutional filters of different sizes to their input to capture both general and local features. The concatenation filter merges the outputs of each spatial factorisation module into a single output, which becomes the input to the next module [41];
- Xception is a variant of the Inception architecture that replaces standard convolutions with depthwise separable convolutions. Separable convolutions split the convolution operation into two stages: a depthwise stage that convolves each input channel individually, followed by a pointwise (1 × 1) stage that linearly combines the resulting features [42]. The architecture of this network, which is 71 layers deep, is shown in Figure 5. (A parameter-count comparison of the two types of convolution is sketched after this list.)
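To make the parameter saving of separable convolutions concrete, the short sketch below compares a standard 3×3 convolution with its depthwise-plus-pointwise factorisation in PyTorch. The channel sizes are arbitrary and chosen only for illustration.

```python
# Illustrative sketch of the depthwise-separable idea behind Xception:
# a standard 3x3 convolution versus its depthwise + pointwise factorisation.
import torch.nn as nn

in_ch, out_ch = 128, 256

standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # per-channel (depthwise)
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # 1x1 linear combination (pointwise)
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(standard), n_params(separable))  # ~295k vs ~34k parameters
```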
2.2.2. Vision Transformers
2.2.3. Hybrid Systems
- DeiT is a ViT trained with a technique called “knowledge distillation” [27]. A larger convolutional model is used as a “teacher” to guide the training of the smaller model, which is the ViT part. For this purpose, a “distillation token” is introduced, and this token is trained to reproduce the label predicted by the teacher rather than the ground-truth class label (Figure 10);
- CaiT [26] is based on DeiT and also uses distillation training. In CaiT, a LayerScale (a learnable per-channel scaling) is applied to the output of each residual block, and new layers called “class-attention” layers are incorporated. These layers separate the computation of self-attention among patches from the attention devoted to classification, whose output is finally processed by a linear classifier (Figure 11). In our work, we have tested two versions of this model: the “CaiT_XXS36” version with 17.3 million parameters and the “CaiT_S24” version with 46.9 million parameters;
- CeiT [28] incorporates several modifications over the original ViT. Firstly, it uses a low-level feature extraction module (image-to-token) that is applied to the input image. In the encoder blocks, the Feed-Forward Network (FFN) is replaced with a Locally Enhanced Feed-Forward Network (LeFF), which promotes correlation between tokens through convolution operations (Figure 12). Finally, a new type of block called “Layer-wise Class Token Attention” (LCA) is added, which contains a Multi-head Self-Attention (MSA) layer and an FFN. Its mission is to compute attention only on the class token, thus reducing the computational cost (an instantiation sketch for these hybrid models is given after this list);
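A hedged sketch of how such hybrid models can be instantiated for the two-class problem is shown below, using the timm library. The model identifiers are timm names assumed to correspond to the DeiT and CaiT variants described above; the paper itself relies on the official repositories, and CeiT in particular is built from its own GitHub repository rather than timm.

```python
# Hedged sketch: instantiating hybrid ViT models with timm (not the authors' setup,
# which uses the official DeiT/CaiT/CeiT repositories). Model names are timm
# identifiers assumed to match the variants listed in Section 2.2.3.
import timm

def build(model_name: str, num_classes: int = 2):
    # pretrained ImageNet weights, classification head reset to two classes
    return timm.create_model(model_name, pretrained=True, num_classes=num_classes)

deit = build("deit_base_distilled_patch16_224")   # DeiT with distillation token
cait_xxs36 = build("cait_xxs36_224")              # ~17.3 M parameters
cait_s24 = build("cait_s24_224")                  # ~46.9 M parameters
# CeiT is not distributed through timm; it would be built from its official repository.
```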
2.2.4. Other Architectures Inspired by Vision Transformers
2.3. Design of the Experiments
- Training set: 290 retinographies of glaucoma and 301 of healthy eyes;
- Test set: 73 retinographies of glaucoma and 75 of healthy eyes. (The metrics used to evaluate the models on these sets are sketched after this list.)
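The per-fold figures reported in the results tables (sensitivity, specificity, accuracy, balanced accuracy, and F1 score) can be computed from a confusion matrix as in the following sketch. The label vectors are purely illustrative, not the study's data.

```python
# Sketch of the metrics reported in the results tables, computed from a confusion
# matrix. y_true / y_pred are hypothetical label vectors (1 = glaucoma, 0 = normal).
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)                       # recall on the glaucoma class
specificity = tn / (tn + fp)                       # recall on the normal class
accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced_accuracy = (sensitivity + specificity) / 2
f1 = f1_score(y_true, y_pred)

print(sensitivity, specificity, accuracy, balanced_accuracy, f1)
```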
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Correction Statement
References
1. Tham, Y.-C.; Li, X.; Wong, T.Y.; Quigley, H.A.; Aung, T.; Cheng, C.-Y. Global Prevalence of Glaucoma and Projections of Glaucoma Burden through 2040: A Systematic Review and Meta-Analysis. Ophthalmology 2014, 121, 2081–2090.
2. Weinreb, R.N.; Aung, T.; Medeiros, F.A. The Pathophysiology and Treatment of Glaucoma: A Review. JAMA 2014, 311, 1901–1911.
3. Bernardes, R.; Serranho, P.; Lobo, C. Digital Ocular Fundus Imaging: A Review. Ophthalmologica 2011, 226, 161–181.
4. Yamashita, R.; Nishio, M.; Do, R.K.G.; Togashi, K. Convolutional Neural Networks: An Overview and Application in Radiology. Insights Into Imaging 2018, 9, 611–629.
5. Torres, J. First Contact with Deep Learning: Practical Introduction with Keras; Independently Published: Barcelona, Spain, 2018. Available online: https://torres.ai/first-contact-deep-learning-practical-introduction-keras/ (accessed on 4 October 2023).
6. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks. Nature 2017, 542, 115–118.
7. Anthimopoulos, M.; Christodoulidis, S.; Ebner, L.; Christe, A.; Mougiakakou, S. Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neural Network. IEEE Trans. Med. Imaging 2016, 35, 1207–1216.
8. Acharya, U.R.; Oh, S.L.; Hagiwara, Y.; Tan, J.H.; Adam, M.; Gertych, A.; Tan, R.S. A Deep Convolutional Neural Network Model to Classify Heartbeats. Comput. Biol. Med. 2017, 89, 389–396.
9. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. Breast Cancer Histopathological Image Classification Using Convolutional Neural Networks. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24 July 2016; pp. 2560–2567.
10. Taormina, V.; Raso, G.; Gentile, V.; Abbene, L.; Buttacavoli, A.; Bonsignore, G.; Valenti, C.; Messina, P.; Scardina, G.A.; Cascio, D. Automated Stabilization, Enhancement and Capillaries Segmentation in Videocapillaroscopy. Sensors 2023, 23, 7674.
11. Wan, Z.; Wan, J.; Cheng, W.; Yu, J.; Yan, Y.; Tan, H.; Wu, J. A Wireless Sensor System for Diabetic Retinopathy Grading Using MobileViT-Plus and ResNet-Based Hybrid Deep Learning Framework. Appl. Sci. 2023, 13, 6569.
12. Gour, N.; Khanna, P. Multi-Class Multi-Label Ophthalmological Disease Detection Using Transfer Learning Based Convolutional Neural Network. Biomed. Signal Process. Control 2021, 66, 102329.
13. Simanjuntak, R.; Fu’adah, Y.; Magdalena, R.; Saidah, S.; Wiratama, A.; Ubaidah, I. Cataract Classification Based on Fundus Images Using Convolutional Neural Network. Int. J. Inform. Vis. 2022, 6, 33.
14. Velpula, V.K.; Sharma, L.D. Automatic Glaucoma Detection from Fundus Images Using Deep Convolutional Neural Networks and Exploring Networks Behaviour Using Visualization Techniques. SN Comput. Sci. 2023, 4, 487.
15. Joshi, S.; Partibane, B.; Hatamleh, W.A.; Tarazi, H.; Yadav, C.S.; Krah, D. Glaucoma Detection Using Image Processing and Supervised Learning for Classification. J. Healthc. Eng. 2022, 2022, 2988262.
16. Gómez-Valverde, J.J.; Antón, A.; Fatti, G.; Liefers, B.; Herranz, A.; Santos, A.; Sánchez, C.I.; Ledesma-Carbayo, M.J. Automatic Glaucoma Classification Using Color Fundus Images Based on Convolutional Neural Networks and Transfer Learning. Biomed. Opt. Express 2019, 10, 892–913.
17. Chen, X.; Xu, Y.; Wong, D.W.K.; Wong, T.Y.; Liu, J. Glaucoma Detection Based on Deep Convolutional Neural Network. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; pp. 715–718.
18. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
22. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. In Proceedings of the Advances in Neural Information Processing Systems 2021, Virtual, 6–14 December 2021; Volume 34, pp. 9355–9366.
23. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision Transformers for Remote Sensing Image Classification. Remote Sens. 2021, 13, 516.
24. Zheng, Y.; Jiang, W. Evaluation of Vision Transformers for Traffic Sign Classification. Wirel. Commun. Mob. Comput. 2022, 2022, 3041117.
25. Ghali, R.; Akhloufi, M.A.; Jmal, M.; Souidene Mseddi, W.; Attia, R. Wildfire Segmentation Using Deep Vision Transformers. Remote Sens. 2021, 13, 3527.
26. Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jegou, H. Going Deeper with Image Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 32–42.
27. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training Data-Efficient Image Transformers & Distillation through Attention. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Volume 139, pp. 10347–10357.
28. Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating Convolution Designs into Visual Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 559–568.
29. d’Ascoli, S.; Touvron, H.; Leavitt, M.L.; Morcos, A.S.; Biroli, G.; Sagun, L. ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases. J. Stat. Mech. Theory Exp. 2022, 2022, 114005.
30. Rao, S.; Li, Y.; Ramakrishnan, R.; Hassaine, A.; Canoy, D.; Cleland, J.; Lukasiewicz, T.; Salimi-Khorshidi, G.; Rahimi, K. An Explainable Transformer-Based Deep Learning Model for the Prediction of Incident Heart Failure. IEEE J. Biomed. Health Inf. 2022, 26, 3362–3372.
31. Vaid, A.; Jiang, J.; Sawant, A.; Lerakis, S.; Argulian, E.; Ahuja, Y.; Lampert, J.; Charney, A.; Greenspan, H.; Narula, J.; et al. A Foundational Vision Transformer Improves Diagnostic Performance for Electrocardiograms. NPJ Digit. Med. 2023, 6, 108.
32. Nerella, S.; Bandyopadhyay, S.; Zhang, J.; Contreras, M.; Siegel, S.; Bumin, A.; Silva, B.; Sena, J.; Shickel, B.; Bihorac, A.; et al. Transformers in Healthcare: A Survey. arXiv 2023, arXiv:2307.00067.
33. Mohan, N.J.; Murugan, R.; Goel, T.; Roy, P. ViT-DR: Vision Transformers in Diabetic Retinopathy Grading Using Fundus Images. In Proceedings of the 2022 IEEE 10th Region 10 Humanitarian Technology Conference (R10-HTC), Hyderabad, India, 16–18 September 2022; pp. 167–172.
34. Jiang, Z.; Wang, L.; Wu, Q.; Shao, Y.; Shen, M.; Jiang, W.; Dai, C. Computer-Aided Diagnosis of Retinopathy Based on Vision Transformer. J. Innov. Opt. Health Sci. 2022, 15, 2250009.
35. Wassel, M.; Hamdi, A.M.; Adly, N.; Torki, M. Vision Transformers Based Classification for Glaucomatous Eye Condition. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 5082–5088.
36. Mallick, S.; Paul, J.; Sengupta, N.; Sil, J. Study of Different Transformer Based Networks for Glaucoma Detection. In Proceedings of the TENCON 2022–2022 IEEE Region 10 Conference (TENCON), Hong Kong, 1–4 November 2022; pp. 1–6.
37. Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. ResMLP: Feedforward Networks for Image Classification with Data-Efficient Training. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5314–5321.
38. Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521.
39. Fan, R.; Alipour, K.; Bowd, C.; Christopher, M.; Brye, N.; Proudfoot, J.A.; Goldbaum, M.H.; Belghith, A.; Girkin, C.A.; Fazio, M.A.; et al. Detecting Glaucoma from Fundus Photographs Using Deep Learning without Convolutions: Transformer for Improved Generalization. Ophthalmol. Sci. 2023, 3, 100233.
40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
41. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Los Alamitos, CA, USA, 2016; pp. 2818–2826.
42. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Los Alamitos, CA, USA, 2017; pp. 1800–1807.
43. Fumero Batista, F.J.; Diaz-Aleman, T.; Sigut, J.; Alayon, S.; Arnay, R.; Angel-Pereira, D. RIM-ONE DL: A Unified Retinal Image Database for Assessing Glaucoma Using Deep Learning. Image Anal. Stereol. 2020, 39, 161–167.
44. Orlando, J.I.; Fu, H.; Breda, J.B.; van Keer, K.; Bathula, D.R.; Diaz-Pinto, A.; Fang, R.; Heng, P.-A.; Kim, J.; Lee, J.; et al. REFUGE Challenge: A Unified Framework for Evaluating Automated Methods for Glaucoma Assessment from Fundus Photographs. Med. Image Anal. 2020, 59, 101570.
45. Sivaswamy, J.; Krishnadas, S.R.; Datt Joshi, G.; Jain, M.; Syed Tabish, A.U. Drishti-GS: Retinal Image Dataset for Optic Nerve Head (ONH) Segmentation. In Proceedings of the 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI), Beijing, China, 29 April–2 May 2014; pp. 53–56.
46. Kovalyk, O.; Morales-Sánchez, J.; Verdú-Monedero, R.; Sellés-Navarro, I.; Palazón-Cabanes, A.; Sancho-Gómez, J.-L. PAPILA: Dataset with Fundus Images and Clinical Data of Both Eyes of the Same Patient for Glaucoma Assessment. Sci. Data 2022, 9, 291.
47. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016, arXiv:1603.04467. Available online: www.tensorflow.org (accessed on 4 October 2023).
48. TorchVision—TorchVision 0.15 Documentation. Available online: https://pytorch.org/vision/stable/index.html (accessed on 4 October 2023).
49. DeiT GitHub from the Meta Research Group. Available online: https://github.com/facebookresearch/deit (accessed on 4 October 2023).
50. GitHub of the Hong Kong University of Science and Technology. Available online: https://github.com/coeusguo/ceit (accessed on 4 October 2023).
51. GitHub of the Microsoft Group. Available online: https://github.com/microsoft/Swin-Transformer (accessed on 4 October 2023).
52. GitHub of the Meituan-AutoML Group. Available online: https://github.com/Meituan-AutoML/Twins (accessed on 4 October 2023).
53. ConViT GitHub from the Meta Research Group. Available online: https://github.com/facebookresearch/convit (accessed on 4 October 2023).
54. Brzezinski, D.; Stefanowski, J.; Susmaga, R.; Szczȩch, I. Visual-Based Analysis of Classification Measures and Their Properties for Class Imbalanced Problems. Inf. Sci. 2018, 462, 242–261.

Results on the test set (Section 2.3):

| Group | Architecture | Fold | Sensitivity | Specificity | Accuracy | Balanced Accuracy | F1 Score |
|---|---|---|---|---|---|---|---|
| CNNs | VGG19 | fold_1 | 0.8767 | 0.9467 | 0.9122 | 0.9117 | 0.9078 |
| | VGG19 | fold_4 | 0.9863 | 0.9467 | 0.9662 | 0.9665 | 0.9664 |
| | ResNet50 | fold_1 | 0.9315 | 0.9067 | 0.9189 | 0.9191 | 0.9189 |
| | ResNet50 | fold_3 | 0.9863 | 0.9600 | 0.9730 | 0.9732 | 0.9730 |
| | InceptionV3 | fold_2 | 0.9452 | 0.8667 | 0.9054 | 0.9059 | 0.9079 |
| | InceptionV3 | fold_3 | 0.9726 | 0.9067 | 0.9392 | 0.9396 | 0.9404 |
| | Xception | fold_5 | 0.9315 | 0.8133 | 0.8716 | 0.8724 | 0.8774 |
| | Xception | fold_1 | 0.9452 | 0.9067 | 0.9257 | 0.9259 | 0.9262 |
| ViTs | Original ViT | fold_5 | 0.9178 | 0.8933 | 0.9054 | 0.9056 | 0.9054 |
| | Original ViT | fold_4 | 0.9726 | 0.8667 | 0.9189 | 0.9196 | 0.9221 |
| | Swin Base | fold_5 | 0.9452 | 0.8667 | 0.9054 | 0.9059 | 0.9079 |
| | Swin Base | fold_1 | 0.9863 | 0.9333 | 0.9595 | 0.9598 | 0.9600 |
| | Swin Tiny | fold_1 | 0.9315 | 0.8933 | 0.9122 | 0.9124 | 0.9128 |
| | Swin Tiny | fold_2 | 0.9863 | 0.8933 | 0.9392 | 0.9398 | 0.9412 |
| | Twins-PCPVT | fold_3 | 0.9452 | 0.8800 | 0.9122 | 0.9126 | 0.9139 |
| | Twins-PCPVT | fold_2 | 1.0000 | 0.9333 | 0.9662 | 0.9667 | 0.9669 |
| Hybrid models | DeiT | fold_5 | 0.9178 | 0.9200 | 0.9189 | 0.9189 | 0.9178 |
| | DeiT | fold_4 | 0.9863 | 0.9600 | 0.9730 | 0.9732 | 0.9730 |
| | CaiT_XXS36 | fold_1 | 0.9041 | 0.8800 | 0.8919 | 0.8921 | 0.8919 |
| | CaiT_XXS36 | fold_4 | 0.9315 | 0.9333 | 0.9324 | 0.9324 | 0.9315 |
| | CaiT_S24 | fold_2 | 0.9178 | 0.8800 | 0.8986 | 0.8989 | 0.8993 |
| | CaiT_S24 | fold_4 | 0.9178 | 0.9333 | 0.9257 | 0.9256 | 0.9241 |
| | CeiT | fold_3 | 0.9041 | 0.8667 | 0.8851 | 0.8854 | 0.8859 |
| | CeiT | fold_2 | 0.9863 | 0.9067 | 0.9459 | 0.9465 | 0.9474 |
| | ConViT | fold_3 | 0.9589 | 0.8933 | 0.9257 | 0.9261 | 0.9272 |
| | ConViT | fold_5 | 0.9863 | 0.9333 | 0.9595 | 0.9598 | 0.9600 |
| Other ViT-inspired models | ResMLP_12 | fold_3 | 0.9315 | 0.8533 | 0.8919 | 0.8924 | 0.8947 |
| | ResMLP_12 | fold_1 | 0.9452 | 0.9067 | 0.9257 | 0.9259 | 0.9262 |
| | ResMLPB_24 | fold_1 | 0.9452 | 0.9467 | 0.9459 | 0.9459 | 0.9452 |
| | ResMLPB_24 | fold_5 | 0.9863 | 0.9467 | 0.9662 | 0.9665 | 0.9664 |

Results on the REFUGE dataset:

| Group | Architecture | Fold | Sensitivity | Specificity | Accuracy | Balanced Accuracy | F1 Score |
|---|---|---|---|---|---|---|---|
| CNNs | VGG19 | fold_4 | 0.7833 | 0.8537 | 0.8467 | 0.8185 | 0.5054 |
| | VGG19 | fold_5 | 0.8833 | 0.8898 | 0.8892 | 0.8866 | 0.6145 |
| | ResNet50 | fold_5 | 0.8083 | 0.8065 | 0.8067 | 0.8074 | 0.4554 |
| | ResNet50 | fold_2 | 0.8417 | 0.9009 | 0.8950 | 0.8713 | 0.6159 |
| | InceptionV3 | fold_4 | 0.6750 | 0.9843 | 0.9533 | 0.8296 | 0.7431 |
| | InceptionV3 | fold_3 | 0.8500 | 0.9389 | 0.9300 | 0.8944 | 0.7083 |
| | Xception | fold_5 | 0.9250 | 0.6898 | 0.7133 | 0.8074 | 0.3922 |
| | Xception | fold_2 | 0.8083 | 0.8963 | 0.8875 | 0.8523 | 0.5897 |
| ViTs | Original ViT | fold_1 | 0.7000 | 0.9667 | 0.9400 | 0.8333 | 0.7000 |
| | Original ViT | fold_4 | 0.8167 | 0.9343 | 0.9225 | 0.8755 | 0.6782 |
| | Swin Base | fold_2 | 0.6833 | 0.9389 | 0.9133 | 0.8111 | 0.6119 |
| | Swin Base | fold_5 | 0.8167 | 0.9194 | 0.9092 | 0.8681 | 0.6426 |
| | Swin Tiny | fold_1 | 0.5500 | 0.9454 | 0.9058 | 0.7477 | 0.5388 |
| | Swin Tiny | fold_2 | 0.7667 | 0.9315 | 0.9150 | 0.8491 | 0.6434 |
| | Twins-PCPVT | fold_4 | 0.7750 | 0.8546 | 0.8467 | 0.8148 | 0.5027 |
| | Twins-PCPVT | fold_2 | 0.8417 | 0.8713 | 0.8683 | 0.8565 | 0.5611 |
| Hybrid models | DeiT | fold_5 | 0.7417 | 0.9370 | 0.9175 | 0.8394 | 0.6426 |
| | DeiT | fold_4 | 0.7833 | 0.9352 | 0.9200 | 0.8593 | 0.6620 |
| | CaiT_XXS36 | fold_3 | 0.5000 | 0.9741 | 0.9267 | 0.7370 | 0.5769 |
| | CaiT_XXS36 | fold_5 | 0.7917 | 0.9083 | 0.8967 | 0.8500 | 0.6051 |
| | CaiT_S24 | fold_3 | 0.7083 | 0.9417 | 0.9183 | 0.8250 | 0.6343 |
| | CaiT_S24 | fold_1 | 0.8333 | 0.9009 | 0.8942 | 0.8671 | 0.6116 |
| | CeiT | fold_3 | 0.6833 | 0.8481 | 0.8317 | 0.7657 | 0.4481 |
| | CeiT | fold_4 | 0.7917 | 0.8935 | 0.8833 | 0.8426 | 0.5758 |
| | ConViT | fold_1 | 0.6250 | 0.9667 | 0.9325 | 0.7958 | 0.6494 |
| | ConViT | fold_2 | 0.8333 | 0.9435 | 0.9325 | 0.8884 | 0.7117 |
| Other ViT-inspired models | ResMLP_12 | fold_4 | 0.6417 | 0.8750 | 0.8517 | 0.7583 | 0.4639 |
| | ResMLP_12 | fold_1 | 0.8333 | 0.8296 | 0.8300 | 0.8315 | 0.4950 |
| | ResMLPB_24 | fold_2 | 0.7250 | 0.9833 | 0.9575 | 0.8542 | 0.7733 |
| | ResMLPB_24 | fold_3 | 0.8333 | 0.9380 | 0.9275 | 0.8856 | 0.6969 |

Results on the Drishti-GS1 dataset:

| Group | Architecture | Fold | Sensitivity | Specificity | Accuracy | Balanced Accuracy | F1 Score |
|---|---|---|---|---|---|---|---|
| CNNs | VGG19 | fold_4 | 0.8857 | 0.7419 | 0.8416 | 0.8138 | 0.8857 |
| | VGG19 | fold_3 | 0.8429 | 0.8065 | 0.8317 | 0.8247 | 0.8741 |
| | ResNet50 | fold_3 | 0.8143 | 0.7419 | 0.7921 | 0.7781 | 0.8444 |
| | ResNet50 | fold_1 | 0.9286 | 0.7742 | 0.8812 | 0.8514 | 0.9155 |
| | InceptionV3 | fold_2 | 0.8571 | 0.7419 | 0.8218 | 0.7995 | 0.8696 |
| | InceptionV3 | fold_5 | 0.8857 | 0.7742 | 0.8515 | 0.8300 | 0.8921 |
| | Xception | fold_2 | 0.8714 | 0.6774 | 0.8119 | 0.7744 | 0.8652 |
| | Xception | fold_3 | 0.8286 | 0.7419 | 0.8020 | 0.7853 | 0.8529 |
| ViTs | Original ViT | fold_2 | 0.5429 | 0.9032 | 0.6535 | 0.7230 | 0.6847 |
| | Original ViT | fold_3 | 0.7857 | 0.8387 | 0.8020 | 0.8122 | 0.8462 |
| | Swin Base | fold_2 | 0.6429 | 0.8065 | 0.6931 | 0.7247 | 0.7438 |
| | Swin Base | fold_5 | 0.8714 | 0.7742 | 0.8416 | 0.8228 | 0.8841 |
| | Swin Tiny | fold_2 | 0.6857 | 0.8387 | 0.7327 | 0.7622 | 0.7805 |
| | Swin Tiny | fold_3 | 0.8000 | 0.8065 | 0.8020 | 0.8032 | 0.8485 |
| | Twins-PCPVT | fold_3 | 0.6571 | 0.8065 | 0.7030 | 0.7318 | 0.7541 |
| | Twins-PCPVT | fold_2 | 0.9000 | 0.6774 | 0.8317 | 0.7887 | 0.8811 |
| Hybrid models | DeiT | fold_2 | 0.6286 | 0.8710 | 0.7030 | 0.7498 | 0.7458 |
| | DeiT | fold_5 | 0.8429 | 0.7419 | 0.8119 | 0.7924 | 0.8613 |
| | CaiT_XXS36 | fold_3 | 0.6000 | 0.8387 | 0.6733 | 0.7194 | 0.7179 |
| | CaiT_XXS36 | fold_1 | 0.7571 | 0.9032 | 0.8020 | 0.8302 | 0.8413 |
| | CaiT_S24 | fold_4 | 0.6714 | 0.8065 | 0.7129 | 0.7389 | 0.7642 |
| | CaiT_S24 | fold_1 | 0.8286 | 0.8065 | 0.8218 | 0.8175 | 0.8657 |
| | CeiT | fold_1 | 0.7857 | 0.6774 | 0.7525 | 0.7316 | 0.8148 |
| | CeiT | fold_3 | 0.7286 | 0.8387 | 0.7624 | 0.7836 | 0.8095 |
| | ConViT | fold_1 | 0.6857 | 0.8065 | 0.7228 | 0.7461 | 0.7742 |
| | ConViT | fold_3 | 0.8429 | 0.7742 | 0.8218 | 0.8085 | 0.8676 |
| Other ViT-inspired models | ResMLP_12 | fold_3 | 0.5286 | 0.9032 | 0.6436 | 0.7159 | 0.6727 |
| | ResMLP_12 | fold_5 | 0.8429 | 0.8065 | 0.8317 | 0.8247 | 0.8741 |
| | ResMLPB_24 | fold_2 | 0.7429 | 0.7742 | 0.7525 | 0.7585 | 0.8062 |
| | ResMLPB_24 | fold_3 | 0.9143 | 0.7419 | 0.8614 | 0.8281 | 0.9014 |

Results on the Papila dataset:

| Group | Architecture | Fold | Sensitivity | Specificity | Accuracy | Balanced Accuracy | F1 Score |
|---|---|---|---|---|---|---|---|
| CNNs | VGG19 | fold_5 | 0.7816 | 0.7538 | 0.7595 | 0.7677 | 0.5738 |
| | VGG19 | fold_2 | 0.7586 | 0.8438 | 0.8262 | 0.8012 | 0.6439 |
| | ResNet50 | fold_5 | 0.7011 | 0.7868 | 0.7690 | 0.7440 | 0.5571 |
| | ResNet50 | fold_2 | 0.7126 | 0.8679 | 0.8357 | 0.7903 | 0.6425 |
| | InceptionV3 | fold_2 | 0.8276 | 0.6907 | 0.7190 | 0.7591 | 0.5496 |
| | InceptionV3 | fold_1 | 0.7931 | 0.8018 | 0.8000 | 0.7975 | 0.6216 |
| | Xception | fold_3 | 0.8851 | 0.6336 | 0.6857 | 0.7593 | 0.5385 |
| | Xception | fold_1 | 0.7931 | 0.7538 | 0.7619 | 0.7734 | 0.5798 |
| ViTs | Original ViT | fold_4 | 0.8621 | 0.5000 | 0.5748 | 0.6810 | 0.4559 |
| | Original ViT | fold_5 | 0.8161 | 0.7126 | 0.7340 | 0.7643 | 0.5591 |
| | Swin Base | fold_4 | 0.5862 | 0.7934 | 0.7506 | 0.6898 | 0.4928 |
| | Swin Base | fold_1 | 0.7356 | 0.8174 | 0.8005 | 0.7765 | 0.6038 |
| | Swin Tiny | fold_5 | 0.8161 | 0.6048 | 0.6485 | 0.7104 | 0.4897 |
| | Swin Tiny | fold_4 | 0.7356 | 0.7814 | 0.7720 | 0.7585 | 0.5714 |
| | Twins-PCPVT | fold_5 | 0.8046 | 0.5299 | 0.5867 | 0.6673 | 0.4459 |
| | Twins-PCPVT | fold_3 | 0.7701 | 0.7844 | 0.7815 | 0.7773 | 0.5929 |
| Hybrid models | DeiT | fold_5 | 0.7931 | 0.6048 | 0.6437 | 0.6989 | 0.4792 |
| | DeiT | fold_4 | 0.6552 | 0.8443 | 0.8052 | 0.7497 | 0.5816 |
| | CaiT_XXS36 | fold_3 | 0.6437 | 0.8234 | 0.7862 | 0.7335 | 0.5545 |
| | CaiT_XXS36 | fold_1 | 0.7241 | 0.7814 | 0.7696 | 0.7528 | 0.5650 |
| | CaiT_S24 | fold_4 | 0.6437 | 0.8353 | 0.7957 | 0.7395 | 0.5657 |
| | CaiT_S24 | fold_1 | 0.7701 | 0.8054 | 0.7981 | 0.7878 | 0.6119 |
| | CeiT | fold_4 | 0.7126 | 0.7246 | 0.7221 | 0.7186 | 0.5145 |
| | CeiT | fold_5 | 0.7701 | 0.8054 | 0.7981 | 0.7878 | 0.6119 |
| | ConViT | fold_5 | 0.7931 | 0.6317 | 0.6651 | 0.7124 | 0.4946 |
| | ConViT | fold_1 | 0.7471 | 0.7305 | 0.7340 | 0.7388 | 0.5372 |
| Other ViT-inspired models | ResMLP_12 | fold_4 | 0.6092 | 0.7605 | 0.7292 | 0.6848 | 0.4818 |
| | ResMLP_12 | fold_1 | 0.5862 | 0.8922 | 0.8290 | 0.7392 | 0.5862 |
| | ResMLPB_24 | fold_4 | 0.7126 | 0.7964 | 0.7791 | 0.7545 | 0.5714 |
| | ResMLPB_24 | fold_5 | 0.7241 | 0.8683 | 0.8385 | 0.7962 | 0.6495 |

Top five architectures for each image set, ranked by balanced accuracy:

| Ranking Position | Test Set | REFUGE | Drishti-GS1 | Papila |
|---|---|---|---|---|
| 1 | ResNet50 97.32% | InceptionV3 89.44% | ResNet50 85.14% | VGG19 80.12% |
| 2 | DeiT 97.32% | ConViT 88.84% | CaiT_XXS36 83.02% | InceptionV3 79.75% |
| 3 | Twins-PCPVT 96.67% | VGG19 88.66% | InceptionV3 83.00% | ResMLPB_24 79.62% |
| 4 | ResMLPB_24 96.65% | ResMLPB_24 88.56% | ResMLPB_24 82.81% | ResNet50 79.03% |
| 5 | VGG19 96.65% | Original ViT 87.55% | VGG19 82.47% | CaiT_S24 78.78% |