1. Introduction
Diabetic retinopathy (DR) is an eye disease caused by high blood sugar levels that damage the retinal capillaries [1]. DR is the leading cause of blindness among working-age adults [2,3]. In 2015, blindness due to DR was estimated to affect 2.6 million people, and this number was projected to reach 3.2 million by 2020 [1]. The prevalence of DR is expected to increase over the next decade along with the rise in diabetes, especially in Asian countries such as Indonesia, India, and China [4]. This increase can be curbed by early detection and treatment of the retinal damage caused by DR [5]. A rapid screening process is required so that people with DR receive immediate and appropriate treatment [6]. Screening can be supported by a Computer-Aided Diagnosis (CAD) system, which can diagnose DR efficiently: with sufficient computational power, a CAD system classifies fundus image data to identify damage to the retina [7]. CAD begins with pre-processing the image data so that it is optimal for the learning model [8].
CAD has been widely used for diagnosing DR in previous studies. DR detection using the Gray Level Co-Occurrence Matrix (GLCM) and Support Vector Machine (SVM) methods achieved an accuracy of 82.5% in distinguishing DR from normal images and 100% in distinguishing proliferative DR (PDR) from non-proliferative DR (NPDR) [9]. One of the difficulties in DR detection is learning the features in the fundus image [7]. Feature learning can be carried out with the convolutional neural network (CNN) algorithm [10]. A study by Navoneel Chakrabarty showed that the CNN algorithm achieved 100% accuracy in DR classification [11].
CNN is a deep learning algorithm with several architectures, including GoogleNet, ResNet, and DenseNet [12]. These architectures have been widely used in previous studies. R. Anand used the GoogleNet architecture for face detection and obtained an overall accuracy of 91.43%, which is high for facial recognition and exceeds conventional Machine Learning (ML) techniques. The amount of data used to train the model also affects the accuracy of a classification system: the more data trained with adequate computing power, the more accurate the predictions, reaching up to 99% [13]. Arpana Mahajan studied the ResNet architecture to extract features from categorical images, which were then classified using SVM. ResNet was tested with different numbers of layers, and the 18-layer version obtained the highest accuracy of 93.57% [14]. Hua Li used the DenseNet architecture to classify benign and malignant mammogram images, reaching an accuracy of 94.55% [15]. These studies show that the CNN algorithm performs well in classification but uses many layers, which requires a computer with a large capacity and a long training time [16]. Several researchers have therefore developed CNN variants that keep the convolutional features of the CNN architecture but replace the classification stage with a different method.
One such development is the Convolutional Extreme Learning Machine (CELM), which combines the convolutional features of the CNN architecture with the Extreme Learning Machine (ELM) classifier. CELM has been used to recognize handwritten digits from the MNIST dataset [17,18,19], achieving an accuracy of 98.43%, better than both ELM and CNN, with a shorter training time than either. Although CELM outperforms CNN, ELM is still a single-hidden-layer method and is not well suited to pattern recognition in large datasets; the development of ELM into a multilayer, deep learning system is called the Deep Extreme Learning Machine (DELM) [20]. DELM's main advantage is its training time, making it one of the deep learning methods with the fastest training process. DELM also performs well in image classification (the MNIST database, the CIFAR-10 dataset, and the Google Street View House Numbers dataset), with an average accuracy of 95.16% [21], and it can reach high accuracy in as little as 9.02 s [22]. DELM combines several algorithms derived from the development of ELM; it has a more complex structure than ELM but can train models faster than the ELM algorithm [23].
Based on the problems described above, the convolutional features in the CNN architecture can recognize image patterns well. Therefore, the classification process with DELM should use the convolutional features of the CNN architecture for feature extraction. This study aims to build a hybrid CNN-DELM, or Convolutional Deep Extreme Learning Machine (CDELM), method that recognizes image patterns with better accuracy and a faster training time. CDELM is applied in this study using the GoogleNet, ResNet, and DenseNet CNN architectures.
3. Results
In this research, classification was performed in two experiments: a 2-class experiment (Normal and DR) and a 4-class experiment (Normal, Mild, Moderate, and Severe). The initial stage consists of cropping the image, applying CLAHE, resizing to the input size of the CNN architecture, and performing augmentation. The results of the CLAHE process are shown in Figure 8, which shows an increase in image quality. DR is highly dependent on the state of the blood vessels in the retina, and the CLAHE method makes the condition of the blood vessels clearly visible, from microaneurysms to bleeding in the retina. The following preprocessing stages are resizing and augmentation. The augmentation method used was a random rotation between 1 and 359 degrees. Augmentation results can be seen in Figure 9.
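As an illustration of this preprocessing pipeline, the following minimal sketch assumes Python with OpenCV and NumPy; the function names and parameter values (such as the CLAHE clip limit, tile grid size, and background threshold) are illustrative and not taken from the original implementation.

import cv2
import numpy as np

def preprocess_fundus(path, out_size=224):
    """Crop the dark background, apply CLAHE, and resize one fundus image (assumed parameters)."""
    img = cv2.imread(path)                                        # BGR fundus image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ys, xs = np.nonzero(gray > 10)                                # rough crop of the black border
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)                    # apply CLAHE on the luminance channel
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab[:, :, 0] = clahe.apply(lab[:, :, 0])
    img = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
    return cv2.resize(img, (out_size, out_size))                  # match the CNN input size

def augment_rotation(img, rng=np.random.default_rng(0)):
    """Augmentation by a random rotation between 1 and 359 degrees."""
    angle = rng.uniform(1, 359)
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))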
The next stage after augmentation is feature extraction from the fundus images, carried out with several CNN architectures: GoogleNet, ResNet, and DenseNet. The feature extraction results for each CNN architecture can be seen in Table 1. The CNN features are obtained through repeated convolution, pooling, and activation operations until a feature vector is obtained for each image.
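A sketch of how such fixed feature vectors could be obtained, assuming PyTorch/torchvision pretrained backbones as stand-ins for the architectures above (e.g., densenet121 standing in for DenseNet); the cut-off layer, the 224 × 224 input size, the ImageNet weights, and the normalization constants are assumptions for illustration, not the exact procedure of this study.

import torch
import torch.nn as nn
from torchvision import models, transforms

def feature_extractor(name):
    """Load a pretrained backbone and strip its classification head (illustrative stand-ins)."""
    constructors = {
        "googlenet": models.googlenet,
        "resnet18": models.resnet18,
        "resnet50": models.resnet50,
        "resnet101": models.resnet101,
        "densenet121": models.densenet121,
    }
    net = constructors[name](weights="DEFAULT")       # ImageNet weights assumed
    if name.startswith("densenet"):
        net.classifier = nn.Identity()                # DenseNet exposes .classifier
    else:
        net.fc = nn.Identity()                        # GoogleNet/ResNet expose .fc
    return net.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(net, pil_image):
    """Return one feature vector per image, e.g., 2048-D for ResNet-101."""
    x = preprocess(pil_image).unsqueeze(0)            # shape (1, 3, 224, 224)
    return net(x).squeeze(0).numpy()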
Based on Table 1, the feature extraction results from each architecture show different values and numbers of features. The data are divided into training and testing sets using five-fold cross-validation. In the two-class dataset, the training data contain 2880 images per class and the testing data 720 images per class, while in the four-class dataset the training data contain 2607 images per class and the testing data 651 images per class. The next step is classification with the DELM method, evaluated by accuracy, sensitivity, specificity, and training time to determine the most optimal model for the DR classification system (a simplified sketch of this kernel-based classification step is given below). CDELM performance is then compared on the 2-class and 4-class DRIVE and MESSIDOR data. The results of the experiments on the 2-class DRIVE and MESSIDOR data with several CNN architectures are shown in Table 2 and Table 3. The bold values in Table 2, Table 3, Table 4 and Table 5 indicate the best result of each kernel experiment.
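The kernel-based classification step referred to above can be sketched, in simplified form, as a single kernel ELM output layer: the training kernel matrix Omega is regularized and inverted to obtain the output weights, and predictions are kernel evaluations between test and training samples. The full DELM stacks additional hidden layers (for example, ELM autoencoders) before this output step; the regularization constant C and the kernel parameters below are assumed values for illustration.

import numpy as np

def kernel(X, Y, kind="poly", degree=3, gamma=1e-3, coef0=1.0):
    """Linear, polynomial, and RBF kernels compared in the experiments (parameters assumed)."""
    if kind == "linear":
        return X @ Y.T
    if kind == "poly":
        return (gamma * (X @ Y.T) + coef0) ** degree
    if kind == "rbf":
        d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-gamma * d2)
    raise ValueError(kind)

def kelm_train(X_train, y_onehot, kind="poly", C=1.0):
    """Solve the output weights alpha = (I/C + Omega)^-1 T, with Omega the N x N kernel matrix."""
    omega = kernel(X_train, X_train, kind)
    n = omega.shape[0]
    return np.linalg.solve(np.eye(n) / C + omega, y_onehot)

def kelm_predict(X_test, X_train, alpha, kind="poly"):
    """Score test samples against the training set and return predicted class indices."""
    scores = kernel(X_test, X_train, kind) @ alpha
    return scores.argmax(axis=1)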
The features obtained from the CNN architectures ResNet-18, ResNet-50, ResNet-101, GoogleNet, and DenseNet can represent each class in the 2-class DRIVE data, with accuracy values above 90%. The high accuracy is also influenced by the performance of DELM as a classification method that handles many features well. Another advantage of the DELM classifier is its training time of less than 5 min. The performance of the DELM method depends strongly on the compatibility of the data with the kernel function used; on the 2-class DRIVE data, the polynomial kernel performs better than the linear and RBF kernels. The comparison of accuracy values on the two-class DRIVE dataset is shown in Figure 10.
Based on Figure 10, the DELM classification with a polynomial kernel achieves 100% accuracy for every architecture. With a linear kernel, the accuracy only ranges from 90% to 98%, while the RBF kernel separates the classes better than the linear kernel, with accuracy values between 92% and 99%. In terms of average accuracy, the ResNet-101 architecture best represents the characteristics of normal and DR fundus images, with an average accuracy of 99.17%, followed by ResNet-50, DenseNet, GoogleNet, and ResNet-18 with 98.10%, 97.87%, 96.16%, and 94.54%, respectively.
Similar to the previous experiment, on the two-class MESSIDOR dataset the CDELM methods also perform well, with accuracy above 90%, showing that the CNN architectures represent the fundus image features in this dataset well. The combination of a CNN method suited to feature extraction with the DELM method, which performs well in classification, plays an essential role in producing high accuracy. The DELM classification process takes only approximately 4 min, about 1 min faster than the 2-class DRIVE experiment. The comparison of accuracy values on the two-class MESSIDOR dataset is shown in Figure 11.
Based on Figure 11, the highest accuracy obtained is 100%, achieved with the polynomial kernel on every CNN architecture, showing that the polynomial kernel separates the two classes of the 2-class MESSIDOR data very well. The ResNet-101 architecture best represents the characteristics of normal and DR fundus images, with an average accuracy of 99.03%, followed by ResNet-50, DenseNet, GoogleNet, and ResNet-18 with 98.31%, 97.59%, 96.39%, and 94.44%, respectively. Overall, the results show that the more features generated by the CNN feature extraction, the higher the accuracy obtained. The results of the experiments on the 4-class DRIVE and MESSIDOR datasets are shown in Table 4 and Table 5.
The classification of the 4-class DRIVE data shows poor results for the ResNet-18, ResNet-50, and GoogleNet architectures with the linear kernel function. These three architectures are not as deep as ResNet-101 and DenseNet, which indicates that the features they produce do not represent the fundus image well and cannot be separated linearly. In contrast, the RBF and polynomial kernel functions produce accuracy above 90% for every architecture, indicating their compatibility with the fundus image features. The training time in the 4-class DRIVE experiments is relatively short, ranging from 2 to 2.5 min. The comparison of accuracy values on the 4-class DRIVE dataset is shown in Figure 12.
Based on Figure 12, all experiments using the polynomial kernel obtain perfect results, reaching 100% accuracy. The DELM classification with a linear kernel only produces an accuracy of 79% to 93%, while the RBF kernel produces higher results than the linear kernel, with an accuracy of 84% to 95%. In terms of average accuracy, the best CNN architecture for extracting fundus image features is ResNet-101, with an average accuracy of 97.18%, followed by DenseNet, ResNet-50, GoogleNet, and ResNet-18 with average accuracies of 95.33%, 94.68%, 82.42%, and 88.02%, respectively.
Based on Table 5, the performance of the CDELM method on the 4-class MESSIDOR data is not as good as on the 2-class MESSIDOR data and the 4-class DRIVE data: with the linear and RBF kernels the accuracy is below 70%. These results indicate that the fundus image features in the 4-class MESSIDOR dataset do not differ significantly between classes, so the DELM method cannot properly separate the data for each class. However, the CDELM method with a polynomial kernel still obtains a fairly good classification result of more than 90% in each experiment, with a best result of 98.20%. The classification system with a polynomial kernel achieves good accuracy in every CNN architecture experiment with a relatively short training duration. The comparison of accuracy values on the 4-class MESSIDOR dataset is shown in Figure 13.
Figure 13 shows that, for each CNN architecture tested, the polynomial kernel performs well on the 4-class MESSIDOR data. The accuracy of the classification system with the polynomial kernel differs significantly from that with the linear and RBF kernels: the difference reaches 25% to 35% for the linear kernel and 25% to 33% for the RBF kernel. The performance comparison of the CNN architectures shows that ResNet-101 is superior, with an accuracy of 98.20%, followed by DenseNet, ResNet-50, GoogleNet, and ResNet-18. The CNN architectures (GoogleNet, ResNet-18, ResNet-50, ResNet-101, and DenseNet) use different feature extraction computations, which affects the extracted feature values used as input to the classification process. Overall, the choice of image feature extraction method greatly affects the evaluation results of the classification system. The comparison of the CNN architectures can be seen in Figure 14.
Several CNN architectures were tested to determine which gives the best DR classification results. Figure 14 shows that the highest overall values are obtained by the ResNet-101 architecture, with an average accuracy of 92.88%, sensitivity of 92.88%, and specificity of 92.84%. However, ResNet-101 also takes the longest computing time, with an average of 223.92 s. The difference in accuracy between architectures is only approximately 4%, and the time difference only approximately 10 s, showing that the CNN architectures perform almost identically; however, the 4-class experiments show more significant differences between them. ResNet-101 is the CNN architecture with the best performance in classifying DR. The comparison of average accuracy by kernel function over all experiments is shown in Figure 15.
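For reference, assuming the standard confusion-matrix definitions (with TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives), the evaluation measures reported here are computed as Accuracy = (TP + TN)/(TP + TN + FP + FN), Sensitivity = TP/(TP + FN), and Specificity = TN/(TN + FP).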
This study tested various kernels in the DELM model, namely linear, polynomial, and RBF. Figure 15 shows that the polynomial kernel achieved the best results for both class settings (2-class and 4-class). In the 4-class experiments, the average evaluation value decreased by 11.6% for the linear kernel, 1.6% for the polynomial kernel, and 13.8% for the RBF kernel, so the polynomial kernel shows the smallest decrease. Across all DELM experiments, the polynomial kernel separates the features generated by each architecture very well; the features obtained from CNN feature extraction are complex, and the polynomial kernel is more suitable for classifying such global features than the RBF kernel. The linear kernel produces lower accuracy than the RBF and polynomial kernels, indicating that the fundus feature data cannot be separated linearly. The three kernels differ by only 1 to 5 s in computing time, so the choice of DELM kernel has little effect on computation time.
The amount of data strongly affects CDELM performance. This study conducted several experiments with different numbers of samples taken from the MESSIDOR dataset. The first experiment used the number of samples corresponding to the minimum class size in the DRIVE dataset; its results were obtained from the best configuration in Table 6, that is, classification using ResNet-101 feature extraction with the polynomial kernel in DELM. The second experiment used the same architecture with 20,000 samples, the third used 50,000, and the fourth used 60,000. The experiments were limited to 60,000 samples because DELM has limitations with very large datasets: when the data exceed about 60,000 samples, multiplying the resulting large square matrices in DELM causes errors during the training process.
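This limitation is consistent with the size of the square matrix that the method must store and invert: assuming an N × N matrix in double precision, N = 60,000 already requires 60,000² × 8 bytes ≈ 2.88 × 10¹⁰ bytes, or roughly 28.8 GB, for a single matrix, before any intermediate products of the inversion are counted.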
The results of the four experiments in Figure 16 show that the more data used, the better the classification system, although the accuracy increases together with the computational time required for classification. Across all tests, ResNet-101 performs well. Compared with conventional ResNet-101 using a batch size of 64, ResNet101-DELM achieves better performance with a shorter training time, as demonstrated by the comparison of ResNet-101 and ResNet101-DELM shown in Table 7.
Based on the results of the performance comparison of ResNet-101 and ResNet101-DELM in Table 7, the graphs comparing the accuracy and the duration of training time for the 2-class and 4-class cases are shown in Figure 17.
Over all experiments, ResNet101-DELM produces higher accuracy than conventional ResNet-101, with accuracy values reaching almost 100% in several experiments. Combining DELM with ResNet-101 increases accuracy by 0.01% on the DRIVE data and by up to 7% on the MESSIDOR data. In addition, ResNet101-DELM requires a much shorter time than conventional ResNet-101, with a difference of 300 to 1000 s. This shows that ResNet101-DELM is more optimal than conventional ResNet-101. A comparison with the fundus image classification results of several previous studies on diabetic retinopathy identification is shown in Table 8.
Based on Table 8, the ResNet101-DELM hybrid performs well in fundus image classification for diabetic retinopathy identification. The study in [46], using a customized five-layer CNN with a segmentation process, obtained an accuracy of 98.15% on the MESSIDOR dataset. The study in [12] modified the AlexNet CNN architecture with input images using only the green channel and obtained an accuracy of 92.35% on the MESSIDOR data. The study in [47], using the ResNet-101 method on the DRIVE dataset, obtained an accuracy of only 95.10%, whereas the present study, using the same method with the addition of CLAHE, obtained 100%. With the same level of accuracy, ResNet101-DELM has a much shorter training time. The sensitivity value reflects how well the system recognizes DR cases (Mild, Moderate, Severe), while specificity reflects how well it recognizes the normal fundus; in medical applications, a classification system with high sensitivity is preferable to one with high specificity, because a normal fundus identified as DR merely increases the patient's awareness of DR, whereas a missed DR case delays treatment. Based on the comparison with several previous studies, CNN-DELM is an image classification method with good performance and a short training time. Its weakness lies in the classification of large amounts of data, where data partitioning is needed to handle the dimensions of the square matrix in the DELM method.