1. Introduction
The gastrointestinal (GI) tract, or digestive tract, is the system of organs responsible for digestion in humans. Like all other parts of the human body, the GI tract is subject to various diseases, such as inflammatory, autoimmune, and tumorous diseases. Diagnosing these diseases requires clinical examination, which may include procedures such as the fecal occult blood test and endoscopy. Colonoscopy is generally considered the most accurate method for identifying lesions in the colon and is also used for therapeutic purposes. Colorectal cancer is the second leading cause of cancer deaths and is one of the five most important cancers to be screened for. According to the National Statistical Office (NSO) survey in 2016, colorectal cancer was the third most common cause of cancer death, after lung cancer and liver cancer. In the past, the colorectal cancer incidence rate was high mainly in Western countries; however, it is now rapidly increasing in South Korea as well. According to the adenoma-carcinoma sequence hypothesis, 95% of colorectal cancers in the general population develop through an adenoma stage. Therefore, early detection and removal of polyps at the adenoma stage are crucial for the prevention of colorectal cancer [1]. Colorectal cancer is diagnosed by a physician, and the reliability of this diagnosis depends on the physician's experience. Although certain objective criteria are used to ensure an accurate diagnosis, most physicians tend to follow subjective criteria, so the results of a colonoscopy carry an inherently subjective disadvantage [2]. The introduction of a system based on artificial intelligence can assist in obtaining a more accurate diagnosis and reduce human error.
Attempts to analyze medical images using computerized methods date back many decades. The concept of Computer-Aided Diagnosis (CAD) appeared in the 1970s, when scanned medical images were first analyzed on computers. From the 1970s to the 1990s, rule-based systems and expert systems were widely used. Rule-based systems use low-level image processing to extract edges and lines with filters, and mathematical structures such as circles and ellipses are fitted to obtain and analyze image components. Expert systems, on the other hand, analyze the results of image processing through chains of conditional statements (if-else rules) and are regarded as Good Old-Fashioned Artificial Intelligence (GOFAI) [3].
Systems trained on data to improve their performance became popular in the late 1990s. This approach requires two main steps: feature extraction and classification. Features such as color, shape, and texture are extracted during the feature extraction step; the crucial part is to extract the features that best represent the image. The extracted features are then analyzed using various machine learning algorithms. The work done in [3,4,5,6,7] followed this approach, combining a feature extraction step with linear classifiers for classification. However, such methods rely on texture analysis, which requires expert knowledge about the features during the extraction process. Hence, they generalize poorly and do not transfer well to other tasks.
Currently, deep learning is widely used in medical image analysis. The 1998 paper by Yann LeCun et al. [8] laid the foundation of today's deep learning. Artificial intelligence based on deep learning has shown excellent results in various fields such as speech recognition, language identification, behavior recognition, and image retrieval. Medical image analysis mostly deals with diagnosing diseases and detecting the affected area. Disease diagnosis using artificial intelligence is an active research area owing to the development and state-of-the-art performance of deep learning. Recently, CNNs have been reported to be highly useful in the field of endoscopy, especially Esophagogastroduodenoscopy (EGD), capsule endoscopy, and colonoscopy. The works in [9,10,11,12] utilized CNN-based diagnostic systems to localize and classify EGD images effectively. CNNs were also applied to colonoscopy images to detect and classify colorectal polyps [13,14,15,16], where they were shown to outperform methods based on traditional hand-crafted features. Other typical uses of deep learning for disease diagnosis include skin cancer screening and diabetic retinopathy diagnosis [17,18]. Esteva et al. [17] used Google's Inception v3 [19] model to recognize 757 types of skin cancer. In addition, CNNs were used to measure the severity of knee osteoarthritis in X-ray images and to detect lymph nodes in [20] and [21], respectively. CNNs have also achieved good results in detecting brain tumors [22,23] and in lung nodule classification [24]. Both a good model and good data are required to obtain strong performance. Although the availability of large amounts of data has significantly increased performance, high-quality training data are also needed to increase the diagnostic ability of a network. However, since medical images are acquired under relatively controlled conditions, they are stereotypical, and good generalization performance can be achieved even with a relatively small dataset. Furthermore, the number of layers in the network plays a vital role in extracting deep features from the images.
In this paper, we analyze the performance of networks with different numbers of layers and propose a convolutional neural network that can classify the normal colon, adenomatous polyps, and adenocarcinoma in colonoscopy images.
2. Image Classification Using Deep Learning
In the late 1990s, the LeNet [8] architecture was used for image classification using deep learning. LeNet's architecture consists of a convolution layer, a pooling layer, and a fully-connected layer. The architecture of most deep image classification methods is inspired by LeNet. The operation performed in the convolution layer is given by Equation (1):

$$F_i = W * x_i + b \qquad (1)$$

where $x_i$ refers to the $i$th pixel, $W$ corresponds to the weight value used for convolution, $*$ denotes the convolution operation, and $b$ is the bias. The kernel $W$ slides across the image and performs the convolution operation between $W$ and $x_i$. The convolution output over all pixels of an image results in a feature map $F$. Multiple kernels are used to generate multiple feature maps.
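As a concrete illustration, the following is a minimal NumPy sketch of Equation (1) for a single kernel; the image and kernel sizes are illustrative.

```python
# Minimal sketch of Equation (1): a single 3 x 3 kernel W slid over a
# single-channel image x with bias b, producing one feature map F.
# (Implemented as cross-correlation, as in most deep learning frameworks.)
import numpy as np

def conv2d(x, W, b):
    k = W.shape[0]                              # kernel size (square)
    h, w = x.shape
    F = np.zeros((h - k + 1, w - k + 1))        # valid padding
    for r in range(F.shape[0]):
        for c in range(F.shape[1]):
            F[r, c] = np.sum(W * x[r:r + k, c:c + k]) + b
    return F

x = np.random.rand(8, 8)                        # toy image
W = np.random.rand(3, 3)                        # one kernel
F = conv2d(x, W, b=0.1)                         # 6 x 6 feature map
```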
To reduce the image size, the pooling layer selects the maximum value within a $k \times k$ kernel at row $r$ and column $h$ of the feature map $F$, as shown in Equation (2):

$$P_{r,h} = \max_{0 \le i,\, j < k} F_{r+i,\, h+j} \qquad (2)$$

This procedure is known as max-pooling. If the average value is taken instead of the maximum, the operation is known as average pooling.
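A corresponding sketch of Equation (2), with non-overlapping windows as an illustrative assumption:

```python
# Max-pooling as in Equation (2): the maximum of each k x k window of the
# feature map F; replacing np.max with np.mean gives average pooling.
import numpy as np

def max_pool(F, k=2):
    rows, cols = F.shape[0] // k, F.shape[1] // k
    return np.array([[np.max(F[r * k:(r + 1) * k, c * k:(c + 1) * k])
                      for c in range(cols)] for r in range(rows)])

F = np.arange(16.0).reshape(4, 4)
P = max_pool(F, k=2)                            # 2 x 2 pooled map
```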
AlexNet [25], published in 2012, consists of five convolution layers and three fully-connected layers, as shown in Figure 1. To mitigate the vanishing gradient problem, AlexNet replaces the sigmoid or hyperbolic tangent activation function with the Rectified Linear Unit (ReLU) [26] activation function shown in Equation (3):

$$f(x) = \max(0, x) \qquad (3)$$

To reduce overfitting, the dropout [27] method was applied during training. The AlexNet architecture showed excellent results, with a 15.4% test error rate in the ImageNet image recognition challenge.
Two years after AlexNet was published, VGGNet [28] was developed at the University of Oxford. VGGNet consists of up to sixteen convolution layers. Unlike AlexNet, which uses a variety of kernel sizes, VGGNet reduces the number of parameters by using a fixed 3 × 3 kernel size. The nonlinearity of the network increases with the number of convolution layers, which aids in extracting more distinctive features.
A limitation of deep networks is that they may train poorly because of the vanishing gradient problem as the architecture gets deeper. To solve this problem, ResNet [29] introduced the block structure shown in Figure 2, in which the input and output of a block are connected to each other. This is called the residual structure, and it learns to minimize the difference between the input and output. Through this structure, the vanishing gradient problem is mitigated, and training results can be compared and evaluated as the number of layers in the network is increased.
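To make the residual structure concrete, the following is a minimal Keras sketch of one such block; the filter count is illustrative, and it assumes the input already has `filters` channels (real ResNets apply a 1 × 1 projection otherwise).

```python
# A residual block: the block input x is added back to the convolution
# output through a skip connection, so the layers learn only the residual.
from tensorflow.keras import layers

def residual_block(x, filters=64):
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([x, y])                    # identity skip connection
    return layers.Activation("relu")(y)
```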
Further, DenseNet [30] was introduced in 2017, in which each layer takes all preceding feature maps as its input. Unlike ResNet, DenseNet directly connects every layer to every subsequent layer with the same feature map size and transmits the concatenated feature maps as the input of the next layer, as shown in Figure 3. It has fewer parameters and requires fewer computations while achieving better performance than the previous state-of-the-art.
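Analogously, a minimal Keras sketch of a dense block; the layer count and growth rate are illustrative assumptions.

```python
# A dense block: each layer receives the concatenation of all preceding
# feature maps, which are then passed forward to later layers as well.
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    for _ in range(num_layers):
        y = layers.Conv2D(growth_rate, 3, padding="same",
                          activation="relu")(x)
        x = layers.Concatenate()([x, y])        # reuse all earlier maps
    return x
```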
4. Experiments
Our dataset consisted of three types of images, which were used for training and testing our network. Sample images from our dataset can be seen in Figure 7, where Figure 7a shows colonoscopy images of a normal colon, Figure 7b shows colonoscopy images containing an adenomatous polyp, and Figure 7c shows colonoscopy images containing a cancerous adenomatous polyp. As shown in Figure 7a, the colon of a normal person has no polyps in the mucosa. When a polyp develops in the colon (see Figure 7b), it can grow into a relatively large adenomatous polyp (see Figure 7c). An adenomatous polyp can develop into adenocarcinoma, a malignant tumor, and become cancerous. The shapes and sizes of cancerous polyps vary widely, without any specific pattern (see Figure 7c).
4.1. Experimental Data
Our original image dataset consisted of 449 cancer images, 626 adenomatous polyp images, and 773 normal images, as shown in Table 1. However, this meager amount of data was insufficient for training a deep neural network. To tackle this issue, we used data augmentation to increase the dataset size: each image was rotated at various angles between 10° and 360°. After data augmentation, we obtained 16,609 adenocarcinoma images, 16,616 adenomatous polyp images, and 16,233 normal images, as shown in Table 1. Our final dataset consisted of a total of 49,458 endoscopic images. Our test dataset consisted of 140 cancer images, 142 adenomatous polyp images, and 128 normal images, as shown in Table 2.
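The rotation-based augmentation can be sketched as follows; the 10° step size and the OpenCV-based rotation are illustrative assumptions, as the exact procedure is not specified.

```python
# Hedged sketch of the augmentation step: rotate each image at several
# angles between 10 and 360 degrees; the step size here is an assumption.
import cv2

def augment_by_rotation(image, angles=range(10, 361, 10)):
    h, w = image.shape[:2]
    for angle in angles:
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        yield cv2.warpAffine(image, M, (w, h))

# Usage: list(augment_by_rotation(img)) yields 36 rotated copies per image.
```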
4.2. Convolution Layer Experiments
As described above, the effect of convolution differs according to the number of convolution layers and the pooling stages. To find an optimal model for colonoscopy image recognition, we performed experiments in which the convolution layer structure was varied as shown in Table 3. We selected the number of layers that gave the best result by changing the number of convolution layers after each pooling step. In the first experiment, we increased the number of convolution layers using configurations of 4, 6, and 8 layers, keeping the same number of convolution layers after each pooling step. In the second experiment, the number of convolution layers was increased around a specific pooling step (see the sketch below).
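The configurations in these experiments can be expressed as a per-stage layer count; the following Keras sketch shows one plausible construction, with the input size, filter count, and head layers as illustrative assumptions rather than the exact proposed architecture.

```python
# Build a VGG-style network where conv_per_stage[i] gives the number of
# 3 x 3 convolution layers before the (i+1)-th pooling step, e.g.
# (4, 4, 4, 4) for a uniform model, or a larger count in the stage after
# the second pooling step. All layers use the same number of channels.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(conv_per_stage=(4, 4, 4, 4), filters=64, num_classes=3):
    inputs = keras.Input(shape=(224, 224, 3))   # input size is assumed
    x = inputs
    for n in conv_per_stage:
        for _ in range(n):
            x = layers.Conv2D(filters, 3, padding="same",
                              activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```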
As a result of the first experiment, accuracy gradually increased with the number of convolution layers, as shown in Table 4. This showed that results improved as the number of convolution layers increased, but when the number of layers was increased further, performance deteriorated due to overfitting. In the second experiment, in which the convolution layers were concentrated around a particular pooling step, the highest accuracy was obtained when the largest number of convolution layers was placed after the second pooling step.
4.3. Experimental Evaluation
This section describes the metrics used for evaluation on the test sets. If a model correctly predicts the positive class, the outcome is a True Positive (TP). Similarly, if a model correctly predicts the negative class, the outcome is a True Negative (TN). On the other hand, if a model incorrectly predicts the positive class, the outcome is a False Positive (FP), and if it incorrectly predicts the negative class, the outcome is a False Negative (FN). The accuracy, precision, and sensitivity are calculated using Equation (4):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Sensitivity} = \frac{TP}{TP + FN} \qquad (4)$$
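For reference, a minimal sketch computing the three metrics of Equation (4) from raw counts:

```python
# Accuracy, precision, and sensitivity from the four outcome counts.
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)                # also known as recall
    return accuracy, precision, sensitivity
```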
4.4. Network Training
The implementation was based on Keras with a TensorFlow backend. We used stochastic gradient descent with a batch size of 8 for all methods. The learning rate started at 0.001 and was decreased by a factor of 0.1 when the tolerance level of four epochs without improvement was exceeded. We used a weight decay of 0.0001 and a momentum of 0.9 without Nesterov's accelerated gradient. All networks were trained on a single NVIDIA GTX 1080 Ti GPU. We fine-tuned DenseNet-121, ResNet-152, and VGG16 using RMSProp [31] with a decay of 0.9. Each network was trained for 100 epochs.
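This setup can be sketched in Keras as follows; the use of a `ReduceLROnPlateau` callback to implement the learning rate schedule is an assumption, and the weight decay would be applied through kernel regularizers or the optimizer, depending on the Keras version.

```python
# Hedged sketch of the training configuration described above.
from tensorflow.keras import optimizers, callbacks

sgd = optimizers.SGD(learning_rate=0.001, momentum=0.9)  # no Nesterov
# Decrease the learning rate by a factor of 0.1 after four epochs without
# improvement (our reading of the "tolerance level" of 4).
lr_schedule = callbacks.ReduceLROnPlateau(monitor="val_loss",
                                          factor=0.1, patience=4)
# model.compile(optimizer=sgd, loss="categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=8, epochs=100,
#           callbacks=[lr_schedule])

# Fine-tuning of DenseNet-121, ResNet-152, and VGG16 used RMSProp instead:
rmsprop = optimizers.RMSprop(learning_rate=0.001, rho=0.9)
```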
Figure 8 shows the inference time of each architecture during training with the same number of batches. As can be seen in the figure, the proposed method processed fewer images per second, but it had fewer parameters than the other architectures. Due to the increasing number of channels added in existing CNNs, their complexity leads to overfitting. In contrast, we used the same number of channels in all layers and increased only the number of convolutional layers, which made the method less complex. Table 5 gives an overview of each network's computational capability and parameters.
4.5. Performance Evaluation
In this section, we compare the performance of the proposed method with existing CNN methods, given the limited work on colorectal diseases using deep learning. We first evaluated different numbers of CNN layers, as shown in Table 4. Next, we evaluated the proposed method against baseline architectures with the same number of parameters. The results of our experiments and evaluation are summarized in Table 6, Table 7, Table 8, and Table 9. Table 9 shows the confusion matrix for the test data, whereas Table 6 shows the sensitivity, precision, and accuracy of the test results.
In the experimental results on the 128 normal images, there were eight misclassified cases, which contained food and excrement on the colonic mucosa and crescentic colorectal folds resembling polyps. It was also observed that the problem with the biggest impact on the overall error rate was images with small polyps that were difficult to distinguish from normal tissue; these accounted for a 4.6% error rate, the largest contribution to the total error rate of 5.61%. Across all test data, adenoma images showed the lowest precision, at 91.21%, whereas adenocarcinoma images showed the highest accuracy, at 97.05%. The overall accuracy was 94.39%.
A comparison with well-known networks from the ImageNet Challenge is shown in Table 9. When trained on the same dataset, the test results showed 87% accuracy for VGG19 [28], 90% for ResNet [29], and 89% for DenseNet [30].
5. Discussion
In this work, we presented an automated system that classifies colorectal diseases with high accuracy. The experimental evaluation showed that the proposed method can accurately differentiate high-risk polyps and adenocarcinoma in the endoscopic domain. The method leverages the VGG architecture and enables the development of effective, highly accurate models for colorectal images in comparison to existing approaches. Although this best-performing model processed fewer images per second during inference, classifying the diseases more accurately was the more important goal. The scarcity of data in the medical domain makes it difficult for CNN models to converge; however, our proposed method has far fewer parameters and converges quickly even when the dataset contains few images. Combined with the expertise of endoscopists, this technology can improve the quality and performance of colorectal cancer screening.
One drawback of our method for endoscopic characterization is its black-box nature. Therefore, a visualization method applied to the network during or after training needs to be developed. It would help doctors and medical experts gain insight into the influential regions and features in the image. Besides this, we plan to compare our performance with that of medical experts and to validate the efficacy of the method in clinical practice.
6. Conclusions
In this paper, we used our proposed deep neural network architecture to recognize the normal colon, adenomatous polyps, and adenocarcinoma in colonoscopy images. We studied the effect of adding convolutional layers to the network and, based on this, proposed a convolutional neural network architecture consisting of a total of 43 convolution layers and one fully-connected layer. To evaluate our network, we calculated the sensitivity, precision, and accuracy. Extensive experiments and evaluation demonstrated that our method is more accurate and better able to extract features from colorectal images. In the future, the endoscopy diagnosis system will be further developed and improved.