1. Introduction
According to a report from the World Health Organization in 2018, there were about 9.6 million deaths from cancer globally, of which 1.76 million were attributed to lung cancer [1]. Studies have identified environmental factors and smoking as major causes of lung cancer [2]. Generally, chest X-ray, computed tomography (CT), and magnetic resonance imaging are the modalities used to evaluate lung cancer [3,4]. The chest X-ray is the first test in diagnosing lung cancer, as it indicates abnormal formations in the lungs. Compared to a chest X-ray, a CT scan shows a more detailed view of the lungs, including the exact shape, size, and location of formations; it is therefore a major diagnostic tool for the assessment of lung cancer. To reduce the workload of analyzing CT images manually and to avoid subjective interpretation, machine learning techniques are applied in computer-aided diagnosis systems to provide objective auxiliary diagnosis. Lately, owing to the rapid growth of deep learning, convolutional neural networks (CNNs) not only show good performance in image classification and object detection tasks [5,6,7], but are also widely used in applications such as smart homes, driverless cars, manufacturing robots, drones, and chatbots. Research on CNNs continues to innovate and improve.
In 1998, LeCun et al. proposed LeNet-5 [8], a simple CNN for handwritten digit classification. LeNet-5 comprises a feature-extraction part (convolutional and pooling layers) and a classification part (fully connected layers). Subsequently, in 2012, Krizhevsky et al. proposed AlexNet [9] and won the ImageNet Large Scale Visual Recognition Competition. AlexNet replaces the Sigmoid and Tanh activation functions with the rectified linear unit (ReLU) and, unlike LeNet-5, introduces Dropout and max pooling. In 2014, Szegedy et al. proposed GoogLeNet [10] with its Inception module, which applies three different sizes of convolutional kernels simultaneously to extract more features within one layer. In the same year, Simonyan et al. proposed the VGGNet model [11]. VGGNet adopts stacked 3 × 3 convolutional layers and increases both the depth of the network and the number of input and output channels per layer. In 2015, He et al. proposed ResNet [12] and introduced the residual block, which alleviates the degradation problem of deep networks. Many more architectures have since been proposed. However, unbalanced or sparse data sets and network parameter settings for training remain two major problems faced by deep learning.
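As an illustration, the identity shortcut that defines a residual block can be sketched in plain Python; the transform standing in for the block's convolutional layers is a hypothetical stand-in, not part of any cited architecture:

```python
def relu(v):
    # element-wise ReLU over a feature vector
    return [max(0.0, x) for x in v]

def residual_block(x, transform):
    # ResNet's key idea: the block outputs relu(F(x) + x), so the layers
    # only need to learn the residual F. If F(x) ≈ 0, the block passes
    # x through almost unchanged, which eases training of deep networks.
    fx = transform(x)
    return relu([f + xi for f, xi in zip(fx, x)])

# Hypothetical transform standing in for a pair of convolutional layers.
out = residual_block([1.0, -2.0, 3.0], lambda v: [0.5 * xi for xi in v])
print(out)  # [1.5, 0.0, 4.5]
```

With a zero transform the block reduces to relu(x), illustrating how the shortcut makes the identity mapping easy to represent.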
Unbalanced data, especially in medical imaging, presents one of the most challenging problems in deep learning [13,14,15,16,17]. Typical data augmentation methods include translation, rotation, flipping, and zooming [18,19]. However, such geometric transformations might not provide sufficient data diversity. In 2014, generative adversarial networks (GANs) [20] were proposed to tackle the problem of sparse data. This model consists of two networks: a generator network and a discriminator network. The generator network aims to generate plausible fake images, while the discriminator network acts as a classifier that distinguishes real data from the data created by the generator. In 2015, deep convolutional GANs (DCGANs), a direct extension of GANs, were proposed [21], replacing the original convolutional layers with transposed convolutional layers. Later, many studies discussed complementary data processing techniques in medical applications. Perez et al. [22] investigated the impact of 13 data augmentation scenarios, such as traditional color and geometric transforms, elastic transforms, random erasing, and lesion mixing, for melanoma classification. The results confirmed that data augmentation can lead to greater performance gains than obtaining new images. Madani et al. [23] implemented GANs to produce chest X-ray images for dataset augmentation and showed higher accuracy for normal versus abnormal classification in chest X-rays.
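For reference, the basic geometric transformations mentioned above (flipping, rotation, and translation) can be sketched in plain Python on a tiny image represented as a list of rows; this is an illustrative sketch, not the augmentation pipeline used in this study:

```python
def hflip(img):
    # horizontal flip: reverse each row
    return [row[::-1] for row in img]

def rotate90(img):
    # rotate 90 degrees clockwise: reverse row order, then transpose
    return [list(row) for row in zip(*img[::-1])]

def translate_right(img, shift, fill=0):
    # shift pixels right by `shift` >= 1, padding vacated columns with `fill`
    return [[fill] * shift + row[:-shift] for row in img]

img = [[1, 2],
       [3, 4]]
print(hflip(img))               # [[2, 1], [4, 3]]
print(rotate90(img))            # [[3, 1], [4, 2]]
print(translate_right(img, 1))  # [[0, 1], [0, 3]]
```

Each transform yields a plausible new training sample, but as noted above, the resulting diversity is limited compared with generative approaches.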
On the other hand, selecting a better network parameter combination is another time-consuming task, since several experiments are required to determine the optimum combination. To reduce this time cost, many network parameter optimization methods have been proposed. Real et al. [24] introduced a genetic algorithm into CNN architecture design and achieved high accuracy on both the CIFAR-10 and CIFAR-100 data sets. An autonomous and continuous learning algorithm proposed by Ma et al. [25] automatically generates deep convolutional neural network (DCNN) architectures by partitioning the DCNN into multiple stacked meta convolutional blocks and fully connected blocks, then using genetic evolutionary operations to evolve a population of DCNN architectures. Although these methods achieve high accuracy, they are still time-consuming. The Taguchi method, proposed by Dr. Genichi Taguchi, has been widely applied as a design method [26,27,28]. It is not only straightforward and easy to implement in many engineering situations but also able to narrow down the scope of a research project quickly.
In the present study, the main contributions are to alleviate the problem of sparse medical images and to use a parameter optimizer to select an optimal network parameter combination in fewer experiments, based on state-of-the-art CNNs, to provide accurate and generally applicable lung tumor classification. First, a GAN was introduced to augment CT images and thereby increase data diversity, improving the accuracy of CNNs. The AlexNet architecture was then chosen as the backbone classification network, coupled with a parameter optimizer capable of selecting a better parameter combination in fewer experiments. The rest of this paper is organized as follows. Section 2 describes a data augmentation method to increase lung tumor CT images. Section 3 reviews the CNN architecture and introduces the network parameter optimizer. The experimental results and discussions are detailed in Section 4. Section 5 draws the conclusions.
3. CNN Architecture and Parameter Optimizer
This section reviews a CNN architecture and describes how CNN parameters can be adjusted using the parameter optimizer. Figure 5 illustrates the flowchart of the parameter optimization process.
3.1. CNNs
CNNs are the models most commonly used for image recognition and usually consist of three parts: convolutional, pooling, and fully connected (FC) layers. The convolutional and pooling layers are the most crucial parts for extracting global and local features.
3.1.1. Convolutional Layer
The convolutional layer (C) contains several kernels that are used to extract features from images. Each convolutional layer is covered by kernels with various weight combinations. A kernel performs convolution by sliding over the input and computing, at each spatial position, the inner product between the kernel and the corresponding input patch, generating a feature map. Finally, the output of the convolutional layer is obtained by stacking the feature maps of all kernels in the depth direction.
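The sliding inner product described above can be sketched as a naive valid (unpadded) convolution in plain Python; this is an illustrative sketch, not an implementation from the study:

```python
def conv2d(image, kernel, stride=1):
    # valid convolution (no padding): slide the kernel over the image
    # and take the inner product at each spatial position
    kh, kw = len(kernel), len(kernel[0])
    h = (len(image) - kh) // stride + 1
    w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            s = sum(image[i * stride + m][j * stride + n] * kernel[m][n]
                    for m in range(kh) for n in range(kw))
            row.append(s)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # sums each pixel with its lower-right neighbour
print(conv2d(image, kernel))  # [[6, 8], [12, 14]]
```

In a real CNN, one such feature map is produced per kernel, and the maps are stacked along the depth axis as described above.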
3.1.2. Pooling Layer
The objective of using a pooling layer (Pool) is to reduce the size of feature maps without losing important feature information, thereby reducing subsequent computation. Pooling can be performed using several methods, including average and max pooling. Average pooling takes the average value within the selected patch of the feature map, whereas max pooling takes the maximum value. In addition, padding (P) is seldom applied in the pooling layer. The pooling layer also generates no trainable variables.
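Both pooling variants can be sketched as follows; this is illustrative only, and the 2 × 2 window with stride 2 is a typical choice rather than the configuration used in this study:

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    # reduce a feature map by taking the max (or average) of each patch
    h = (len(fmap) - size) // stride + 1
    w = (len(fmap[0]) - size) // stride + 1
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            patch = [fmap[i * stride + m][j * stride + n]
                     for m in range(size) for n in range(size)]
            row.append(max(patch) if mode == "max" else sum(patch) / len(patch))
        out.append(row)
    return out

fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [9, 2, 1, 0],
        [3, 4, 5, 6]]
print(pool2d(fmap, mode="max"))  # [[7, 8], [9, 6]]
print(pool2d(fmap, mode="avg"))  # [[4.0, 5.0], [4.5, 3.0]]
```

Note that, consistent with the text, no weights appear anywhere in the operation: pooling has no trainable parameters.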
3.1.3. Activation Function
In neural networks, each neuron is connected to other neurons in order to pass signals from the input layer to the output layer in one direction. The activation layer relates to the forward propagation of the signal through the network. The purpose of the activation function is to apply a nonlinear function to the output of each neuron so that the network can solve complex nonlinear problems. Sigmoid, tanh, and ReLU are common activation functions, with ReLU being among the most widely used. ReLU, as expressed in Equation (1), addresses the vanishing gradient problem and can reduce the degree of overfitting, as displayed in Figure 6.
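Assuming Equation (1) is the standard definition ReLU(x) = max(0, x), a minimal sketch is:

```python
def relu(x):
    # Equation (1): ReLU(x) = max(0, x)
    return max(0.0, x)

print(relu(3.2))   # 3.2
print(relu(-1.7))  # 0.0
# Unlike Sigmoid and Tanh, the gradient is exactly 1 for x > 0,
# which helps mitigate the vanishing gradient problem in deep networks.
```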
3.1.4. Fully Connected Layer
The fully connected (FC) layer functions as a classifier. It converts the two-dimensional feature maps output by the convolutional layers into a one-dimensional vector. The final probability of each label is obtained using Softmax.
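The Softmax step can be sketched as follows for a two-output head such as the benign/malignant classifier described below; the logit values are hypothetical:

```python
import math

def softmax(logits):
    # subtract the max logit for numerical stability,
    # then exponentiate and normalize to a probability distribution
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical two-output head (e.g., benign vs. malignant).
probs = softmax([2.0, 0.5])
print(probs)  # two probabilities summing to 1; the larger logit wins
```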
LeNet-5 and AlexNet contain fewer layers and simpler architectures compared with other, deeper CNNs. Of the two, AlexNet has not only shown good performance in many applications, but also accepts color images as input, such as computed tomography images. Therefore, with data augmentation and the parameter optimizer implemented, AlexNet is a suitable network architecture for this study. AlexNet consists of five convolutional layers, three pooling layers, three FC layers, and a Softmax with 1000 outputs. Because the aim of this study is to classify lung CT images into benign or malignant tumors, the transfer learning technique was applied to change the last FC layer to two outputs. The AlexNet architecture is illustrated in Figure 7, and Table 3 lists the details of AlexNet.
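The spatial dimensions of such layers follow the standard output-size formula floor((W − K + 2P)/S) + 1. As a sketch, applying it to AlexNet's commonly reported first layers (227 × 227 input, 11 × 11 kernel, stride 4, then 3 × 3 max pooling with stride 2):

```python
def conv_output_size(w, kernel, stride, padding):
    # standard formula: floor((W - K + 2P) / S) + 1
    return (w - kernel + 2 * padding) // stride + 1

# AlexNet's first convolutional layer: 227x227 input, 11x11 kernel,
# stride 4, no padding -> 55x55 feature maps
print(conv_output_size(227, 11, 4, 0))  # 55
# The following 3x3 max pooling with stride 2 then gives 27x27
print(conv_output_size(55, 3, 2, 0))    # 27
```

The same formula underlies the kernel size, stride, and padding factors optimized in Section 3.2.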
3.2. Parameter Optimization
Selecting an optimal network parameter combination is a time-consuming task. In this study, the objective is to investigate the performance of CNNs using parameter optimization. The Taguchi method is a low-cost, high-efficiency quality engineering method that emphasizes improving product quality through design experiments. Therefore, the Taguchi method was applied for the parameter optimization of CNNs.
First, the objective function is defined. Then, the factors and levels that affect the objective function are selected. The orthogonal array and the signal-to-noise (S/N) ratio are the two main tools of the Taguchi method. The orthogonal array determines the number of experiments required and allocates the experimental factors across them. The S/N ratio is used to verify whether a given CNN parameter combination is optimal. Finally, the optimal factors and levels are determined according to the experimental results. In this way, the optimal combination of factors and levels can be found while keeping the cost of experimentation low.
Figure 8 displays the flowchart of the Taguchi method.
Understand the task to be completed. Here, the CNN parameters, including kernel size (KS), stride (S), and padding (P), were to be optimized in order to achieve higher accuracy in fewer experiments.
Select factors and levels. In AlexNet, the first convolutional layer performs global feature extraction, and the fifth convolutional layer performs local feature extraction of the input image. Therefore, the KS, S, and P of the first and fifth convolutional layers were adjusted by the Taguchi method. The factors are the kernel size (C1-KS), stride (C1-S), and padding (C1-P) of the first convolutional layer, and the kernel size (C5-KS), stride (C5-S), and padding (C5-P) of the fifth convolutional layer. The levels are assigned according to the parameters commonly used in state-of-the-art CNNs, as shown in Table 4.
Choose an appropriate orthogonal array. The orthogonal array provides statistical information with fewer experiments. After selecting the factors and levels, the appropriate orthogonal array is chosen based on them. In this study, C1-P had two levels, and C1-KS, C1-S, C5-KS, C5-S, and C5-P each had three levels. The total degrees of freedom in the experiment is 11; therefore, the L18 orthogonal array was selected. A full factorial design over the selected factors and levels would require 486 (3 × 3 × 2 × 3 × 3 × 3) experiments, whereas the orthogonal array reduced the scope to only 18 experiments.
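The degree-of-freedom count and the reduction from a full factorial design can be verified with a short calculation (each factor contributes its number of levels minus one):

```python
# Levels per factor, as selected above.
levels = {"C1-KS": 3, "C1-S": 3, "C1-P": 2,
          "C5-KS": 3, "C5-S": 3, "C5-P": 3}

# Degrees of freedom: sum of (levels - 1) over all factors.
dof = sum(n - 1 for n in levels.values())
print(dof)  # 11

# A full factorial design would test every combination of levels.
full_factorial = 1
for n in levels.values():
    full_factorial *= n
print(full_factorial)  # 486 runs, versus 18 with the orthogonal array
```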
Fill in the L18 orthogonal array with the factors and levels designed in Table 4. The complete L18 orthogonal array is presented in Table 5.
Perform 18 experiments based on the orthogonal array. In this study, each experiment was repeated five times to obtain an overall accuracy.
Calculate the S/N ratio and analyze the experimental data.
Accurate classification of lung tumor images is the purpose of this study. Hence, a higher S/N ratio indicates a parameter combination that is closer to optimal and able to provide superior performance.
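Since classification accuracy is a larger-the-better characteristic, the corresponding standard Taguchi S/N criterion is S/N = −10 log₁₀((1/n) Σ 1/yᵢ²). The following is a generic sketch with hypothetical accuracies from five repeated runs; the paper's exact values and computation may differ:

```python
import math

def sn_larger_the_better(ys):
    # Taguchi larger-the-better criterion:
    # S/N = -10 * log10( (1/n) * sum(1 / y_i^2) )
    n = len(ys)
    return -10.0 * math.log10(sum(1.0 / y ** 2 for y in ys) / n)

# Hypothetical accuracies from five repetitions of two experiments.
run_a = [0.90, 0.91, 0.89, 0.92, 0.90]
run_b = [0.80, 0.85, 0.78, 0.82, 0.81]
print(sn_larger_the_better(run_a) > sn_larger_the_better(run_b))  # True
```

Higher (less negative) S/N values correspond to consistently higher accuracies, which is why the combination maximizing the S/N ratio is taken as optimal.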
Finally, retrain AlexNet with the acquired optimal parameter combination to verify that it improves the accuracy of the network.