
HMIC: Hierarchical Medical Image Classification, A Deep Learning Approach

1. Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22904, USA
2. Office of Health Informatics and Analytics, University of California, Los Angeles (UCLA), CA 90095, USA
3. Sensing Systems for Health Lab, University of Virginia, Charlottesville, VA 22911, USA
4. Department of Pediatrics, School of Medicine, University of Virginia, Charlottesville, VA 22903, USA
5. Department of Paediatrics and Child Health, The Aga Khan University, Karachi 74800, Pakistan
6. Tropical Gastroenterology and Nutrition Group, University of Zambia School of Medicine, 32379 Lusaka, Zambia
7. Blizard Institute, Barts and The London School of Medicine, Queen Mary University of London, London E1 4NS, UK
8. School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
* Authors to whom correspondence should be addressed.
Information 2020, 11(6), 318; https://doi.org/10.3390/info11060318
Submission received: 6 May 2020 / Revised: 9 June 2020 / Accepted: 10 June 2020 / Published: 12 June 2020
(This article belongs to the Section Information Processes)

Abstract

Image classification is central to the big data revolution in medicine. Improved information processing methods for the diagnosis and classification of digital medical images have proven successful via deep learning approaches. As this field is explored, limitations of traditional supervised classifiers have become apparent. This paper outlines an approach that differs from current medical image classification work, which treats the problem as multi-class classification: we instead perform hierarchical classification using our Hierarchical Medical Image Classification (HMIC) approach. HMIC uses stacks of deep learning models to provide specialized understanding at each level of the medical image hierarchy. To test performance, we use small bowel biopsy images with three categories at the parent level (Celiac Disease, Environmental Enteropathy, and histologically normal controls); at the child level, Celiac Disease severity is classified into four classes (I, IIIa, IIIb, and IIIc).

1. Introduction and Related Works

Automatic diagnosis of diseases based on medical image categorization has become an increasingly active and challenging research area over the last several years [1,2,3]. Research involving deep learning architectures for image analysis has grown in the past few years, with increasing interest in exploring and understanding domain applications [3,4,5,6,7]. Deep learning models have achieved state-of-the-art results in a wide variety of fundamental tasks, such as image classification in the medical domain [8,9]. This growth has raised questions regarding the classification of sub-types of disease across a range of disciplines, including cancer (e.g., stage of cancer), Celiac Disease (e.g., Marsh score severity class), and Chronic Kidney Disease (e.g., Stages 1–5), among others [10]. It is therefore important not just to label medical images by specialized area, but also to organize them within an overall field (i.e., the name of the disease) with the accompanying sub-field (i.e., the sub-type of the disease), which we have done in this paper via Hierarchical Medical Image Classification (HMIC). Hierarchical models also combat the problem of unbalanced medical image datasets during training and have been successful in other domains [11,12].
In the literature, few efforts have been made to leverage the hierarchical structure of categories. Nevertheless, hierarchical models have shown better performance than flat models in image classification across multiple domains [13,14,15]. These models exploit the hierarchical structure of object categories to decompose the classification task into multiple steps. Yan et al. proposed HD-CNN, which embeds deep CNNs into a category hierarchy [13]. This model separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. In a CNN, shallow layers capture low-level features while deeper layers capture high-level ones. Zhu and Bain proposed the Branch Convolutional Neural Network (B-CNN) [16] based on this characteristic of CNNs. Instead of employing different classifiers for different levels of the class hierarchy, this model exploits the hierarchical structure of the layers of a single CNN and embeds the levels of the class hierarchy into it. B-CNN outputs multiple predictions, ordered from coarse to fine, along concatenated convolutional layers corresponding to the hierarchical structure of the target classes. Sali et al. employed the B-CNN model for the classification of gastrointestinal disorders on histopathological images [17].
Our paper uses the HMIC approach for the assessment of small bowel enteropathies: Environmental Enteropathy (EE) versus Celiac Disease (CD) versus histologically normal controls. EE is a common cause of stunting in Low-to-Middle Income Countries (LMICs), for which there is no universally accepted, clear diagnostic algorithm or non-invasive biomarker for accurate diagnosis [18], making this a critical priority [19]. Linear growth failure (or stunting) is associated with irreversible physical and cognitive deficits, with profound developmental implications [18]. Interestingly, CD, a common cause of stunting in the United States with an estimated 1% prevalence, is an autoimmune disorder caused by gluten sensitivity [20] and shares many histological features with EE (such as increased inflammatory cells and villous blunting) [18]. This resemblance has led to the major challenge of differentiating clinical biopsy images for these similar but distinct diseases. CD severity is further assessed via the Modified Marsh Score Classification, which takes into account the architecture of the duodenum: finger-like projections (called "villi") are lined by epithelial cells, and between the villi are crevices called crypts that contain regenerating epithelial cells. The normal villus-to-crypt ratio is between 3:1 and 5:1, and a healthy duodenum (the first part of the small intestine) has no more than 30 lymphocytes interspersed per 100 epithelial cells within the villus surface layer (epithelium). Marsh I comprises normal villus architecture with an increased number of intraepithelial lymphocytes. Marsh II has increased intraepithelial lymphocytes along with crypt hypertrophy (crypts appear enlarged); it is rare, since patients typically progress rapidly from Marsh I to IIIa. Marsh III is sub-divided into Marsh IIIa (partial villus atrophy), Marsh IIIb (subtotal villus atrophy), and Marsh IIIc (total villus atrophy), all with crypt hypertrophy and increased intraepithelial lymphocytes. Finally, in Marsh IV, the villi are completely atrophied [21].
The HMIC approach is shown in Figure 1. The parent-level model is trained on the parent level of the data: EE, CD, or Normal. The child-level model is trained on the sub-classes of CD, graded by Modified Marsh Score severity (I, IIIa, IIIb, and IIIc).
The rest of this paper is organized as follows: Section 2 describes the data sets used in this work, and Section 3 describes the required pre-processing steps. Section 4 introduces the baseline models. The architecture of our model is explained in Section 5. Empirical results are elaborated in Section 6. Finally, Section 7 concludes the paper and outlines future directions.

2. Data Source

As shown in Table 1, biopsies were obtained from 150 children in this study, with a median (interquartile range, IQR) age of 37.5 (19.0 to 121.5) months and a roughly equal sex distribution (77 males, 51.3%). The LAZ/HAZ (Length/Height-for-Age Z score) of the EE participants was −2.8 (IQR: −3.6 to −2.3) for the Pakistan cohort and −3.1 (IQR: −4.1 to −2.2) for the Zambia cohort; the LAZ/HAZ of the Celiac participants was −0.3 (IQR: −0.8 to 0.7), and that of the Normal participants was 0.2 (IQR: −1.3 to 0.5). Duodenal biopsy samples were developed into 461 whole-slide biopsy images and labeled as either Normal, EE, or CD. The biopsy slides for EE patients were collected from the Aga Khan University Hospital (AKUH) in Karachi, Pakistan (n = 29 slides from 10 patients), and the University of Zambia Medical Center in Lusaka, Zambia (n = 16). The slides for Normal patients (n = 63) and CD (n = 34) were collected from the University of Virginia (UVa). Normal and CD slides were converted into whole-slide images at 40× magnification using the Leica SCN 400 slide scanner (Meyer Instruments, Houston, TX, USA) at UVa; the EE slides were digitized at 20× magnification and shared through the Environmental Enteric Dysfunction Biopsy Investigators (EEDBI) Consortium WUPAX server. The patient population is as follows:
The median age (Q1, Q3) of our whole study population was 37.5 (19.0, 121.5) months, and we had a roughly equal distribution of females (48%, n = 49) and males (52%, n = 53). Our study population consisted of histologically normal controls (37.7%), CD patients (51.8%), and EE patients (10.05%).
A total of 239 Hematoxylin and eosin (H&E) stained duodenal biopsy samples were collected from the archived biopsies of 63 CD patients from the University of Virginia (UVa) in Charlottesville, VA, USA. The samples were converted into whole-slide images at 40× magnification using the Leica SCN 400 slide scanner (Meyer Instruments, Houston, TX, USA) at the Biorepository and Tissue Research Facility at UVa. The median age of the UVa patient population is 130 months, with interquartile bounds of 85.0 and 176.0 months for Q1 and Q3, respectively. The UVa images had a roughly equal distribution of females (54%, n = 34) and males (46%, n = 29). The biopsy labels for this research were determined by two clinical experts and approved by a pathologist with subspecialty expertise in gastroenterology. Our dataset ranges from Marsh I to IIIc, with no biopsy classified as Marsh II.
Based on Table 2, the biopsy images are patched into 91,899 total image patches: 32,393 Normal patches, 29,308 EE patches, and 30,198 CD patches. At the child level, CD contains four disease severities (Type I, IIIa, IIIb, and IIIc), with 7125 Type I patches, 6842 Type IIIa patches, 8120 Type IIIb patches, and 8111 Type IIIc patches. The training set for Normal and EE contains 22,676 and 20,516 patches, respectively, with 9717 and 8792 patches, respectively, for testing. For CD, we have two sets of training and testing, one belonging to the parent model and the other to the child level. The parent set contains 21,140 patches for training and 9058 patches for testing, all with the common label of CD. In the CD child dataset, we have the four severity types of this disease (I, IIIa, IIIb, and IIIc): Type I contains 4988 training and 2137 test patches; Type IIIa contains 4790 training and 2052 test patches; Type IIIb contains 5684 training and 2436 test patches; and Type IIIc contains 5678 training and 2433 test patches.

3. Pre-Processing

In this section, we describe all of the pre-processing steps, which include medical image patching, image clustering to remove useless patches, and color balancing to address the staining problem. The biopsy images are unstructured, vary in size, and are often of too high a resolution to process directly with deep neural networks. It therefore becomes necessary to tile the whole-slide images into smaller image subsets called patches. Many of the patches created after tiling a whole-slide image will not contain useful biopsy tissue data; for example, some patches contain only the white or light-gray background area. The image clustering section describes the process of selecting useful patches. Lastly, color balancing is used to address staining variation, a typical issue in histological image preparation.

3.1. Image Patching

Although the effectiveness of CNNs in image classification has been shown in various studies across different domains, training on high-resolution Whole Slide Tissue Images (WSI) is not commonly preferred due to its high computational cost. Applying CNNs directly to WSIs can also lose a large amount of discriminative data because of severe down-sampling [22]. Due to cellular-level contrasts between Celiac Disease, Environmental Enteropathy, and Normal cases, an image classification model operating on patches can perform at least as well as a WSI-level classifier [22]. For this study, patches are labeled with the same class as the associated WSI, and the CNN models are trained to predict the presence of disease or disease severity at the patch level.
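As an illustration, the following is a minimal sketch of this patching step, assuming slides have already been loaded as RGB NumPy arrays; the tile_slide helper and the non-overlapping tiling are our illustrative assumptions, with the 1000 × 1000 patch size taken from Section 5.

```python
import numpy as np

def tile_slide(slide: np.ndarray, label: int, patch_size: int = 1000):
    """Tile a whole-slide RGB array into non-overlapping patches, each
    inheriting the label of its parent slide."""
    h, w, _ = slide.shape
    patches, labels = [], []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(slide[y:y + patch_size, x:x + patch_size])
            labels.append(label)  # patch-level label = WSI-level label
    return patches, labels
```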

3.2. Clustering

As shown in Figure 2, each whole-slide biopsy image is divided into patches, and many of these patches are not useful input for a deep image classification model. Such patches tend to contain only connective tissue, lie on the border region of the tissue, or consist entirely of image background [2]. A two-stage clustering process was applied to recognize these irrelevant patches. In the first step, a convolutional autoencoder was used to learn a vectorized feature representation of each patch; in the second step, we used k-means clustering to assign patches to two groups: useful and not useful. Figure 3 depicts the pipeline of our clustering strategy, which contains both the autoencoder and k-means clustering.

3.2.1. Autoencoder

An autoencoder is a form of neural network designed to output a reconstruction of its input [23]. The autoencoder has achieved considerable success as a dimensionality reduction technique [24]. The first version of the autoencoder was presented by D.E. Rumelhart et al. [25] in 1985. The fundamental concept is that one hidden layer acts as a bottleneck with far fewer nodes than the other layers in the model [26]. This condensed hidden layer can represent the important features of the image with a smaller amount of data. With image inputs, autoencoders can convert unstructured data into feature vectors that can be processed by other machine learning methods, such as the k-means clustering algorithm.

Encode

A CNN-based autoencoder can be divided into two principal steps [27]: encoding and decoding. The encoding convolution is:
$$O_m(i,j) = a\left(\sum_{d=1}^{D}\ \sum_{u=-2k-1}^{2k+1}\ \sum_{v=-2k-1}^{2k+1} F^{(1)}_{m_d}(u,v)\, I_d(i-u,\ j-v)\right), \quad m = 1, \ldots, n \qquad (1)$$
where $F \in \{F_1^{(1)}, F_2^{(1)}, \ldots, F_n^{(1)}\}$ is a set of convolutional filters convolved with an input volume $I = \{I_1, \ldots, I_D\}$; the network learns to represent the input by combining non-linear functions:
$$z_m = O_m = a(I * F_m^{(1)} + b_m^{(1)}), \quad m = 1, \ldots, n \qquad (2)$$
where $b_m^{(1)}$ is the bias, and the number of zeros used to pad the input is chosen such that $\dim(I) = \dim(\mathrm{decode}(\mathrm{encode}(I)))$. Finally, the output dimensions of the encoding convolution are:
$$O_w = O_h = (I_w + 2(2k+1) - 2) - (2k+1) + 1 = I_w + (2k+1) - 1 \qquad (3)$$

Decode

The decoding step produces $n$ feature maps $z_{m=1,\ldots,n}$. The reconstructed result $\tilde{I}$ is the convolution between the volume of feature maps $Z = \{z_i\}_{i=1}^{n}$ and the convolutional filter volume $F^{(2)}$ [28,29]:
$$\tilde{I} = a(Z * F_m^{(2)} + b^{(2)}) \qquad (4)$$
$$O_w = O_h = (I_w + (2k+1) - 1) - (2k+1) + 1 = I_w = I_h \qquad (5)$$
where Equation (5) gives the dimensions of the decoding convolution: the output's dimensions equal the input's dimensions.
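The following Keras sketch illustrates a convolutional autoencoder of this form; the layer counts, filter sizes, and 256 × 256 input are illustrative assumptions rather than the exact architecture used in the paper. The flattened bottleneck output serves as the feature vector passed to k-means in the next step.

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(256, 256, 3))
x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(inp)
x = layers.MaxPooling2D((4, 4))(x)                 # 256 -> 64
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = layers.MaxPooling2D((4, 4))(x)           # 64 -> 16 (bottleneck)

x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = layers.UpSampling2D((4, 4))(x)                 # 16 -> 64
x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = layers.UpSampling2D((4, 4))(x)                 # 64 -> 256
decoded = layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = models.Model(inp, decoded)
encoder = models.Model(inp, layers.Flatten()(encoded))  # feature extractor
autoencoder.compile(optimizer='adam', loss='mse')
# Training the autoencoder to reconstruct its own input, e.g.
# autoencoder.fit(patch_array, patch_array, epochs=10), makes `encoder`
# yield the condensed bottleneck vectors described above.
```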

3.2.2. K-Means

K-means clustering is one of the most popular clustering algorithms [30,31,32,33,34] for data of the form $D = \{x_1, x_2, \ldots, x_n\}$, where each $x_i$ is a $d$-dimensional vector ($x \in \mathbb{R}^d$). K-means has been applied to image and data clustering for information retrieval [30,35,36]. The aim is to identify groups of similar data points and assign each point to one of the groups. There are many other clustering algorithms, but the k-means approach works well for this problem because there are only two clusters, and it is computationally inexpensive compared to other methods.
As an unsupervised approach, one measure of effective clustering is to sum the distances of each data point from the centroids of the assigned clusters. The goal of K-means is to minimize ξ , the sum of these distances, by determining optimal centroid locations and cluster assignments. This algorithm can be difficult to optimize due to the volatility of cluster assignments as the centroid locations change. Therefore, the K-means algorithm is a greedy-like approach that iteratively adjusts these locations to solve the minimization.
We minimize $\xi$ with respect to $A$ and $\mu$:
$$\xi = \sum_{j=1}^{k} \sum_{x_i \in w_j} \|x_i - \mu_j\|^2 = \sum_{j=1}^{k} \sum_{i=1}^{n} A_{ij} \|x_i - \mu_j\|^2 \qquad (6)$$
where x i are values from the autoencoder feature representation, μ j is the centroid of each cluster, and A i j is the cluster assignment of each data point i with cluster j. A i j can only take on binary values and each data point can only be assigned to a single cluster.
The centroid μ of each cluster is calculated as follows:
$$\mu(w) = \frac{1}{|w|} \sum_{\bar{x} \in w} \bar{x} \qquad (7)$$
Finally, as shown in Figure 4, all patches are assigned to one of two clusters: one containing patches with useful information, and one containing patches that are empty or lack medical information. Algorithm 1 outlines the k-means algorithm for clustering medical image patches into two clusters; a brief sketch follows it.
Algorithm 1. K-means algorithm for clustering medical image patches into two clusters. (The algorithm listing is rendered as an image in the original and is not reproduced here.)
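A minimal sketch of this two-cluster assignment, assuming scikit-learn and the `encoder` and `patches` variables from the sketches above; the brightness-based rule for picking the informative cluster is an assumption, motivated by background patches being mostly white.

```python
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans

# Resize patches to the autoencoder's input size and encode them.
x = tf.image.resize(np.stack(patches) / 255.0, (256, 256)).numpy()
features = encoder.predict(x)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Keep the cluster whose patches look like tissue; here we assume the
# cluster with the lower mean brightness is the informative one, since
# background patches are mostly white or light gray.
brightness = [np.mean([p.mean() for p, c in zip(patches, km.labels_) if c == k])
              for k in (0, 1)]
useful = int(np.argmin(brightness))
useful_patches = [p for p, c in zip(patches, km.labels_) if c == useful]
```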

3.3. Medical Image Staining

Hematoxylin and eosin (H&E) stains have been used for at least a century and are still essential for recognizing various tissue types and the morphologic changes that form the basis of contemporary CD, EE, and cancer diagnosis [37]. H&E is used routinely in histopathology laboratories, as it provides the pathologist/researcher a very detailed view of the tissue [38]. Color variation is an important problem in histopathology based on light microscopy. A range of factors makes this problem even more complex, such as the use of different scanners, variable chemical coloring/reactivity from different manufacturers or batches of stains, coloring that depends on the staining procedure (timing, concentrations, etc.), and light transmission that varies with section thickness [39]. Varying H&E staining appearances within machine learning inputs can cause the model to focus on broad color variations during training. For example, if images with a certain label all have a unique stain color appearance because they all originated from the same location, the machine learning model will likely leverage the stain appearance to classify the images rather than the important medical cellular features.

3.3.1. Color Balancing

The idea of color balancing in this study is to convert images into a similar color space to account for variations in H&E staining. Images can be represented in terms of the illuminant spectral power distribution $I(\lambda)$, the surface spectral reflectance $S(\lambda)$, and the sensor spectral sensitivities $C(\lambda)$ [40,41]. Using this notation [41], the sensor response at the pixel with coordinates $(x, y)$ can be written as:
$$p(x,y) = \int_{w} I(x,y,\lambda)\, S(x,y,\lambda)\, C(\lambda)\, d\lambda \qquad (8)$$
where $w$ is the wavelength range of the visible light spectrum, and $p$ and $C(\lambda)$ are three-component vectors. The color balancing transform is then:
$$\begin{bmatrix} R \\ G \\ B \end{bmatrix}_{out} = \left( \alpha \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} r_i & 0 & 0 \\ 0 & g_i & 0 \\ 0 & 0 & b_i \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{in} \right)^{\gamma} \qquad (9)$$
where $RGB_{in}$ denotes the raw RGB values of the medical images, and the diagonal matrix $\mathrm{diag}(r_i, g_i, b_i)$ provides the channel-independent gain compensation for the illuminant [41]. $RGB_{out}$ is the output that is sent to the input feature space of the CNN models, and $\gamma$ is the gamma correction defined for the RGB color space. In the following, a more compact version of Equation (9) is used:
$$RGB_{out} = (\alpha\, A\, I_w \cdot RGB_{in})^{\gamma} \qquad (10)$$
where $\alpha$ denotes the exposure compensation gain, $I_w$ is the diagonal matrix for illuminant compensation, and $A$ is the color matrix transformation [41].
Figure 5 shows the output for the three classes (CD, EE, and Normal) with color balancing (CB) percentages ranging between 0.01 and 50.
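A NumPy sketch of the compact transform of Equation (10) follows; the matrix $A$, gains, and gamma value below are illustrative placeholders, not the fitted values used in the paper.

```python
import numpy as np

def color_balance(rgb_in, alpha=1.0, gains=(1.0, 1.0, 1.0), A=None, gamma=2.2):
    """Apply Equation (10): exposure gain, illuminant compensation, and a
    3x3 color-matrix transform to an RGB image in [0, 1], then gamma-correct."""
    A = np.eye(3) if A is None else A     # placeholder color matrix
    I_w = np.diag(gains)                  # channel-wise illuminant gains
    M = alpha * A @ I_w                   # combined 3x3 transform
    flat = rgb_in.reshape(-1, 3) @ M.T    # apply to every pixel
    return np.clip(flat, 0.0, 1.0).reshape(rgb_in.shape) ** gamma

# Example (hypothetical gains): balanced = color_balance(patch / 255.0,
#                                                        gains=(1.05, 1.0, 0.95))
```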

3.3.2. Stain Normalization

Histological images can have significant variations in stain appearance that introduce biases during model training [1]. The variations arise from many factors, such as differences in the raw materials and manufacturing procedures of stain vendors, the staining protocols of labs, and the color responses of digital scanners [1,42]. To solve this problem, the stains of all images are normalized to a single stain appearance. Different stain normalization approaches have been proposed; in this paper, we used the methodology proposed by Vahadane et al. [42] for the CD-severity child level, since all of its images were collected from one center. This methodology preserves the structure of cellular features after stain normalization and accomplishes stain separation with non-negative matrix factorization. Figure 6 shows example outputs before and after applying this method to biopsy patches.
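As an illustration, stain normalization of this kind can be performed with the open-source staintools package; the paper does not name its implementation, so the snippet below is an assumption about tooling, not the authors' code. All images are mapped to the stain appearance of one target image.

```python
import staintools

# Reference stain appearance and an image to be normalized (paths assumed).
target = staintools.read_image("target_patch.png")
source = staintools.read_image("source_patch.png")

normalizer = staintools.StainNormalizer(method="vahadane")
normalizer.fit(staintools.LuminosityStandardizer.standardize(target))
normalized = normalizer.transform(
    staintools.LuminosityStandardizer.standardize(source))
```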

4. Baseline

4.1. Deep Convolutional Neural Networks

A Convolutional Neural Network (CNN) performs hierarchical medical image classification for each individual image. The original version of the CNN was built for image processing, with an architecture similar to the visual cortex. In this basic CNN baseline for image processing, an image tensor is convolved with a set of kernels of size d × d. These convolution layers, called feature maps, can be stacked to provide multiple filters over the input. We used a flat (non-hierarchical) CNN as one of our baselines.

4.2. Deep Neural Networks

A Deep Neural Network (DNN), or multilayer perceptron, is trained through multiple layers of connections. Each hidden layer receives connections from the previous layer's nodes and provides connections only to the next layer. The input is the flattened feature space (RGB values), and the output layer has one node per class for multi-class classification (six nodes: Normal, EE, and the four CD severities). Our baseline DNN (multilayer perceptron) is a discriminatively trained model that uses the standard back-propagation algorithm with sigmoid (Equation (12)) and Rectified Linear Unit (ReLU) [43] (Equation (13)) activation functions. The output layer uses the Softmax function, as the task has a multi-class output (Equation (14)).

5. Method

In this section, we explain our Deep Convolutional Neural Network (CNN) design, comprising the convolutional layers, activation functions, pooling layers, and, finally, the optimizer. We then describe our architecture for diagnosing Celiac Disease and Environmental Enteropathy. As shown in Figure 7, the input layer consists of image patches of size 1000 × 1000 pixels and is connected to the first convolutional layer (Conv 1). Conv 1 connects to its following pooling layer (MaxPooling). The pooling layer is connected to the second convolutional layer (Conv 2). The last convolutional layer (Conv 3) is flattened and connected to a fully connected multi-layer perceptron. The final layer includes three nodes, each representing one class.

5.1. Convolutional Neural Networks

5.1.1. Convolutional Layer

Convolutional Neural Networks are deep learning models that can be used for hierarchical classification tasks, especially image classification [44]. CNNs were initially designed for image and computer vision tasks, with a design similar to the visual cortex, and have been used successfully for clinical image classification. In CNNs, an image tensor is convolved with a set of d × d kernels. These convolutions ("feature maps") can be stacked to represent many different features detected by the filters in that layer. The feature dimensions of the output and input can differ [45]. The procedure for computing a single output matrix is defined as follows:
$$A_j = f\left(\sum_{i=1}^{N} I_i * K_{i,j} + B_j\right) \qquad (11)$$
Each input matrix $I_i$ is convolved with its corresponding kernel matrix $K_{i,j}$, and a bias $B_j$ is added. Finally, an activation function (the non-linear activation functions are explained in Section 5.1.3) is applied to each individual element [45].
During CNN training, the biases and weights are adjusted through back-propagation to form effective feature detection filters. The feature map filters are applied across all three channels [46].

5.1.2. Pooling Layer

To reduce computational complexity, CNNs use pooling layers, which decrease the size of the output from one layer to the next. Different pooling techniques are used to reduce the output size while preserving significant features [47]. The most widely used is max-pooling, in which the largest activation within the pooling window is selected.

5.1.3. Neuron Activation

The CNN is implemented as a discriminative method that uses a back-propagation algorithm with sigmoid (Equation (12)) or Rectified Linear Unit (ReLU) [43] (Equation (13)) activation functions. The final layer contains one node with a sigmoid activation function for binary classification, or one node per class with a Softmax activation function for multi-class problems (Equation (14)).
$$f(x) = \frac{1}{1 + e^{-x}} \in (0, 1) \qquad (12)$$
$$f(x) = \max(0, x) \qquad (13)$$
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j \in \{1, \ldots, K\} \qquad (14)$$

5.1.4. Optimizer

For our CNN architecture, we use the Adam optimizer [48]. Adam is a stochastic gradient descent method that uses estimates of the first two moments of the gradient (v and m, shown in Equations (15)–(18)). It can deal with non-stationarity of the objective in a similar fashion to RMSProp, while overcoming RMSProp's sparse-gradient limitation [48].
$$\theta \leftarrow \theta - \frac{\alpha}{\sqrt{\hat{v}} + \epsilon}\, \hat{m} \qquad (15)$$
$$g_{i,t} = \nabla_{\theta} J(\theta_i, x_i, y_i) \qquad (16)$$
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_{i,t} \qquad (17)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_{i,t}^2 \qquad (18)$$
where $m_t$ is the estimated first moment and $v_t$ the estimated second moment, with bias-corrected versions $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$.

5.1.5. Network Architecture

As demonstrated in Figure 7, our implementation contains three convolutional layers, each followed by a max-pooling layer. The model takes three-channel input image patches of size 1000 × 1000 pixels. The first convolutional layer has 32 filters with a kernel size of (3, 3). A (5, 5) pooling layer then reduces the feature maps from 1000 × 1000 to 200 × 200. The next convolutional layer includes 32 filters with a (3, 3) kernel, followed by a 2D max-pooling layer that scales the feature space down from 200 × 200 to 40 × 40. The final convolutional layer contains 64 filters with a (3, 3) kernel and is connected to a 2D max-pooling layer that scales the feature maps down to 8 × 8. The feature map is flattened, and a fully connected layer with 128 nodes is attached. The output layer has 3 nodes representing our parent classes (Environmental Enteropathy, Celiac Disease, and Normal). The child level of this model, shown at the bottom of Figure 7, is similar to the parent level, with the significant difference that its output layer has 4 nodes representing our child classes (I, IIIa, IIIb, and IIIc).
The Adam optimizer (see Section 5.1.4) is used with a learning rate of 0.001, β1 = 0.9, and β2 = 0.999. The loss function is sparse categorical cross-entropy [49]. For all layers we use the Rectified Linear Unit (ReLU) activation function, except for the output layer, which uses Softmax (see Section 5.1.3). We also use dropout in each layer to address over-fitting [50].
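A Keras sketch of this architecture is given below; the layer sizes, optimizer settings, and loss follow the description above, while the dropout rate of 0.25 is an assumption, as the paper does not report it.

```python
from tensorflow.keras import layers, models, optimizers

def build_hmic_level(n_classes: int):
    """One level of the HMIC hierarchy: three conv blocks, then a dense head."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=(1000, 1000, 3)),
        layers.MaxPooling2D((5, 5)),   # 1000 -> 200
        layers.Dropout(0.25),          # rate is an assumption
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((5, 5)),   # 200 -> 40
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((5, 5)),   # 40 -> 8
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(
        optimizer=optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model

parent_model = build_hmic_level(3)  # EE, Celiac Disease, Normal
child_model = build_hmic_level(4)   # Marsh I, IIIa, IIIb, IIIc
```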

5.2. Whole Slide Classification

The objective of this study was to classify WSIs based on the diagnosis of CD and EE, and CD severity at the child level by means of the modified Marsh score. The model was trained at the patch level and extended to the WSI level. To accomplish this, a heuristic strategy was created that aggregates patch classifications and translates them into whole-slide inferences. Each WSI in the test set was first patched; patches that did not contain useful information were filtered out, and stain methods were then applied to the patches (color balancing at the parent level and stain normalization for CD severity). After these pre-processing steps, our trained model was applied for image classification. We denote the probability distribution over possible labels, given a patch image x and training set D, by p(y|x, D). This classification produces a vector of length C, where C is the number of classes. In our notation, the probability is conditional on the test patch x as well as the training set D. The trained model predicts a vector of probabilities (three at the parent level and four at the child level) representing the likelihood that an image belongs to each class. Given a probabilistic result, patch j in slide i is assigned the most likely class label ŷᵢⱼ, as shown in Equation (19).
$$\hat{y}_{ij} = \underset{c \in \{1, 2, \ldots, C\}}{\arg\max}\ p(y_{ij} = c \mid x_{ij}, D) \qquad (19)$$
where $\hat{y}$ is the maximum a posteriori (MAP) estimate. Summing these vectors (the output vectors of all patches of a single WSI) and normalizing the result yields a vector whose components indicate the likelihood of each class (at the parent level: CD, EE, and Normal) for the corresponding WSI. Equation (20) shows how the class of a WSI is predicted.
$$\hat{y}_{i} = \underset{c \in \{1, 2, \ldots, C\}}{\arg\max}\ \sum_{j=1}^{N_i} p(y_{ij} = c \mid x_{ij}, D) \qquad (20)$$
where $N_i$ is the number of patches in slide $i$.
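A sketch of this aggregation rule follows, assuming the `model` and patch variables from the earlier sketches; note that the arg-max in Equation (20) is unchanged by normalization, so the sketch simply sums the per-patch probabilities.

```python
import numpy as np

def classify_wsi(model, slide_patches):
    """Aggregate patch-level probability vectors into a slide-level label,
    following Equations (19) and (20)."""
    probs = model.predict(np.stack(slide_patches) / 255.0)  # shape (N_i, C)
    patch_labels = probs.argmax(axis=1)  # Equation (19): per-patch MAP label
    slide_scores = probs.sum(axis=0)     # sum over the N_i patches of slide i
    return int(slide_scores.argmax()), patch_labels  # Equation (20)
```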

5.3. Hierarchical Medical Image Classification

The main contribution of this paper is the hierarchical medical image classification of biopsies. A common multi-class algorithm is functional and efficient for a limited number of categories. However, performance drops when the classes contain unequal numbers of data points. In our approach, this issue is addressed by creating a hierarchical structure with a separate deep learning model for each level of the clinical hierarchy (e.g., see Figure 7). A sketch of the resulting two-level inference is shown below.
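This sketch reuses the hypothetical `parent_model` and `child_model` from the Section 5.1.5 sketch; the class orderings are illustrative assumptions. Only patches classified as CD at the parent level descend to the child model for severity grading.

```python
import numpy as np

PARENT_CLASSES = ['EE', 'Celiac Disease', 'Normal']
CHILD_CLASSES = ['I', 'IIIa', 'IIIb', 'IIIc']
CD_INDEX = PARENT_CLASSES.index('Celiac Disease')

def hmic_predict(patch_batch):
    """Return a (parent label, child label or None) pair for each patch."""
    parent_probs = parent_model.predict(patch_batch)
    results = []
    for i, p in enumerate(parent_probs):
        parent_idx = int(p.argmax())
        child_label = None
        if parent_idx == CD_INDEX:  # only CD patches descend the hierarchy
            child_probs = child_model.predict(patch_batch[i:i + 1])[0]
            child_label = CHILD_CLASSES[int(child_probs.argmax())]
        results.append((PARENT_CLASSES[parent_idx], child_label))
    return results
```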

6. Results

In this section, we present two main sets of results: empirical results and visualizations of patches. The empirical results are mostly used to compare our accuracy against the baselines.

6.1. Evaluation Setup

In the computer science community, shareable and comparable performance measures for assessing an algorithm are desirable. However, in real projects, such measures may only exist for a few methods. A pervasive problem when assessing medical image categorization models is the absence of a standard data collection protocol. Even if a common method existed, simply choosing different training and test sets can introduce discrepancies in model performance [51]. Performance measures evaluate specific aspects of image classification. In this section, we explain the different performance measures and metrics used in this research paper. These metrics are calculated from a "confusion matrix" comprising true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) [52]; the importance of these four measures may shift depending on the application. The fraction of correct predictions over the total number of test samples is the overall accuracy (Equation (21)). The fraction of correctly predicted positives over all predicted positives is the precision, i.e., the positive predictive value (Equation (22)); recall (Equation (23)) and F1-score (Equation (24)) are defined analogously.
$$accuracy = \frac{TP + TN}{TP + FP + FN + TN} \qquad (21)$$
$$Precision = \frac{\sum_{l=1}^{L} TP_l}{\sum_{l=1}^{L} (TP_l + FP_l)} \qquad (22)$$
$$Recall = \frac{\sum_{l=1}^{L} TP_l}{\sum_{l=1}^{L} (TP_l + FN_l)} \qquad (23)$$
$$F1\text{-}Score = \frac{\sum_{l=1}^{L} 2\, TP_l}{\sum_{l=1}^{L} (2\, TP_l + FP_l + FN_l)} \qquad (24)$$
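Because the numerators and denominators in Equations (22)–(24) sum over all L classes, these are micro-averaged scores; a minimal sketch using scikit-learn follows, where y_true and y_pred are assumed label arrays.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(y_true, y_pred):
    """Micro-averaged metrics matching Equations (21)-(24)."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, average='micro'),
        'recall': recall_score(y_true, y_pred, average='micro'),
        'f1': f1_score(y_true, y_pred, average='micro'),
    }
```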

6.2. Experimental Setup

The following results were obtained using a combination of central processing units (CPUs) and graphics processing units (GPUs). Processing was done on a Core i7-9700F with 8 cores and 128 GB of memory, and the GPU cards were two Nvidia GeForce RTX 2080 Ti. We implemented our approaches in Python using the Compute Unified Device Architecture (CUDA), a parallel computing platform and Application Programming Interface (API) created by Nvidia. We also used the Keras and TensorFlow libraries for creating the neural networks [49,53].

6.3. Empirical Results

In this sub-section, as we discussed in Section 6.1, we report precision, recall, and F1-score.
Table 3 shows the results of the parent-level model trained to classify between Normal, Environmental Enteropathy (EE), and Celiac Disease (CD). The precision on Normal patches is 89.97 ± 0.5973 and the recall is 89.35 ± 0.6133; the F1-score for Normal is 89.66 ± 0.6054. For EE, precision is 94.02 ± 0.4955, recall is 97.30 ± 0.3385, and F1-score is 95.63 ± 0.4270. The CD evaluation measures for the parent level are as follows: precision is 91.12 ± 0.3208, recall is 88.71 ± 0.3569, and F1-score is 89.90 ± 1.2778.
Table 4 compares our technique with three different baselines: a Convolutional Neural Network (CNN), a Deep Neural Network (multilayer perceptron), and a Deep Convolutional Neural Network (DCNN). Much research has been done in this domain, such as ResNet, but those architectures are designed for small images such as 250 × 250, whereas this dataset uses 1000 × 1000 patches, so we could not compare our work with ResNet, AlexNet, etc. Regarding precision, the highest is HMIC whole-slide with a mean of 88.01 percent and a confidence interval of 0.3841, followed by HMIC non-whole-slide at 84.13 percent with a confidence interval of 0.3751. The precision of the CNN is 76.76 ± 0.4985, the multilayer perceptron 76.19 ± 0.5030, and the DCNN 82.95 ± 0.4439. Regarding recall, the highest is HMIC whole-slide with a mean of 93.98 percent and a confidence interval of 0.2811, followed by HMIC non-whole-slide at 93.56 percent with a confidence interval of 0.291. The recall of the CNN is 80.18 ± 0.4706, the multilayer perceptron 79.4 ± 0.471, and the DCNN 87.28 ± 0.3933. The highest F1-score is HMIC whole-slide with a mean of 90.89 percent and a confidence interval of 0.3804, followed by HMIC non-whole-slide at 88.61 percent with a confidence interval of 0.3751. The F1-score of the CNN is 78.43 ± 0.4855, the multilayer perceptron 77.76 ± 0.4911, and the DCNN 85.06 ± 0.4207.
Table 5 shows the results per class. For Normal images, the best classifier is the DCNN, with precision of 95.14 ± 0.42, recall of 94.91 ± 0.43, and F1-score of 95.14 ± 0.42. For EE, HMIC is the best classifier; the whole-slide parent-level classifier is more robust than the non-whole-slide one, with precision of 94.08 ± 0.49, recall of 97.33 ± 0.42, and F1-score of 98.68 ± 0.42. Although the results for Normal and EE images are very similar to those of flat models such as the DCNN, the sub-classes of CD contain four different stages, and there the margin is very high. The best flat (non-hierarchical) model is the DCNN, with mean F1-scores of 73.99 for I, 71.63 for IIIa, 77.74 for IIIb, and 75.71 for IIIc.
Table 5 also indicates that the margin at the child level is very high, even for the non-whole-slide variant. The best results belong to the whole-slide classifier, with precision of 88.73 ± 1.34 for I, 81.19 ± 1.65 for IIIa, 90.51 ± 1.24 for IIIb, and 89.26 ± 1.31 for IIIc; recall of 85.07 ± 1.51 for I, 81.19 ± 1.65 for IIIa, 90.48 ± 1.27 for IIIb, and 90.18 ± 1.26 for IIIc; and F1-scores of 86.86 ± 1.43 for I, 82.44 ± 1.51 for IIIa, 90.49 ± 1.16 for IIIb, and 89.72 ± 1.28 for IIIc.

6.4. Visualization

Grad-CAMs were generated for 41 patches (18 EE, 14 Celiac Disease, and 9 histologically normal duodenal controls), which mainly focused on the distinct yet medically relevant cellular features outlined below. Although most heatmaps focused on medically relevant features, some patches focused on too many features (n = 8) or on connective tissue debris (n = 10), and these we were unable to categorize.
As shown in Figure 8, the three categories are described as follows:
  • EE: surface epithelium with IELs and goblet cells was highlighted. Within the lamina propria, the heatmaps also focused on mononuclear cells.
  • CD: heatmaps highlighted the edge of crypt cross sections, surface epithelium with IELs and goblet cells, and areas with mononuclear cells within the lamina propria.
  • Histologically Normal: surface epithelium with epithelial cells containing abundant cytoplasm was highlighted.

7. Conclusions

Medical image classification is a significant problem to address, given the growing number of medical instruments that collect digital images. When medical images are organized hierarchically, multi-class approaches are difficult to apply using traditional supervised learning methods. This paper introduces a novel approach to hierarchical medical image classification, HMIC, that uses multiple deep convolutional neural networks to produce hierarchical classifications; in our experiments, we use a two-level CNN hierarchy. Testing on a medical image dataset shows that this technique produces robust results at both the higher and lower levels, with accuracy consistently higher than that obtainable by conventional approaches using a CNN, multilayer perceptron, or DCNN. These results show that hierarchical deep learning methods can improve classification and provide flexibility for classifying these data within a hierarchy. Hence, they extend current and traditional methods that only consider the multi-class problem.
This modeling approach can be extended in a couple of ways. Additional training and testing with other hierarchically structured clinical data will help to identify architectures that work better for these problems. Deeper hierarchies are another possible extension: for instance, if the stage of the disease is treated as ordered, the hierarchy continues down multiple levels. Scoring here could be performed on small sets using human judges.

Author Contributions

K.K., S.S., and D.B. worked on the Concept and design of the platform. K.K. worked on the implementation of these models. K.K., S.S., and L.E. worked on the analysis and interpretation of data. K.K. worked on the drafting of the manuscript. K.K., R.S., and W.A. worked on the critical revision of the manuscript for important intellectual content. D.B., S.S., B.A., S.M. and A.A. obtained funding. This work was under the supervision of S.S., P.K., A.A., S.M., and D.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a University of Virginia Engineering in Medicine SEED Grant (SS & DEB), the University of Virginia Translational Health Research Institute of Virginia (THRIV) Mentored Career Development Award (SS), and the Bill and Melinda Gates Foundation (AA, OPP1138727; SRM, OPP1144149; PK, OPP1066118). Research reported in this publication was supported by the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health under award number K23 DK117061-01A1. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

References

  1. Sali, R.; Ehsan, L.; Kowsari, K.; Khan, M.; Moskaluk, C.A.; Syed, S.; Brown, D.E. CeliacNet: Celiac Disease Severity Diagnosis on Duodenal Histopathological Images Using Deep Residual Networks. arXiv 2019, arXiv:1910.03084.
  2. Kowsari, K.; Sali, R.; Khan, M.N.; Adorno, W.; Ali, S.A.; Moore, S.R.; Amadi, B.C.; Kelly, P.; Syed, S.; Brown, D.E. Diagnosis of celiac disease and environmental enteropathy on biopsy images using color balancing on convolutional neural networks. In Proceedings of the Future Technologies Conference; Springer: Cham, Switzerland, 2019; pp. 750–765.
  3. Kowsari, K. Diagnosis and Analysis of Celiac Disease and Environmental Enteropathy on Biopsy Images using Deep Learning Approaches. Ph.D. Thesis, University of California, Los Angeles, CA, USA, 2020.
  4. Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text Classification Algorithms: A Survey. Information 2019, 10, 150.
  5. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
  6. Nobles, A.L.; Glenn, J.J.; Kowsari, K.; Teachman, B.A.; Barnes, L.E. Identification of imminent suicide risk among young adults using text messages. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; ACM: New York, NY, USA, 2018; p. 413.
  7. Zhai, S.; Cheng, Y.; Zhang, Z.M.; Lu, W. Doubly convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1082–1090.
  8. Hegde, R.B.; Prasad, K.; Hebbar, H.; Singh, B.M.K. Comparison of traditional image processing and deep learning approaches for classification of white blood cells in peripheral blood smear images. Biocybern. Biomed. Eng. 2019, 39, 382–392.
  9. Zhang, J.; Kowsari, K.; Harrison, J.H.; Lobo, J.M.; Barnes, L.E. Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record. IEEE Access 2018, 6, 65333–65346.
  10. Pavik, I.; Jaeger, P.; Ebner, L.; Wagner, C.A.; Petzold, K.; Spichtig, D.; Poster, D.; Wüthrich, R.P.; Russmann, S.; Serra, A.L. Secreted Klotho and FGF23 in chronic kidney disease Stage 1 to 5: A sequence suggested from a cross-sectional study. Nephrol. Dial. Transplant. 2013, 28, 352–359.
  11. Kowsari, K.; Brown, D.E.; Heidarysafa, M.; Meimandi, K.J.; Gerber, M.S.; Barnes, L.E. HDLTex: Hierarchical deep learning for text classification. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 364–371.
  12. Dumais, S.; Chen, H. Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 24–28 July 2000; pp. 256–263.
  13. Yan, Z.; Piramuthu, R.; Jagadeesh, V.; Di, W.; Decoste, D. Hierarchical Deep Convolutional Neural Network for Image Classification. U.S. Patent 10,387,773, 20 August 2019.
  14. Seo, Y.; Shin, K.S. Hierarchical convolutional neural networks for fashion image classification. Expert Syst. Appl. 2019, 116, 328–339.
  15. Ranjan, N.; Machingal, P.V.; Jammalmadka, S.S.D.; Thenaknidiyoor, V.; Dileep, A. Hierarchical Approach for Breast Cancer Histopathology Images Classification. 2018. Available online: https://openreview.net/forum?id=rJlGvTojG (accessed on 10 January 2019).
  16. Zhu, X.; Bain, M. B-CNN: Branch convolutional neural network for hierarchical classification. arXiv 2017, arXiv:1709.09890.
  17. Sali, R.; Adewole, S.; Ehsan, L.; Denson, L.A.; Kelly, P.; Amadi, B.C.; Holtz, L.; Ali, S.A.; Moore, S.R.; Syed, S.; et al. Hierarchical Deep Convolutional Neural Networks for Multi-category Diagnosis of Gastrointestinal Disorders on Histopathological Images. arXiv 2020, arXiv:2005.03868.
  18. Syed, S.; Ali, A.; Duggan, C. Environmental enteric dysfunction in children: A review. J. Pediatr. Gastroenterol. Nutr. 2016, 63, 6.
  19. Naylor, C.; Lu, M.; Haque, R.; Mondal, D.; Buonomo, E.; Nayak, U.; Mychaleckyj, J.C.; Kirkpatrick, B.; Colgate, R.; Carmolli, M.; et al. Environmental enteropathy, oral vaccine failure and growth faltering in infants in Bangladesh. EBioMedicine 2015, 2, 1759–1766.
  20. Husby, S.; Koletzko, S.; Korponay-Szabó, I.R.; Mearin, M.L.; Phillips, A.; Shamir, R.; Troncone, R.; Giersiepen, K.; Branski, D.; Catassi, C.; et al. European Society for Pediatric Gastroenterology, Hepatology, and Nutrition guidelines for the diagnosis of coeliac disease. J. Pediatr. Gastroenterol. Nutr. 2012, 54, 136–160.
  21. Fasano, A.; Catassi, C. Current approaches to diagnosis and treatment of celiac disease: An evolving spectrum. Gastroenterology 2001, 120, 636–651.
  22. Hou, L.; Samaras, D.; Kurc, T.M.; Gao, Y.; Davis, J.E.; Saltz, J.H. Patch-based convolutional neural network for whole slide tissue image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2424–2433.
  23. Goodfellow, I.; Bengio, Y.; Courville, A.; Bengio, Y. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1.
  24. Wang, W.; Huang, Y.; Wang, Y.; Wang, L. Generalized autoencoder: A neural network framework for dimensionality reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 490–497.
  25. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation; Technical Report; California Univ San Diego La Jolla Inst for Cognitive Science: La Jolla, CA, USA, 1985.
  26. Liang, H.; Sun, X.; Sun, Y.; Gao, Y. Text feature extraction based on deep learning: A review. EURASIP J. Wirel. Commun. Netw. 2017, 2017, 211.
  27. Masci, J.; Meier, U.; Cireşan, D.; Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2011; pp. 52–59.
  28. Chen, K.; Seuret, M.; Liwicki, M.; Hennebert, J.; Ingold, R. Page segmentation of historical document images with convolutional autoencoders. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR); IEEE: Washington, DC, USA, 2015; pp. 1011–1015.
  29. Geng, J.; Fan, J.; Wang, H.; Ma, X.; Li, B.; Chen, F. High-resolution SAR image classification via deep convolutional autoencoders. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2351–2355.
  30. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666.
  31. Gao, Q.; Xu, H.X.; Han, H.G.; Guo, M. Soft-sensor Method for Surface Water Qualities Based on Fuzzy Neural Network. In Proceedings of the 2019 Chinese Control Conference (CCC), Guangzhou, China, 27–30 July 2019; pp. 6877–6881.
  32. Kowsari, K.; Yammahi, M.; Bari, N.; Vichr, R.; Alsaby, F.; Berkovich, S.Y. Construction of fuzzyfind dictionary using golay coding transformation for searching applications. arXiv 2015, arXiv:1503.06483.
  33. Kowsari, K.; Alassaf, M.H. Weighted unsupervised learning for 3d object detection. arXiv 2016, arXiv:1602.05920.
  34. Alassaf, M.H.; Kowsari, K.; Hahn, J.K. Automatic, real time, unsupervised spatio-temporal 3d object detection using rgb-d cameras. In Proceedings of the 2015 19th International Conference on Information Visualisation, Barcelona, Spain, 22–24 July 2015; pp. 444–449.
  35. Manning, C.D.; Raghavan, P.; Schutze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008; Volume 20, pp. 405–416.
  36. Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The Planar k-Means Problem is NP-Hard. In WALCOM: Algorithms and Computation; Das, S., Uehara, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 274–285.
  37. Fischer, A.H.; Jacobson, K.A.; Rose, J.; Zeller, R. Hematoxylin and eosin staining of tissue and cell sections. Cold Spring Harb. Protoc. 2008, 2008, pdb.prot4986.
  38. Anderson, J. An Introduction to Routine and Special Staining; 2011; retrieved 18 August 2014.
  39. Khan, A.M.; Rajpoot, N.; Treanor, D.; Magee, D. A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution. IEEE Trans. Biomed. Eng. 2014, 61, 1729–1738.
  40. Bianco, S.; Cusano, C.; Napoletano, P.; Schettini, R. Improving CNN-Based Texture Classification by Color Balancing. J. Imaging 2017, 3, 33.
  41. Bianco, S.; Schettini, R. Error-tolerant color rendering for digital cameras. J. Math. Imaging Vis. 2014, 50, 235–245.
  42. Vahadane, A.; Peng, T.; Sethi, A.; Albarqouni, S.; Wang, L.; Baust, M.; Steiger, K.; Schlitter, A.M.; Esposito, I.; Navab, N. Structure-preserving color normalization and sparse stain separation for histological images. IEEE Trans. Med. Imaging 2016, 35, 1962–1971.
  43. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
  44. Kowsari, K.; Heidarysafa, M.; Brown, D.E.; Meimandi, K.J.; Barnes, L.E. RMDL: Random multimodel deep learning for classification. In Proceedings of the 2nd International Conference on Information System and Data Mining, Lakeland, FL, USA, 9–11 April 2018; pp. 19–28.
  45. Li, Q.; Cai, W.; Wang, X.; Zhou, Y.; Feng, D.D.; Chen, M. Medical image classification with convolutional neural network. In Proceedings of the 2014 13th International Conference on Control Automation Robotics & Vision (ICARCV), Singapore, 10–12 December 2014; pp. 844–848.
  46. Heidarysafa, M.; Kowsari, K.; Brown, D.E.; Jafari Meimandi, K.; Barnes, L.E. An Improvement of Data Classification Using Random Multimodel Deep Learning (RMDL). arXiv 2018, arXiv:1808.08121.
  47. Scherer, D.; Müller, A.; Behnke, S. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the Artificial Neural Networks–ICANN 2010, Thessaloniki, Greece, 15–18 September 2010; pp. 92–101.
  48. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  49. Chollet, F. Keras: Deep Learning Library for Theano and Tensorflow. 2015. Available online: https://keras.io/ (accessed on 19 August 2019).
  50. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
  51. Yang, Y. An evaluation of statistical approaches to text categorization. Inf. Retr. 1999, 1, 69–90.
  52. Lever, J.; Krzywinski, M.; Altman, N. Points of significance: Classification evaluation. Nat. Methods 2016, 13, 603–604.
  53. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv 2016, arXiv:1603.04467.
Figure 1. HMIC: Hierarchical Medical Image Classification.
Figure 2. Pipeline of patching and applying an autoencoder to find useful patches for the training model. The biopsy images are very large, so we divide them into smaller patches for use in the machine learning model. As the image shows, many of these patches are empty. After using an autoencoder, we apply a clustering algorithm to discard useless patches (green patches contain useful information, while red patches do not).
Figure 3. Example autoencoder architecture with K-means applied on the bottle-neck layer feature vector to cluster useful and not useful patches.
Figure 4. Some samples of clustering results: cluster 1 includes patches with useful information and cluster 2 includes patches without useful information (mostly created from background parts of WSIs).
Figure 5. Color balancing samples for the three classes.
Figure 6. Stain normalization results using the method proposed by Vahadane et al. [42]. Images in the first row are the source images; they are normalized to the stain appearance of the target image in the second row [1].
Figure 7. Structure of the Convolutional Neural Network using multiple 2D feature detectors and 2D max-pooling.
Figure 8. Grad-CAM results showing feature importance.
Table 1. Population statistics of the biopsy dataset.

| | Total Population | Pakistan (EE) | Zambia (EE) | US (Celiac) | US (Normal) |
|---|---|---|---|---|---|
| Patients, n | 150 | 10 | 16 | 63 | 61 |
| Biopsy images | 461 | 29 | 19 | 239 | 174 |
| Age, median (IQR), months | 37.5 (19.0 to 121.5) | 22.2 (20.8 to 23.4) | 16.5 (9.5 to 21.0) | 130.0 (85.0 to 176.0) | 25.0 (16.5 to 41.0) |
| Gender, n (%) | M = 77 (51.3%), F = 73 (48.7%) | M = 5 (50%), F = 5 (50%) | M = 10 (62.5%), F = 6 (37.5%) | M = 29 (46%), F = 34 (54%) | M = 33 (54%), F = 28 (46%) |
| LAZ/HAZ, median (IQR) | −0.6 (−1.9 to 0.4) | −2.8 (−3.6 to −2.3) | −3.1 (−4.1 to −2.2) | −0.3 (−0.8 to 0.7) | 0.2 (−1.3 to 0.5) |
Table 2. Dataset used for Hierarchical Medical Image Classification (HMIC).

| Data | Train | Test | Total |
|---|---|---|---|
| Normal | 22,676 | 9717 | 32,393 |
| Environmental Enteropathy | 20,516 | 8792 | 29,308 |
| Celiac Disease (parent) | 21,140 | 9058 | 30,198 |
| Celiac Disease (child): I | 4988 | 2137 | 7125 |
| Celiac Disease (child): IIIa | 4790 | 2052 | 6842 |
| Celiac Disease (child): IIIb | 5684 | 2436 | 8120 |
| Celiac Disease (child): IIIc | 5678 | 2433 | 8111 |
Table 3. Results of parent-level classification for Normal, Environmental Enteropathy, and Celiac Disease.

| | Precision | Recall | F1-Score |
|---|---|---|---|
| Normal | 89.97 ± 0.59 | 89.35 ± 0.61 | 89.66 ± 0.60 |
| Environmental Enteropathy | 94.02 ± 0.49 | 97.30 ± 0.33 | 95.63 ± 0.42 |
| Celiac Disease | 91.12 ± 0.32 | 88.71 ± 0.35 | 89.90 ± 1.27 |
Table 4. Results of HMIC compared with our baselines.

| Model | | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Baseline | CNN | 76.76 ± 0.49 | 80.18 ± 0.47 | 78.43 ± 0.48 |
| Baseline | Multilayer perceptron | 76.19 ± 0.50 | 79.40 ± 0.47 | 77.76 ± 0.49 |
| Baseline | Deep CNN | 82.95 ± 0.44 | 87.28 ± 0.39 | 85.06 ± 0.42 |
| HMIC | Non whole slide | 84.13 ± 0.37 | 93.56 ± 0.29 | 88.61 ± 0.37 |
| HMIC | Whole slide | 88.01 ± 0.38 | 93.98 ± 0.28 | 90.89 ± 0.38 |
Table 5. Per-class results of HMIC compared with our baselines.

| Model | | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Baseline | CNN | Normal | 87.83 ± 0.57 | 90.77 ± 0.65 | 89.28 ± 0.61 |
| | | Environmental Enteropathy | 90.93 ± 0.61 | 82.48 ± 0.79 | 86.50 ± 0.71 |
| | | Celiac Disease: I | 68.37 ± 1.98 | 68.62 ± 1.96 | 68.50 ± 1.96 |
| | | Celiac Disease: IIIa | 56.26 ± 1.01 | 56.26 ± 2.21 | 59.29 ± 1.95 |
| | | Celiac Disease: IIIb | 65.28 ± 0.97 | 98.28 ± 2.01 | 66.64 ± 1.87 |
| | | Celiac Disease: IIIc | 62.66 ± 1.99 | 66.83 ± 1.99 | 64.68 ± 2.02 |
| Baseline | Multilayer perceptron | Normal | 87.97 ± 0.76 | 81.87 ± 0.76 | 84.81 ± 0.71 |
| | | Environmental Enteropathy | 87.25 ± 0.69 | 90.18 ± 0.62 | 88.69 ± 0.66 |
| | | Celiac Disease: I | 57.92 ± 2.07 | 60.74 ± 2.07 | 59.30 ± 2.09 |
| | | Celiac Disease: IIIa | 62.58 ± 2.09 | 62.18 ± 2.09 | 60.89 ± 2.11 |
| | | Celiac Disease: IIIb | 65.00 ± 1.89 | 66.09 ± 1.87 | 65.56 ± 1.88 |
| | | Celiac Disease: IIIc | 67.97 ± 1.85 | 74.85 ± 1.72 | 71.24 ± 1.78 |
| Baseline | DCNN | Normal | 95.14 ± 0.42 | 94.91 ± 0.43 | 95.14 ± 0.42 |
| | | Environmental Enteropathy | 92.22 ± 0.55 | 90.62 ± 0.60 | 91.52 ± 0.58 |
| | | Celiac Disease: I | 75.41 ± 1.82 | 72.63 ± 1.89 | 73.99 ± 1.85 |
| | | Celiac Disease: IIIa | 70.81 ± 1.92 | 72.47 ± 1.93 | 71.63 ± 1.79 |
| | | Celiac Disease: IIIb | 81.08 ± 0.81 | 74.67 ± 1.84 | 77.74 ± 1.65 |
| | | Celiac Disease: IIIc | 75.07 ± 1.83 | 76.37 ± 1.81 | 75.71 ± 1.81 |
| HMIC | Non whole slide | Normal | 89.97 ± 0.59 | 89.35 ± 0.61 | 89.66 ± 0.61 |
| | | Environmental Enteropathy | 94.02 ± 0.49 | 97.30 ± 0.33 | 95.63 ± 0.33 |
| | | Celiac Disease: I | 83.25 ± 1.58 | 80.91 ± 1.66 | 82.06 ± 1.62 |
| | | Celiac Disease: IIIa | 80.34 ± 1.62 | 80.46 ± 1.71 | 80.40 ± 1.57 |
| | | Celiac Disease: IIIb | 85.35 ± 1.49 | 81.77 ± 1.67 | 83.52 ± 1.47 |
| | | Celiac Disease: IIIc | 85.54 ± 1.49 | 82.71 ± 1.60 | 84.10 ± 1.55 |
| HMIC | Whole slide | Normal | 90.64 ± 0.57 | 90.06 ± 0.57 | 90.35 ± 0.58 |
| | | Environmental Enteropathy | 94.08 ± 0.49 | 97.33 ± 0.42 | 98.68 ± 0.42 |
| | | Celiac Disease: I | 88.73 ± 1.34 | 85.07 ± 1.51 | 86.86 ± 1.43 |
| | | Celiac Disease: IIIa | 81.19 ± 1.65 | 81.19 ± 1.65 | 82.44 ± 1.51 |
| | | Celiac Disease: IIIb | 90.51 ± 1.24 | 90.48 ± 1.27 | 90.49 ± 1.16 |
| | | Celiac Disease: IIIc | 89.26 ± 1.31 | 90.18 ± 1.26 | 89.72 ± 1.28 |
