1. Introduction
Histopathological analysis involves procedures that investigate tissue samples commonly stained with specific dyes, such as hematoxylin and eosin (H&E) [1,2]. In these processes, specialists identify unusual alterations in cell structures and highlight potential abnormal health conditions, such as in the diagnosis of cancer. It is noteworthy that this type of disease is a significant cause of early deaths worldwide and has high social and economic costs [3]. For instance, cancer is the second leading cause of death in the United States [4]. Thus, the early detection of diseases often enables less invasive treatments and increases the possibility of finding a cure and/or of patient survival. The steps required in the preparation of H&E images can influence the presentation of histological aspects, further increasing the difficulty of accurately diagnosing the diseases under investigation. In addition to this problem, the analysis process takes time and may be susceptible to subjective interpretations by specialists [5,6,7]. These interpretation problems are mainly caused by human factors, such as subjectivity and fatigue. On the other hand, computer-aided diagnosis (CAD) methods play a fundamental role in this task since they can support specialists with second opinions [8,9,10], especially regarding H&E images [5,8,11,12,13,14,15,16,17].
In this regard, two categories of descriptors are typically investigated for the development of CAD systems. The first category consists of handcrafted features (HFs) defined by distinct extraction methods, usually aiming to overcome specific problems [18,19,20,21,22]. Among the HFs, it is possible to highlight techniques based on fractal geometry that use multiscale and multidimensional methods (Higuchi fractal dimension, probabilistic fractal dimension, box fusion fractal dimension, lacunarity and percolation) [23,24,25,26,27,28,29], Haralick descriptors [30] and local binary patterns (LBPs) [31]. For instance, Haralick descriptors and LBPs have been applied in several imaging contexts [32,33,34], exploring the identification of lung cancer subtypes [35], the presence of cancerous characteristics in breast tissue samples [18,36] and the classification of colorectal cancer [6]. In addition, techniques that involve fractals at multiple scales and/or dimensions have also been applied to quantify the pathological architectures of tumors [23,25,26], demonstrating relevant results in the pattern recognition of prostate cancer [37], lymphomas [38], intraepithelial neoplasia [39], breast tumors [40], colorectal cancer [13] and psoriatic lesions [41]. Moreover, fractal methods are important for texture analysis because they provide information about the complexity of textures and patterns that are similar at various levels of magnification [42].
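As an illustration of how two of these handcrafted descriptor families can be computed, the sketch below extracts an LBP histogram and Haralick-style gray-level co-occurrence measures from a grayscale patch, assuming scikit-image (version 0.19 or later for graycomatrix/graycoprops); the parameters and the random patch are illustrative only and do not reproduce the exact configuration used in this work.

```python
# Illustrative extraction of two handcrafted descriptor families (LBPs and
# Haralick-style GLCM measures) from a grayscale patch; the parameters are
# examples only, not the configuration described in this work.
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)  # stand-in for an H&E patch

# LBP histogram (uniform patterns, 8 neighbors, radius 1).
lbp = local_binary_pattern(patch, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=int(lbp.max()) + 1, density=True)

# Haralick-style features from a gray-level co-occurrence matrix.
glcm = graycomatrix(patch, distances=[1], angles=[0, np.pi / 2],
                    levels=256, symmetric=True, normed=True)
haralick = [graycoprops(glcm, prop).mean()
            for prop in ("contrast", "homogeneity", "energy", "correlation")]

feature_vector = np.concatenate([lbp_hist, haralick])
```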
The second group of descriptors consists of deep-learned (DL) features obtained using convolutional neural networks (CNNs) [43]. This group has been useful for defining different CAD approaches [15,17,22,44,45,46,47,48] that consider data representations with multiple levels of abstraction [8,47,49]. The most widely explored models are those that have provided the best accuracy rates on the ImageNet dataset [50], such as AlexNet [51], GoogleNet [52], VGGNet [53], ResNet [54] and Inception [52], which have also been explored in other applications [20,32,55].
Despite the advances provided individually by DL features and HFs, investigations have been carried out to develop models based on combinations of these features [18,21,32,55,56,57,58], generating strategies known as ensemble learning (EL) [20,59]. Moreover, these studies have indicated that there is no single solution for distinct datasets. EL models can also consider distinct classification algorithms in order to obtain more accurate single decisions by applying ensembles of classifiers [20,60]. This type of association has provided important results in the study of cervical cell imaging [20]. Another highlight is the model presented by [59], who conducted a comparative study between a logistic regression classifier trained only with DL features, an ensemble of HFs and an ensemble of all features. The authors concluded that the ensemble involving all features delivered the best distinction rates. In the context of using fractal techniques with CNN models, the method presented by [29] considered an ensemble involving two CNN architectures: one pre-trained with histological images and the other pre-trained with artificial images, which were generated using features from fractal techniques. The authors concluded that their proposal outperformed classification algorithms and CNN models when applied separately.
It is essential to observe that EL models can be developed with the most relevant features, exploring a single selection stage to reduce the search space and increase the accuracy of the system [13,15,18]. Feature selection can also be implemented in two stages [61,62,63]. In these cases, the strategies were applied to ensembles of DL features, and the results obtained were better than those obtained via a single stage. Thus, this approach can provide a reduced number of main compositions for developing CAD systems, making knowledge even more comprehensive for specialists. Moreover, the use of this combination is not restricted to image analysis, with promising solutions also reported in the frequency domain [64].
The versatility of EL strategies has been observed in different combinations of features and classifiers used to investigate some types of histological images [20,21,29,32,55,56,59,65,66], but not as described here, where the aim is to define patterns of techniques across multiple H&E datasets. Some examples of EL strategies that can still be investigated are ensembles of HFs, of DL features, and of HFs and DL features together, all in a classifier ensemble context. The best configuration can be compared against classifications obtained only with CNN models, which is useful to indicate the pertinence of ensemble learning in the pattern recognition of various H&E images (colorectal cancer, oral dysplasia, non-Hodgkin's lymphoma, and liver tissue). Moreover, the previously indicated EL models can also be explored via feature selection in two stages, which is a valuable approach to present more optimized solutions, in addition to significantly reducing a high-dimensional search space, such as those explored here. Thus, known problems such as overfitting or underfitting are minimized [67]. An EL model capable of providing the main solutions for various H&E images, with robust computational approaches for pattern recognition, can significantly improve CAD systems and make knowledge more comprehensive for specialists.
This work presents an EL approach to classify histological images from different contexts. The proposal explored multiple handcrafted features obtained through multidimensional and multiscale fractal techniques (Higuchi fractal dimension, probabilistic fractal dimension, box fusion fractal dimension, lacunarity, and percolation), Haralick and LBP descriptors, as well as deep-learned descriptors obtained from several convolutional neural network architectures. Moreover, a two-stage feature selection (ranking with metaheuristic algorithms) with a heterogeneous ensemble of classifiers completed the proposed method to indicate the best solutions. The first stage of selection was defined through the ReliefF algorithm. In the second stage, the most effective features within each reduced subset were identified by exploring particle swarm optimization (PSO), a genetic algorithm (GA), and binary gray wolf optimization (bGWO). Each result was verified through a robust ensemble process with Support Vector Machine, Naive Bayes, Random Forest, Logistic Regression, and K-Nearest Neighbors classifiers (a simplified code sketch of this pipeline is given after the list of contributions below). This proposal provided the following contributions:
An EL approach not yet explored in H&E image classification, able to identify the primary combinations of features via two-stage feature selection (ranking with metaheuristics) with a heterogeneous ensemble of classifiers;
Best ensembles of descriptors to distinguish multiple histological datasets that have been stained with H&E;
An analysis of the proposal’s usefulness against relevant models available in the specialized literature, with indications of the best performances for colorectal cancer, oral epithelial dysplasia, and gender classification from liver tissue. This was achieved by utilizing a limited number of features, ranging from 11 to 29 attributes;
A more robust baseline approach, with solutions without overfitting, which is useful in evaluating and composing new approaches for pattern recognition in histological images;
A breakdown of the main descriptors present in the best ensembles, making the knowledge comprehensive for specialists and helpful in improving CAD systems.
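To make the proposed pipeline concrete, the listing below gives a minimal sketch of the two-stage selection coupled with the heterogeneous ensemble of classifiers, assuming scikit-learn estimators and the skrebate implementation of ReliefF; the random-search wrapper is only a stand-in for the PSO, GA, and bGWO metaheuristics described in Section 2, and names such as X and y are illustrative.

```python
# Minimal sketch of the two-stage feature selection with a heterogeneous
# ensemble of classifiers. The random-search wrapper below is only a
# stand-in for the PSO/GA/bGWO metaheuristics used in the actual method.
import numpy as np
from skrebate import ReliefF                      # stage 1: ReliefF ranking
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def heterogeneous_ensemble():
    """Soft-voting ensemble with the five classifiers considered in this work."""
    return VotingClassifier(
        estimators=[
            ("svm", SVC(probability=True)),
            ("nb", GaussianNB()),
            ("rf", RandomForestClassifier(n_estimators=100)),
            ("lr", LogisticRegression(max_iter=1000)),
            ("knn", KNeighborsClassifier()),
        ],
        voting="soft",
    )


def two_stage_selection(X, y, n_ranked=100, n_candidates=50, seed=0):
    # Stage 1: keep the 100 best-ranked descriptors according to ReliefF.
    relief = ReliefF(n_features_to_select=n_ranked, n_neighbors=10)
    relief.fit(X, y)
    ranked = np.argsort(relief.feature_importances_)[::-1][:n_ranked]

    # Stage 2: wrapper search over binary masks of the ranked subset,
    # each candidate evaluated by the heterogeneous ensemble (accuracy).
    rng = np.random.default_rng(seed)
    best_mask, best_acc = None, -np.inf
    for _ in range(n_candidates):
        mask = rng.random(n_ranked) < 0.5
        if not mask.any():
            continue
        acc = cross_val_score(
            heterogeneous_ensemble(), X[:, ranked[mask]], y,
            cv=10, scoring="accuracy",
        ).mean()
        if acc > best_acc:
            best_mask, best_acc = mask, acc
    return ranked[best_mask], best_acc
```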
Section 2 presents the proposed methodology, providing information about the techniques used to compose the ensemble learning approach. Section 3 shows the results and engages in a discussion following the application of this approach. Finally, Section 4 indicates the main findings and suggestions for future exploration.
3. Results and Discussion
The ensemble learning approach was evaluated on five datasets of H&E histological images, as described in Section 2, with comparisons involving the different classes of each set. It is important to note that 45 types of tests were performed to explore different compositions of ensembles, including three associations of wrapper methods, in order to provide the main compositions among the 100 best-ranked descriptors with ReliefF. Each composition was evaluated via a heterogeneous ensemble of classifiers (Section 2.6.1). In Table 5, the average performances for each ensemble, considering the HFs and DL attributes, are shown. The best rates are highlighted in bold.
From Table 5, it is observed that the HFs and HFs+DL ensembles were responsible for the best results in four of the five H&E datasets investigated here (OED, LA, LG, and NHL). The accuracy values ranged from 90.72% to 100%. Thus, it is possible to indicate that the handcrafted descriptors explored here (via the HFs ensemble) are relevant for the classification process, whether used separately or in combination with DL features. The HFs ensemble provided the highest distinction rate on the OED dataset regardless of the wrapper selector taken as reference, indicating the optimal match for this dataset and representing a further contribution of the proposed approach. On the other hand, the HFs ensemble presented the lowest accuracy values (approximately 78%) on the NHL dataset, with three classes (CLL×FL×MCL), demonstrating a possible limit of HFs. When this category was combined with DL features (HFs+DL ensemble), the result was the best for the NHL dataset, with an accuracy of 90.72%. Even so, this result is at least 8% lower than the accuracies achieved on the other datasets. This is an important indication of the difficulties in distinguishing the CLL×FL×MCL groups, especially considering only the HFs ensemble. Finally, when the DL ensemble is considered, the best solution was achieved in a single dataset (CR) but with a remarkable rate (99.76% accuracy), illustrating its importance for the development of strategies to support the diagnosis of colorectal cancer. In addition, on this dataset, it is worth mentioning the HFs+DL ensemble as another potential solution, which achieved an accuracy of 99.58%, very close to that provided by the DL combination. This configuration represents an acceptable and common solution for different types of histological samples.
To summarize the results discussed here, the best combinations of descriptors and selection algorithms are presented in Table 6, including the total number of descriptors and the AUC and accuracy averages. The accuracy values of the top 1 and top 10 solutions are also indicated, making it possible to observe the existing variation between the first and tenth solutions in each dataset, since the averages were calculated from the 10 best-ranked compositions in each dataset. It is important to emphasize that this ranking prioritized the highest accuracy with the lowest number of descriptors.
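As a small illustration of this ranking criterion, the fragment below orders hypothetical compositions by accuracy (descending) and, on ties, by the number of descriptors (ascending), and then derives the top 1 and top 10 statistics of the kind reported in Table 6; the values and labels are invented for the example.

```python
# Hypothetical compositions for one dataset: (accuracy, number of
# descriptors, label). The ranking favors the highest accuracy and, in
# case of ties, the lowest number of descriptors; the reported averages
# are taken over the 10 best-ranked compositions.
solutions = [
    (0.981, 40, "HFs / ReliefF+GA"),
    (0.994, 22, "HFs+DL / ReliefF+bGWO"),
    (0.994, 35, "DL / ReliefF+PSO"),
    # ... remaining compositions
]
ranked = sorted(solutions, key=lambda s: (-s[0], s[1]))
top10 = ranked[:10]
top1_acc = top10[0][0]
top10_avg_acc = sum(s[0] for s in top10) / len(top10)
avg_n_features = sum(s[1] for s in top10) / len(top10)
```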
Based on the previously stated criterion, it can be reiterated that the HFs and HFs+DL ensembles were responsible for the best results in four of the five H&E-stained histological datasets investigated here. In these cases, the ReliefF + bGWO selection process stands out with three occurrences, indicating another pattern for the CR, LA, and LG datasets, with notable average AUC rates of 0.999 (LA) and 1 (CR and LG). In addition, the lowest top 10 accuracy was significant, with a rate above 98% (LA), on the dataset that involves four groups in the classification process. Moreover, the main solutions for these three datasets used a reduced number of descriptors: an average of 21 features for CR, 29 for LG, and 40 for LA.
Regarding the two-stage ReliefF + PSO, this strategy provided the main solution for the OED dataset, achieving maximum accuracy with the lowest number of descriptors among all solutions, only 11 features. When the two-stage ReliefF + GA is observed, this approach constitutes the best solution on a single dataset (NHL). In this case, the solutions explored 53 features on average, the highest number among all solutions. The top 10 accuracy was 89.57%, and the top 1 was slightly better, 92.25%, reinforcing the difficulties present in this set. The NHL dataset comprises three classes (CLL×FL×MCL) with possibly less heterogeneous histological patterns, implying more difficulty in the constitution of the solutions. Even so, in this case, the average AUC was 0.98, which is an important value under the exposed conditions. For instance, considering that the original feature space ranged from 462 to 11,086 values, the outcome achieved in this study is another contribution, capable of providing notable average performances with few but highly relevant descriptors for the classification process.
3.1. Feature Occurrences in the Main Solutions: An Overview
To identify the descriptors present in the top solutions, as summarized in Table 6, a survey of the occurrence of each category of features in the first 10 solutions of each H&E-stained histological dataset was performed. To better understand the origin of each descriptor in each solution, the occurrences of the deep-learned descriptors are shown in Figure 5 (CR, LG, and NHL datasets), and those of the handcrafted descriptors are shown in Figure 6, indicating solutions for the LA, LG, OED, and NHL datasets. It is important to highlight that the best solution for the LA dataset involved only HFs. Occurrences in NHL and LG also involved HFs due to the HFs+DL ensemble, justifying the representation of these datasets in Figure 6. Also, in these two datasets, the occurrence percentages were calculated relative to the total number of HFs in the HFs+DL ensemble, disregarding the percentages of deep-learned features.
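The occurrence percentages reported in Figures 5 and 6 can be obtained as sketched below, assuming each solution is stored as a list of feature identifiers whose prefix encodes the originating technique; the identifiers and prefixes are hypothetical.

```python
# Counting descriptor-category occurrences in the 10 best solutions of one
# dataset. Each solution is a list of selected feature names whose prefix
# identifies the originating technique (hypothetical identifiers).
from collections import Counter

top10_solutions = [
    ["lbp_12", "haralick_3", "densenet121_507", "efficientnetb2_88"],
    ["percolation_5", "lbp_40", "densenet121_1021"],
    # ... remaining eight solutions
]

counts = Counter(
    name.split("_")[0] for solution in top10_solutions for name in solution
)
total = sum(counts.values())  # for Figure 6, restrict the total to HF categories only
occurrences = {category: 100.0 * n / total for category, n in counts.items()}
print(occurrences)  # percentage of each descriptor category in the top-10 solutions
```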
Considering the distributions illustrated in Figure 5, it is possible to verify some behaviors. The lowest occurrences were those of the Inception v3 descriptors, with a maximum of 3.37% for the CR dataset and no instances among the 10 best solutions for the LG set. On the other hand, the descriptors from the DenseNet-121 and EfficientNet-B2 networks had the highest occurrences, especially for the NHL dataset, in which 63.45% of the features originated from the DenseNet-121 model. Descriptors from the EfficientNet-B2 architecture stood out in the solutions for the CR dataset, with 38.46% of the occurrences, surpassing the more homogeneous occurrences (from 17.79% to 21.63%) of the deep-learned features from ResNet-50, VGG-19, and DenseNet-121. Another homogeneous distribution can be seen in the LG dataset, involving the same descriptor origins as the CR set; in this case, occurrences ranged from 15.17% to 23.45%. When the totals of DL features versus HFs are considered, it can be seen that DL attributes predominated in the solutions for the LG and NHL datasets, with occurrences of 65.17% and 97.16%, respectively. Despite these differences, it is not possible to state that these descriptors were the most important in the classifications.
Concerning the occurrences of handcrafted descriptors (Figure 6), the lowest occurrence was that of the box-merging fractal dimension [68], since it was not selected for the top 10 solutions in three of the four histological datasets stained with H&E. This descriptor was present in the solutions for OED but with the lowest occurrence, only 1.84%. The probabilistic fractal dimension descriptor [24,41] had the second lowest occurrence but constituted the solutions for three of the four histological datasets. Another interesting result involves the enhanced version of the Higuchi fractal dimension descriptor [28], with occurrences that surpassed those of the probabilistic and box-merging fractal dimension approaches, which are widely explored in the literature [24,41,68], contributing to advancements in this particular research field. Finally, it is possible to observe the descriptors with the highest occurrences for each H&E dataset: lacunarity (45.80%) for LA; percolation (46.36%) for OED; LBPs (39.60%) for LG; and, notably, Haralick descriptors as the only ones that constituted the solutions for NHL.
3.2. Performance Overview against Different Approaches
The best performances were also compared with those obtained via traditional CNN architectures applied directly to the H&E-stained histological images. ResNet-50 [54], VGG-19 [53], Inception v3 [83], DenseNet-121 [84] and EfficientNet-B2 [85] were the models tested in this process, using the following: a fine-tuning process; k-fold cross-validation; 10 epochs; the stochastic gradient descent algorithm; an initial learning rate of 0.01, decaying by 0.75 every two epochs; and cross-entropy as the loss function. Similar experiments were described by [29]. Also, the CNN models were applied using the transfer learning strategy [82], considering each network pre-trained on the ImageNet dataset [29,32]. Thus, each dataset with a reduced number of examples was investigated using each model after a fine-tuning process, mapping the last layer of each architecture to the groups available in each H&E dataset. The final connections and their weights were updated based on the total number of classes in each context, ensuring appropriate results without overfitting. In addition, the input images were normalized according to the mean and standard deviation values of the ImageNet dataset [107]. The accuracy values provided by the networks are shown in Table 7.
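For reference, the snippet below sketches this fine-tuning protocol for one of the baselines (ResNet-50), assuming PyTorch with a recent torchvision; the dummy tensors stand in for one fold of the H&E data, and the actual k-fold splitting and image loading are omitted.

```python
# Sketch of the fine-tuning protocol used for the CNN baselines:
# ImageNet pre-training, last layer remapped to the dataset classes,
# SGD with lr=0.01 decayed by 0.75 every two epochs, cross-entropy loss.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models, transforms

num_classes = 3          # e.g., CLL x FL x MCL for the NHL dataset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ImageNet normalization, applied when loading the real H&E images
# (unused with the dummy tensors below).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Pre-trained network with the final layer remapped to the H&E classes.
model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, num_classes)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.75)

# Dummy loader standing in for one fold of the H&E training data.
dummy = TensorDataset(torch.randn(8, 3, 224, 224),
                      torch.randint(0, num_classes, (8,)))
train_loader = DataLoader(dummy, batch_size=4)

for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```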
To understand the differences between the distinction rates of the models proposed here and those obtained via the networks applied directly, we consider the average accuracy values achieved in each H&E-stained histological dataset, as summarized in Table 6. Thus, it is possible to verify that the accuracy values via the proposed approach surpass those provided by the ResNet-50, VGG-19, Inception v3, DenseNet-121, and EfficientNet-B2 networks. The classification rates with the convolutional networks ranged from 74.27% to 98.89%. Therefore, the gains in accuracy ranged from approximately 0.6% to 16%. The smallest gain occurred in the CR dataset (0.58%) and the largest (16.45%) occurred in the NHL set. For example, we increased the classification rate in the NHL dataset, which involves three classes, from 74.27% to 90.72%, illustrating an additional contribution of this study.
Regarding the noted differences, the Friedman test was applied to verify whether these solutions are statistically relevant. The Friedman test is a non-parametric statistical method capable of ranking the solutions under investigation, where the best option is set at the first position [108]. This type of test allows us to observe the variance of repeated measures and to analyze whether the existing differences are statistically significant via p-values. The smaller the p-value, the greater the evidence that the difference is statistically relevant, and a relevant difference is typically indicated when the p-value is less than the significance level (commonly 0.05). In the experiments carried out here, the resulting p-value was 0.0004, indicating that the differences between the solutions are statistically significant.
In addition, the Friedman test ranks the solutions in a table. The result involving the experiments is displayed in Table 8, with Friedman's score indicated. The solutions obtained in this study are the most relevant for each dataset. It is important to note that, when applying the Friedman test, each dataset represents a different sample (row) in relation to the corresponding solution. Each performance obtained through a solution in an H&E set has a rank value assigned based on the order of the best solutions. In the case of a tie, average ranks were assigned to the solutions. In each column, Friedman's score was calculated as the average of the ranks of the samples, providing a final score for each solution [108].
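The statistical comparison can be reproduced with SciPy as sketched below; the accuracy matrix is purely illustrative, with one row per H&E dataset (sample) and one column per compared solution.

```python
# Friedman test over an illustrative accuracy matrix:
# rows = H&E datasets (samples), columns = compared solutions
# (proposed approach followed by the five CNN baselines).
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

acc = np.array([
    [0.99, 0.96, 0.95, 0.93, 0.96, 0.97],   # CR  (illustrative values)
    [0.99, 0.94, 0.93, 0.91, 0.95, 0.95],   # LA
    [1.00, 0.97, 0.96, 0.94, 0.97, 0.97],   # LG
    [1.00, 0.95, 0.94, 0.92, 0.96, 0.96],   # OED
    [0.91, 0.74, 0.77, 0.75, 0.79, 0.80],   # NHL
])

stat, p_value = friedmanchisquare(*acc.T)   # one argument per solution (column)
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}")

# Friedman score: average rank of each solution across the datasets
# (rank 1 = best accuracy in a row; ties receive average ranks).
ranks = np.apply_along_axis(rankdata, 1, -acc)
print("Friedman scores:", ranks.mean(axis=0))
```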
Finally, we believe that the heterogeneous ensemble of classifiers was another relevant factor in achieving the results listed previously. We were able to define a combination of algorithms that supported the pattern recognition process for different types of H&E images, with a more robust and reliable system capable of covering the weaknesses that may exist in a single classifier. In addition, we believe that the bias and variance have been reduced, minimizing overfitting. More comparisons or algorithms could be considered to indicate the possible limits of each solution or even whether the main combinations are maintained when more descriptors or selection methods are used. However, the set of techniques with their associations and the experiments described here provided an important overview of the potential and discriminative capacity regarding H&E-stained histology images.
4. Conclusions
In this work, an ensemble learning method was developed using multiple descriptors (handcrafted and deep-learned features), a two-stage feature selection, and a classification process with five algorithms (heterogeneous ensemble). The approach was used to categorize H&E histological images representative of various datasets, such as colorectal cancer, liver tissue, oral dysplasia, and non-Hodgkin's lymphomas.
The best ensembles indicated average accuracy values ranging from 90.72% (NHL) to 100% (CR). Since the initial feature set was composed of 11,086 values (462 handcrafted descriptors and 10,624 deep-learned features), the best solutions used a maximum of 53 features, with the following scenarios being noteworthy: CR with only 21 descriptors via bGWO; OED with only 11 descriptors via PSO; LA with 40 descriptors via bGWO; LG with only 29 attributes via bGWO; and NHL with 53 descriptors via GA. A breakdown of the main descriptors was also presented. It was observed that deep-learned descriptors predominated in relation to handcrafted ones, especially in the solutions for the LG and NHL datasets, with occurrences of 65.17% and 97.16%, respectively. On the other hand, the best solution for the LA dataset involved only handcrafted attributes. Another interesting behavior regarding handcrafted attributes is that the improved version of the Higuchi method outperformed the occurrences of important fractal techniques, specifically the probabilistic and box-merging fractal dimensions, indicating the potential of this descriptor across multiple H&E-stained histological datasets. In addition, the handcrafted features with the highest occurrences were lacunarity (45.80%, LA dataset), percolation (46.36%, OED dataset), LBPs (39.60%, LG dataset) and Haralick descriptors (100%, NHL dataset). The indications of solutions, attributes, and occurrences represent important contributions of this study, since the composition of each model and the conditions involved are available to specialists interested in these issues.
When comparing the optimal outcomes with those achieved through CNN architectures applied directly to the H&E-stained histological datasets, it is noted that the proposed approach presented a superior performance in all conditions explored here. Moreover, regarding the performances available in the specialized literature for the same image contexts, the proposal provided the best solutions in three (CR, OED, and LG) of the five datasets, exploring from 11 (OED) to 29 (LG) features. Therefore, these results confirm the proposal as a robust baseline approach capable of providing models without overfitting, offering valuable insights for the assessment and enhancement of CAD systems tailored explicitly for H&E samples, particularly those representing CR, OED, and LG.
Finally, some issues concerning the proposed approach deserve attention. For instance, the effectiveness of parameter tuning, algorithm inclusion, and attribute selection methods may heavily depend on the dataset explored, and the solutions may not generalize well to other types of histological images. Also, the success of applying metaheuristics and other algorithms relies on their suitability for the given problem. Biases might arise if specific algorithms are more effective due to the nature of the data, potentially favoring certain types of classifiers. Moreover, the use of cutoff points for attribute selection via the ReliefF algorithm introduces a subjective element. The chosen cutoff points could impact the definition of the best attributes, leading to potential biases based on the selected thresholds.
In future work, we intend to investigate the following: the limits and impacts on the best ensembles after applying parameter tuning methods for metaheuristics, including other algorithms; a scheme that aims to understand why the features were selected, in addition to which of them are most important for the classification process; influences of cutoff points to define the best attributes via the ReliefF algorithm (first stage of selection); the discriminative power of handcrafted attributes and corresponding ensembles based on quantifications of explainable artificial intelligence representations, specifically gradient-weighted class activation mapping and locally interpretable model-agnostic explanations; the discriminative capacity of these combinations and conditions in other H&E-stained histological images; comparisons of the main results with other existing methods or algorithms commonly used in the analysis of histological images.