4.1. Datasets
Three public datasets, namely CoNSep [5], Lizard [28], and PanNuke [29,30], are used to validate the performance of AER-Net. CoNSep and Lizard are colorectal cancer datasets, while PanNuke is a pan-cancer dataset containing a variety of cancer types.
The CoNSep dataset is a colorectal nuclear segmentation and phenotypes dataset, consisting of 41 H&E-stained images scanned at 40× magnification. Each image in the dataset has a size of 1000 × 1000 pixels. Following the setup of [5], the nuclei in the CoNSep dataset are divided into four classes: epithelial, inflammatory, spindle-shaped, and miscellaneous. The training set consists of 27 images, while the test set comprises 14 images.
The Lizard dataset is the largest known dataset for nuclear segmentation and classification, comprising histology image regions of colon tissue from six different data sources. The images are at 20× magnification and provide classification labels for six types of nuclei: neutrophil, epithelial, lymphocyte, plasma, eosinophil, and connective. For the Lizard dataset, we use the data provided in the CoNIC 2022 challenge [6], which contains 4981 images of 256 × 256 pixels. To better evaluate AER-Net, we divided the 4981 images by stratified random sampling into a training set of 2988 images, a validation set of 996 images, and a test set of 997 images.
The PanNuke dataset contains H&E-stained images of 19 different tissue types, with the nuclei divided into five classes. The dataset is organized into three folds, comprising 2656, 2523, and 2722 images, respectively. Each image has a size of 256 × 256 pixels. We use fold 1 as the training set, fold 2 as the validation set, and fold 3 as the test set.
4.3. Implementation Details
We implemented our framework using PyTorch on a workstation with an NVIDIA GeForce RTX 3090 GPU (24 GB). In the experiments on the CoNSep dataset, the input size of our model and of HoverNet is 270 × 270 pixels, and the output size is 80 × 80 pixels. For the experiments on the Lizard dataset, the input and output sizes of both AER-Net and HoverNet are 256 × 256 pixels. In the experiments on the PanNuke dataset, the input size of AER-Net and HoverNet is 256 × 256 pixels, and the output size is 164 × 164 pixels. During model training, data augmentation methods such as flipping, Gaussian blurring, and median blurring are employed. The whole training process is divided into two stages of 50 epochs each, with the same settings applied in each stage; the batch size is four in both stages. In the first stage, we do not freeze the encoder but train it directly. The model parameters are initialized with weights pre-trained on the ImageNet dataset [32]. We use the Adam optimizer with an initial learning rate of 10^-4, which is reduced to 10^-5 after 25 epochs. For the loss function parameters in training on the CoNSep dataset, we set one loss weight to 2 and used default values for all other parameters; for Lizard and PanNuke training, the loss function parameters all use default values. Note that for the PanNuke dataset, the residual refinement module uses only two pooling operations, because the model output for PanNuke is 164 × 164 pixels.
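The two-stage schedule described above can be expressed as a small helper. The concrete rates used here (10^-4 dropping by a factor of ten after epoch 25 within each 50-epoch stage) follow HoverNet's published defaults and are an assumption of this sketch, not values confirmed by this section.

```python
def learning_rate(stage_epoch, base_lr=1e-4, drop_epoch=25, drop_factor=0.1):
    """Learning rate within one 50-epoch training stage, assuming a
    HoverNet-style step schedule: base_lr, then base_lr/10 after drop_epoch."""
    if not 0 <= stage_epoch < 50:
        raise ValueError("each stage runs for 50 epochs")
    return base_lr if stage_epoch < drop_epoch else base_lr * drop_factor

# Two stages of 50 epochs each, identical settings in both stages.
schedule = [(stage, epoch, learning_rate(epoch))
            for stage in (1, 2) for epoch in range(50)]
```

In PyTorch the same effect is typically achieved with `torch.optim.lr_scheduler.StepLR`, restarting the scheduler at the beginning of the second stage.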
4.4. Experimental Results
We compare AER-Net with other state-of-the-art methods on the CoNSep dataset, as illustrated in Table 1. The experimental results for the Lizard dataset are shown in Table 2. In addition, we compare the baseline (HoverNet) with AER-Net on the PanNuke dataset to validate the predictive ability of our model on different tissues.
Firstly, experiments were conducted on the CoNSep dataset, and the results in Table 1 reveal that AER-Net achieved the best performance. AER-Net achieved the highest overall score of 0.556, which is 3.5% higher than the second-highest value, indicating that AER-Net achieves excellent segmentation performance on smaller datasets such as CoNSep. The CoNSep dataset contains four types of nuclei, namely miscellaneous, inflammatory, epithelial, and spindle-shaped, whose per-class F1 classification scores are reported in Table 1. Because the miscellaneous class contains fewer nuclei, its F1 score is lower for many methods. Notably, AER-Net still performs well in classifying miscellaneous nuclei, achieving a higher F1 score than HoverNet. This indicates that our proposed method can effectively address the class imbalance of nuclei instances in the task of nuclei segmentation and classification. The comparison between AER-Net and HoverNet is depicted in Figure 6, where it can be observed that both the segmentation results and the nuclear classification predictions of AER-Net are more accurate than those of HoverNet.
We also conducted experiments on the larger Lizard dataset; the results are presented in Table 2, where the per-class columns report the F1 classification scores for neutrophil, epithelial, lymphocyte, plasma, eosinophil, and connective nuclei, respectively. The experimental results show that AER-Net achieves an overall score of 0.608 and delivers the best performance on seven of the reported metrics. In the Lizard dataset, the numbers of eosinophil and neutrophil nuclei are significantly smaller than those of the other classes. It is noteworthy that AER-Net nevertheless shows marked improvements in the F1 scores for the eosinophil and neutrophil classes, which further validates the potential of AER-Net for nuclear classification. Figure 7 illustrates the results of AER-Net compared with HoverNet; it can be observed that, despite the numerous challenges in the images, AER-Net still achieves excellent results.
A comparison of Table 1 and Table 2 reveals that as the dataset grows, AER-Net continues to achieve strong results on the nuclei segmentation and classification tasks. On the small-scale CoNSep dataset, where the number of samples is limited, AER-Net can exploit the residual refinement module to significantly improve nuclei segmentation and classification. On large-scale datasets, where the diversity and complexity of the data increase, AER-Net still performs well. Although the improvement in segmentation performance on the large-scale Lizard dataset is modest, the classification performance is still significantly improved, which shows that AER-Net can adapt to data of different scales and complexity with strong generalization ability.
To further validate the segmentation and classification ability of AER-Net on nuclei from different tissues, we performed experiments on the PanNuke dataset, as shown in Table 3, where the per-class columns report the F1 classification scores for neoplastic, inflammatory, connective-tissue, dead, and non-neoplastic nuclei, respectively. AER-Net achieved the highest overall score of 0.644, indicating that AER-Net is also competitive in nuclear instance segmentation across different tissues. Moreover, as depicted in Table 3, AER-Net demonstrates improved performance in nuclear classification compared with HoverNet. In conclusion, AER-Net achieves excellent nuclear segmentation and classification performance across different tissues. The per-class F1 scores take the type of each nucleus into account, so although the improvements in the overall segmentation metrics are modest, the improvement in the F1 score for each class also demonstrates the enhancement of AER-Net on the overall task of nuclear segmentation and classification.
4.5. Ablation Study
To analyze the performance of each module, we further conducted ablation experiments on the CoNSep dataset. We used the same experimental settings for each method in the ablation experiments to ensure a fair comparison.
In these experiments, the baseline denotes HoverNet; one variant augments the baseline with the attention-enhanced encoder; another variant adds the attention-enhanced residual refinement module to the baseline; and our full model, AER-Net, includes both the attention-enhanced encoder and the attention-enhanced residual refinement module. The results of our experiments are presented in Table 4. A powerful encoder is essential for nuclear segmentation and classification, and the experimental results show that the channel- and spatial-attention-enhanced encoder significantly improves the predictive performance of the model. The best performance is observed when both the attention-enhanced encoder and the attention-enhanced residual refinement module are used.
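As a rough illustration of the channel- and spatial-attention idea behind the enhanced encoder, the following is a CBAM-style sketch under our own assumptions; the actual AER-Net module uses learned weights and may differ in structure.

```python
import numpy as np

def channel_spatial_attention(x):
    """Apply channel attention followed by spatial attention to a feature
    map x of shape (C, H, W). Parameter-free sketch: the learned MLP and
    convolution of a real attention module are omitted."""
    # Channel attention: squeeze spatial dimensions, gate each channel.
    channel_desc = x.mean(axis=(1, 2)) + x.max(axis=(1, 2))   # shape (C,)
    channel_gate = 1.0 / (1.0 + np.exp(-channel_desc))        # sigmoid
    x = x * channel_gate[:, None, None]
    # Spatial attention: squeeze channels, gate each spatial location.
    spatial_desc = x.mean(axis=0) + x.max(axis=0)             # shape (H, W)
    spatial_gate = 1.0 / (1.0 + np.exp(-spatial_desc))
    return x * spatial_gate[None, :, :]

feat = np.random.default_rng(0).normal(size=(8, 16, 16))
out = channel_spatial_attention(feat)
```

The two gates rescale channels and spatial positions independently, which is why such modules can emphasize nucleus-relevant channels and suppress background regions without changing the feature-map shape.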
We also performed ablation experiments on the weight of the auxiliary loss in the loss function. The experimental results are shown in Table 5, and the best performance is achieved when this weight is set to 1. It should be pointed out that the weight of the auxiliary loss is only a hyperparameter, which should be adjusted accordingly for different datasets.
In addition, we adjusted the weights of the loss terms for each branch in Equation (2). The experimental results are shown in Table 6: the best results are obtained when the weights of the three branch losses are all set to 1, which is the setting used in our comparative experiments and independent validation.
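The weighting in Equation (2) amounts to a weighted sum of the three branch losses plus the auxiliary term. The function and parameter names below are our own illustrative assumptions, not the paper's notation.

```python
def total_loss(branch_losses, branch_weights=(1.0, 1.0, 1.0),
               aux_loss=0.0, aux_weight=1.0):
    """Weighted combination of the three branch losses and the auxiliary
    loss; all weights default to 1, matching the best ablation setting."""
    assert len(branch_losses) == len(branch_weights) == 3
    weighted = sum(w * l for w, l in zip(branch_weights, branch_losses))
    return weighted + aux_weight * aux_loss

# Example with made-up loss values for the three branches and the auxiliary term.
loss = total_loss((0.8, 0.5, 0.3), aux_loss=0.2)
```

With all weights at 1 the expression reduces to a plain sum of the four terms, which is consistent with Table 5 and Table 6 both favoring unit weights on the CoNSep dataset.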