1. Introduction
Allergic disease is considered a global health concern [1] and is cited by the World Health Organization (WHO) as one of the three major diseases of the 21st century [2]. Pollen allergy, commonly known as hay fever [3], has become the most widespread allergic disease with the continuous advancement of urban afforestation. Recent reports [4,5] indicate a general increase in the incidence of pollen allergy, with a profound socio-economic impact. Over 10% of adults and 40% of children worldwide are estimated to suffer from pollen allergy [4], and the annual cost of allergy management and drug therapy reaches billions of dollars [5]. Once allergic pollen comes into contact with the oral, nasal, or eye mucosa of allergic patients, it causes a spectrum of clinical symptoms (e.g., allergic rhinitis, exacerbated asthma, and conjunctivitis) [6]. Early identification and treatment can effectively improve symptom control and maintain the quality of life of allergic patients. Therefore, it is essential to provide highly allergic individuals with accurate monitoring information on the concentration of airborne pollen.
The standardized workflow for pollen concentration monitoring has been established in many countries and rests on the twin pillars of pollen sampling and pollen identification [7,8]. In China, the Durham pollen sampler [9] has been used with significant success for pollen monitoring, while the Hirst volumetric sampler [10] is regarded as the gold-standard biomonitoring device in Europe. The Burkard [11] and Lanzoni [12] samplers, both based on the Hirst sampler, are also widely used in other countries. The procedure of pollen identification is uniform across countries: it requires visual recognition of each pollen taxon under the microscope by experienced specialists. However, pollen identification is accepted as the most labor-intensive and time-consuming part of concentration monitoring. The process is fully manual and takes about half the time of the whole pollen monitoring process [13], which may delay relevant risk alerts and lead to negative health effects for allergic individuals.
Automatic pollen identification (API) aims to accelerate the delivery of up-to-date pollen concentration information to allergy sufferers [14]. The API problem was first stated by Stillman and Flenley [15] more than 20 years ago and attracted a great deal of attention in the palynology community as soon as it was proposed. API is expected to be deployed in practical operation, especially within computer-aided systems. Current API studies are mainly based on the Scanning Electron Microscope (SEM) [16,17,18,19] and the Light Microscope (LM). SEM images usually have high resolution, which makes pollen grains easy to differentiate. Although this is technically feasible, the high cost and strict requirements for sample preparation limit its practical application [20]. By contrast, LM is more common in meteorological monitoring stations owing to its simple operation, convenient deployment, and low cost [21,22]. Therefore, the LM-based API task has received increasing attention from researchers.
Advances in computer vision techniques have further promoted LM image-based API research [23]. Many scholars in this area have reached a consensus that classification and detection models can each address the API problem in a one-stage manner. Classification models focus on categorizing a given image into a specific pollen class based on its semantic content [24,25,26,27,28,29,30,31]. One of the significant successes is the classic work [29], which extracts shape and texture descriptors and then builds Support Vector Machine (SVM)-based classifiers. Such approaches rely heavily on hand-crafted priors, which makes the classification process subjective. Deep learning-based models can classify pollen grains automatically without such prior knowledge [30,31]. The convolutional neural network (CNN), a core branch of deep learning, has been widely developed and obtains impressive results [32,33,34,35,36]. For example, Sevillano et al. [30] presented three deep learning-based models that achieve 97% accuracy on the POLEN23E pollen dataset. However, these classification models consider only category information and ignore the position of the objects. Besides, they are built on a specific assumption, namely that the image dataset contains only purified and isolated pollen grains. Unlike such idealized experimental data, real-world pollen whole-slide images (WSIs) invariably contain complex impurities. When these images are fed directly into the model, the underlying assumption is not met, leading to serious misclassification.
Another dominant solution is to implement automatic pollen monitoring based on object detection models. A detection model can be seen as a generalized version of a classification model. One notable property of these models is that localization and classification can be integrated into the model as parallel branches and trained in an end-to-end manner. That is, the position and category of detected objects are obtained simultaneously. The introduction of detection models has led to many breakthroughs in the API task. These models have shown excellent ability in distinguishing pollen grains from diverse and complex backgrounds (coarse-grained: whether an object is pollen). However, different allergic pollens share similar morphological structures, such as shape and texture characteristics. It is extremely hard for detection models to focus on the detailed features that are useful for distinguishing pollen subcategories (fine-grained: which subcategory a pollen grain belongs to, such as Cupressaceae or Pinaceae), which results in over-detection and degraded identification performance.
In practice, palynologists tend to identify pollen grains progressively rather than with the one-stage approaches above. They usually focus on two pivotal questions when observing pollen under the microscope: (i) Where are the pollen grains located? (pollen localization); (ii) Which categories do these pollen grains belong to? (pollen classification). They therefore generally follow a strategy of “localization first and then classification”: the potential pollen regions are first located in the image; subsequently, the fine-grained categories of these pollen grains are determined by considering detailed features (as shown in Figure 1). The first step aims to separate all candidate pollen grains from complex impurities, while the second step matches each predicted pollen grain against the target pollen types, thus assigning a specific subcategory to each grain. To mimic this natural human process, the following two aspects are considered in the design of our identification model:
Inter-task dependence: From the observation process of “localization first and then classification”, we can infer that there are inherent associations between the detection and classification tasks. The localization information obtained from the detection procedure indicates potential pollen regions, which guides the fine-grained classifiers to capture subtle discriminative patterns in specific regions. Better identification performance can therefore be expected if the detection and classification tasks are combined, because the information from the former contributes substantially to the latter.
Intra-task reliability: Owing to the nature of pollen slide images (e.g., complex impurities and similar pollen features), there are bottlenecks within both the localization and the classification task. Specifically, it is difficult to find the potential pollen regions among diverse and complex impurities. Besides, instances of different allergic pollen look similar in global appearance and are easily misrecognized as other subcategories. Intuitively, improving the reliability of each substage helps to enhance the overall identification accuracy. Therefore, specific consideration needs to be given to enhancing the internal reliability of the detection and classification tasks.
Inspired by the above considerations, we propose a novel progressive pollen identification model that incorporates localization and classification. Unlike existing API methods that focus on a single task, our approach both exploits the inherent inter-task correlation to combine location information with classification details and enhances the intra-task performance of each stage by introducing multi-scale detection and a multi-classifiers combination. Specifically, data preprocessing is first adopted to cut WSIs into patches and filter out the useless patches containing blank backgrounds. Then, we leverage a multi-scale object detection model to detect informative regions that are highly correlated with pollen in the candidate patches (coarse-grained pollen localization). Finally, each region containing a pollen grain is fed into a multi-classifiers combination to obtain the final pollen identification results (fine-grained pollen classification). In this way, the localization information serves as region guidance for the classification stage, making the classifiers pay more attention to subtle features of specific local regions. Both the localization information extracted by pollen detection and the fine-grained categorical information learned by deep learning models, such as texture, contour, and color, are exploited. The contributions of this paper can be summarized as follows:
Considering the inter-task dependence of pollen detection and classification, we present a novel multilevel progressive learning framework to achieve automatic allergic pollen identification from real-world LM images. The pollen WSIs are first preprocessed to filter out the useless patches containing blank backgrounds. The coarse-grained localization provides coarse position information that indicates the pollen regions (targeting the “localization problem”), and the multi-classifiers combination is utilized to learn detailed discriminative features related to each pollen subcategory (targeting the “classification problem”).
Considering the intra-task reliability of pollen detection and classification, multi-scale and multi-classifiers feature learning methods are introduced to reinforce pollen identification performance. The multi-scale feature fusion helps to localize pollen regions among complex impurities by enlarging the receptive field, while the multi-classifiers feature representation combines different base classifiers in parallel, making the model more effective in distinguishing different allergic pollen from each other.
Extensive experiments are conducted on a real-world pollen dataset, which includes 2971 WSIs labeled with 8 + X categories (all other unknown pollen and debris are aggregated in an “X” category). The comparison experiments and ablation studies demonstrate the effectiveness and superiority of our proposed method.
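The progressive “localization first and then classification” flow summarized above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name `identify_pollen` and the three callables (`is_foreground`, `detect_pollen`, `classify_region`) are hypothetical stand-ins for the background-filtering CNN, the multi-scale detector, and the multi-classifiers combination.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), an assumed box layout


def identify_pollen(
    patches: Sequence[object],
    is_foreground: Callable[[object], bool],
    detect_pollen: Callable[[object], List[Box]],
    classify_region: Callable[[object, Box], str],
) -> List[Tuple[int, Box, str]]:
    """Run the three stages in order and return (patch index, box, label) triples."""
    results: List[Tuple[int, Box, str]] = []
    for index, patch in enumerate(patches):
        # Stage 1 (data preprocessing): skip patches judged to be blank background.
        if not is_foreground(patch):
            continue
        # Stage 2 (coarse-grained localization): find candidate pollen regions.
        for box in detect_pollen(patch):
            # Stage 3 (fine-grained classification): label each region,
            # using the box as region guidance for the classifier.
            results.append((index, box, classify_region(patch, box)))
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular network; each stage can be swapped or ablated without touching the control flow.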
3. Results
In this section, we carry out extensive experiments to evaluate our proposed approach. Separate experiments are set up for the three stages: background classification, coarse-grained pollen localization, and fine-grained subcategory classification. Besides, the overall identification performance is also evaluated, which further reveals the prediction capability of our proposed method.
3.1. Experiment Settings
We construct three image subsets from our APD-WSI dataset, named D1, D2, and D3, for training and testing each stage of the proposed model; all are labeled by professional palynologists. Specifically, the D1 dataset consists of 18,000 background images and 18,175 foreground images. It is used to train and test the CNN responsible for filtering background patches, where each image is labeled with a binary value of 1 or 0. The D2 dataset contains 10,324 impurity, 10,893 bubble, and 11,500 pollen images. These data serve the coarse-grained detection model and carry the class labels and coordinates of all ground-truth bounding boxes. The D3 dataset includes 10,080 images covering eight pollen subcategories and provides the training and testing data for the fine-grained classification task. The detailed information of the three datasets is shown in Table 2.
Figure 6 exhibits the data distribution of the D1, D2, and D3 datasets. In the D3 dataset, the Populus and Salix classes have fewer samples than the other categories. Data augmentation operations, including random vertical flip, horizontal flip, and 90-degree rotation, are therefore employed to balance the data sampling. Ten-fold cross-validation is applied in this paper for evaluation purposes.
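As an illustration of the three augmentation operations named above, the following sketch applies them to a tiny image stored as nested lists. The oversampling helper `oversample` and its `target` parameter are assumptions made for this example; the text does not specify how many augmented samples are generated per class.

```python
import random
from typing import List

Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def vertical_flip(image: Image) -> Image:
    """Flip top to bottom by reversing the row order."""
    return [list(row) for row in image[::-1]]


def horizontal_flip(image: Image) -> Image:
    """Flip left to right by reversing every row."""
    return [row[::-1] for row in image]


def rotate_90(image: Image) -> Image:
    """Rotate 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(row) for row in zip(*image[::-1])]


def oversample(images: List[Image], target: int, seed: int = 0) -> List[Image]:
    """Add randomly augmented copies until the class holds `target` images."""
    if not images:
        raise ValueError("need at least one source image")
    rng = random.Random(seed)
    operations = [vertical_flip, horizontal_flip, rotate_90]
    augmented = list(images)
    while len(augmented) < target:
        augmented.append(rng.choice(operations)(rng.choice(images)))
    return augmented
```

In this sketch a minority class would be passed through `oversample` with `target` set to the size of a larger class, so that every class contributes a similar number of training images.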
To speed up model convergence, weights pre-trained on ImageNet are used for initialization in our experiments. For parameter training, Stochastic Gradient Descent (SGD) is selected as the optimizer with a batch size of 16. The initial learning rate and momentum are set to 0.001 and 0.9, respectively. At each iteration, the loss of the model is recorded to show its variation during training. The maximal number of iterations is set to 100, which is the default value. Training continues until the validation loss no longer decreases between consecutive training cycles. Our proposed approach is implemented using the PyTorch framework.
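To make the optimizer settings and the stopping rule concrete, the sketch below writes out one SGD-with-momentum update and the stopping test in plain Python. It is an illustration only: the update follows the common convention of accumulating the gradient into a velocity buffer without dampening, and the helper names `sgd_momentum_step` and `should_stop` are hypothetical. The actual experiments rely on the PyTorch optimizer.

```python
from typing import List

LEARNING_RATE = 0.001   # initial learning rate stated in the text
MOMENTUM = 0.9          # momentum stated in the text
MAX_ITERATIONS = 100    # maximal number of iterations stated in the text


def sgd_momentum_step(
    params: List[float],
    grads: List[float],
    velocity: List[float],
    lr: float = LEARNING_RATE,
    momentum: float = MOMENTUM,
) -> None:
    """Update `params` and `velocity` in place for one optimization step."""
    for i, grad in enumerate(grads):
        velocity[i] = momentum * velocity[i] + grad  # accumulate past gradients
        params[i] -= lr * velocity[i]                # move against the velocity


def should_stop(validation_losses: List[float]) -> bool:
    """Stop once the latest validation loss is not lower than the previous one."""
    return len(validation_losses) >= 2 and validation_losses[-1] >= validation_losses[-2]


# Toy usage: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
weights, velocity, history = [0.0], [0.0], []
for _ in range(MAX_ITERATIONS):
    sgd_momentum_step(weights, [2.0 * (weights[0] - 3.0)], velocity)
    history.append((weights[0] - 3.0) ** 2)  # stands in for a validation loss
    if should_stop(history):
        break
```

Comparing only consecutive cycles mirrors the rule stated above; a patience window over several cycles would be a natural variant but is not described in the text.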
The data preprocessing stage crops the image into patches and distinguishes background from foreground patches with a novel CNN. Different patch sizes (256, 512, and 1024 pixels) are used to train the proposed CNN model. The accuracies of the background classification model are 90.94%, 94.82%, and 89.52% for patch sizes of 256, 512, and 1024 pixels, respectively. The patch size of 512 yields the best distinguishing ability. Accordingly, a patch size of 512 is chosen to generate patch sequences for subsequent pollen identification.
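The patch-size selection and the cropping step can be sketched as follows. The accuracies are the three values reported above; the tiling scheme (non-overlapping patches, with incomplete border tiles discarded) is an assumption of this example, since the text does not state how image borders are handled, and `patch_boxes` is a hypothetical helper name.

```python
from typing import Dict, Iterator, Tuple

# Background-classification accuracy (%) per patch size, as reported in the text.
ACCURACY_BY_PATCH_SIZE: Dict[int, float] = {256: 90.94, 512: 94.82, 1024: 89.52}

# Pick the patch size with the highest accuracy.
BEST_PATCH_SIZE = max(ACCURACY_BY_PATCH_SIZE, key=ACCURACY_BY_PATCH_SIZE.get)


def patch_boxes(
    width: int, height: int, patch_size: int = BEST_PATCH_SIZE
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (left, top, right, bottom) boxes of non-overlapping square patches.

    Assumption: tiles that would extend past the image border are skipped.
    """
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            yield (left, top, left + patch_size, top + patch_size)
```

Each yielded box can then be cropped from the slide image and passed to the background classifier.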
Table 2.
The detailed description of three sub-datasets for training and testing our proposed model.
| Dataset | Aim | Data Distribution (Class-Number) | Size (pixels) |
|---|---|---|---|
| D1 | Data preprocessing | Background-18,000/Foreground-18,175 | 512 × 512 |
| D2 | Coarse-grained pollen localization | Pollen-11,500/Impurity-10,324/Bubble-10,893 | 512 × 512 |
| D3 | Fine-grained pollen classification | Artemisia-1640/Gramineae-1580/Chenopodiaceae-1930/Cupressaceae-1750/Pinaceae-1400/Populus-910/Salix-870/Moraceae-1020 | 100 × 100 |
3.2. Evaluation on Coarse-Grained Pollen Localization
The coarse-grained pollen localization described in Section 2.4 locates individual pollen grains in each image patch. Unlike general detection models, it additionally uses multi-scale features to better discriminate pollen grains from complex backgrounds. In this section, we compare our multi-scale detection model with other state-of-the-art detectors. The Average Precision (AP) is used to evaluate the detection performance in our experiment. The AP is the precision averaged over Intersection over Union (IoU) thresholds from 0.5 to 0.95 in steps of 0.05. The IoU, which ranges from 0 to 1, is the ratio of the overlapping area of the ground-truth and predicted regions to the area of their union. Figure 7 gives a graphical interpretation of IoU, and its formula is given in Equation (4). Additionally, AP50 and AP75 are adopted in our experiments, which denote the average precision at IoU thresholds of 0.5 and 0.75, respectively.
Table 3 compares the detection performance of our multi-scale detector with that of other state-of-the-art detection models, including Fast RCNN [52], Faster RCNN [53], the YOLO family [54], and SSD [55]. For a fair comparison, the other models are retrained on our D2 training set. As the experimental results show, our proposed detection model achieves higher AP, AP50, and AP75 values than the other detection algorithms. In particular, the AP of our model is 0.034 higher than that of the best competing method, which demonstrates the effectiveness of our approach. We additionally evaluate the impact of the multi-scale feature enhancement mechanism on pollen detection. To this end, we remove the feature pyramid construction and compare the pollen localization results with and without the multi-scale feature fusion. The results are shown in Table 4. As shown, the detection performance is significantly enhanced by introducing multi-scale feature fusion, and the AP value decreases once it is removed. These results verify the necessity of feature pyramid construction for pollen localization in our approach.
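The IoU definition and the threshold schedule used for AP in this subsection can be written out as a short sketch. The box layout `(x_min, y_min, x_max, y_max)` and the helper names are assumptions for illustration, and the per-threshold AP values are taken as given inputs rather than computed from precision-recall curves.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout

# IoU thresholds from 0.5 to 0.95 in steps of 0.05, as described in the text.
IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(pred: Box, truth: Box) -> float:
    """Overlap area divided by union area, a value in [0, 1]."""
    inter_w = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    inter_h = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    intersection = inter_w * inter_h
    union = box_area(pred) + box_area(truth) - intersection
    return intersection / union if union > 0 else 0.0


def average_over_thresholds(ap_per_threshold: Dict[float, float]) -> float:
    """Average per-threshold AP values (keyed by IOU_THRESHOLDS) into one AP."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)
```

In this notation, AP50 and AP75 correspond to the entries of `ap_per_threshold` at the thresholds 0.5 and 0.75.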
3.3. Evaluation on Fine-Grained Pollen Classification
The fine-grained pollen classification aims to solve the problem of allergic pollen identification. Unlike current classification models, our module integrates multiple classifiers to strengthen the discrimination between different pollen subcategories. The classification performance is quantified by the accuracy metric.
We first investigate the impact of different CNN-based classifiers on the classification performance for various pollen subcategories. The state-of-the-art classification networks AlexNet, VGG-16, ResNet-50, and DenseNet-121 are selected as the compared models in this experiment. Each network is trained for a binary classification task, i.e., whether a specific type of pollen grain is present or not. Here, we consider eight pollen subcategories as our main experimental targets, which is consistent with the dataset described earlier. The accuracy of each model on the different subcategories is shown in Table 5. As expected, different classifiers have varying abilities to identify the various pollen subcategories. Specifically, DenseNet achieves the best performance on Cupressaceae and Populus classification, with accuracies of 91.80% and 80.22%, yet it performs worse than ResNet when identifying Chenopodiaceae and Salix. Moreover, VGG-16 obtains the best performance in predicting Artemisia, Gramineae, and Moraceae, with 87.50%, 85.10%, and 72.10% accuracy, respectively. AlexNet does not achieve the best performance on any category. These findings have implications for the design of the multi-classifiers combination. A combination of multiple classifiers is usually either homogeneous (using the same type of classification subcomponents) or heterogeneous (using different types of classification subcomponents). Because the accuracies of the different CNN-based classifiers vary across classes, the heterogeneous structure is ultimately applied in our multi-classifiers combination to improve predictive performance.
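One way to picture a heterogeneous combination is to pick, for each subcategory, the network with the highest binary accuracy and then let the selected experts vote by score. The sketch below illustrates that idea only. The selection rule, the arg-max fusion, and the names `select_experts` and `predict` are assumptions for this example, and the fusion rule used in our model may differ.

```python
from typing import Callable, Dict, Mapping, Tuple


def select_experts(accuracy: Mapping[str, Mapping[str, float]]) -> Dict[str, str]:
    """Map each class to the model with the highest accuracy on that class.

    `accuracy[model][cls]` is the binary accuracy of `model` on class `cls`.
    """
    experts: Dict[str, str] = {}
    best: Dict[str, float] = {}
    for model, per_class in accuracy.items():
        for cls, value in per_class.items():
            if cls not in best or value > best[cls]:
                best[cls] = value
                experts[cls] = model
    return experts


def predict(
    image: object,
    experts: Mapping[str, str],
    scorers: Mapping[Tuple[str, str], Callable[[object], float]],
) -> str:
    """Return the class whose expert reports the highest score for `image`.

    `scorers[(model, cls)]` is a hypothetical binary classifier returning the
    score that `image` shows class `cls`.
    """
    scores = {cls: scorers[(model, cls)](image) for cls, model in experts.items()}
    return max(scores, key=scores.get)
```

With the accuracies reported above, such a selection would assign Cupressaceae and Populus to DenseNet and Artemisia, Gramineae, and Moraceae to VGG-16, which is why a heterogeneous structure arises naturally.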
We further explore the identification performance of our proposed multi-classifiers combination. The standard classification models are used as the base classifiers in this section. Unlike the validation trial in Table 5, the classification task here is extended to multi-class classification, which ensures a fair comparison in our experiment. Both the multi-classifiers combination and the single multi-class classifiers assign the input image to one of the main pollen subclasses of interest (eight subclasses are studied in our experiment). Table 6 presents the results of the comparative analyses, and Figure 8 shows the distribution of the ten-fold cross-validation results of our proposed model. Our multi-classifiers combination clearly achieves better identification performance than the other models across all categories.
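The ten-fold cross-validation protocol behind these distributions can be sketched as an index split. This is a plain, unstratified split written for illustration; whether the folds in our experiments are stratified by class is not stated here, and `k_fold_indices` is a hypothetical helper name.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test
```

Each sample is held out exactly once, so the ten per-fold accuracies form the distribution that a figure such as Figure 8 summarizes.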
3.4. Evaluation on Overall Identification Performance
In the previous subsections, we demonstrated that intra-task reliability can be improved by multi-scale feature enhancement and multi-classifiers integration. Further experiments are performed in this section to validate the benefit of leveraging inter-task dependence. In our approach, the final prediction results are generated by progressing through three stages, i.e., data preprocessing, coarse-grained pollen localization, and fine-grained pollen classification. Additionally, we add a direct sampling baseline, which only cuts the image into patches without patch filtering, for comparison with our data preprocessing method. Table 7 shows the classification performance of the four cases, which reflects the inter-task dependence. The dispersion of the ten-fold cross-validation ablation results of our model is shown in Figure 9. We can observe that the accuracy of “Data preprocessing + Coarse-grained localization + Fine-grained classification” is higher than that of “Direct sampling + Coarse-grained localization + Fine-grained classification”. This indicates that patch filtering contributes to improving identification accuracy. The same conclusion is supported by the accuracy comparison between “Data preprocessing + Fine-grained classification” and “Direct sampling + Fine-grained classification”. As shown in Figure 10, the examples are all mistaken for pollen by “Direct sampling + Fine-grained classification”, but they are correctly detected as background and filtered out by the patch filtering in “Data preprocessing + Fine-grained classification”. These results show the reinforcing relation between image preprocessing and the subsequent tasks: preprocessing the original image effectively reduces the noise interference from complex backgrounds, which is of great significance for overall performance. The guidance relationship between the localization task and the classification task is explored as well. As seen in Table 7, the “Data preprocessing + Coarse-grained localization + Fine-grained classification” combination boosts the categorization accuracy significantly, bringing a 2.5% improvement over “Data preprocessing + Fine-grained classification”. We found that “Data preprocessing + Fine-grained classification” fails to identify the samples in Figure 11 as the impurity class, whereas the “Data preprocessing + Coarse-grained localization + Fine-grained classification” combination detects them accurately. This suggests that the localization information from the detection task serves as region guidance for the classification stage, allowing the classifiers to perform better by focusing on subtle features of specific local regions.