AWdpCNER: Automated Wdp Chinese Named Entity Recognition from Wheat Diseases and Pests Text
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset Construction and Characteristic Analysis
2.1.1. Corpus Collection and Pre-Processing
2.1.2. Corpus Annotation
2.1.3. Analysis of Corpus Characteristics
- Some entities in the WdpDs dataset had unclear boundaries, so they were easily segmented incorrectly, as in “3% clozophos (Mille) granules”.
- Wheat disease and pest entities had complex structures, and some combined numbers, letters, and Chinese characters, as in “Xinong 6082” and “75% Malathion oil”.
- Some entities in the corpus of wheat diseases and insect pests were nested. For example, the pathogen entity “wheat yellow dwarf virus” was nested in the disease entity “wheat yellow dwarf disease”.
- The dataset of wheat diseases and pests contained many entity categories. The constructed WdpDs dataset contained 21 types of entities, more than the JE-DPW dataset in the same domain [16].
2.2. AWdpCNER Model
2.2.1. Data Augmentation
- Data augmentation Method 1 (DA1): The text paragraphs in the original dataset were randomly shuffled while keeping each sentence sequence intact as far as possible, and the shuffled paragraphs were appended to the original dataset.
- Data augmentation Method 2 (DA2): An entity was randomly selected from the wheat diseases and pests text data, a synonym of that entity was randomly selected from the constructed domain dictionary, WdpDict, to replace it, and the resulting text was appended to the original dataset.
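The two augmentation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the segment-based sentence representation, and the sample dictionary entry are all hypothetical.

```python
import random


def augment_by_shuffling(paragraphs, rng):
    """DA1 sketch: append a shuffled copy of the paragraphs to the originals.

    Each paragraph is moved as a unit, so sentence order inside it is kept.
    """
    shuffled = list(paragraphs)
    rng.shuffle(shuffled)
    return list(paragraphs) + shuffled


def augment_by_synonym(segments, wdp_dict, rng):
    """DA2 sketch: replace one randomly chosen entity with a dictionary synonym.

    `segments` is a list of (text, label) pairs; the label "O" marks
    non-entity text. A new list is returned and the input is left unchanged.
    """
    candidates = [
        i for i, (text, label) in enumerate(segments)
        if label != "O" and wdp_dict.get(text)
    ]
    if not candidates:
        return list(segments)
    i = rng.choice(candidates)
    text, label = segments[i]
    replaced = list(segments)
    replaced[i] = (rng.choice(wdp_dict[text]), label)
    return replaced


rng = random.Random(0)  # fixed seed so the sketch is reproducible

paragraphs = ["p1", "p2", "p3"]
augmented = augment_by_shuffling(paragraphs, rng)

# Hypothetical WdpDict entry and sentence, for illustration only.
wdp_dict = {"蚜虫": ["油虫"]}
sentence = [("小麦", "CROP"), ("蚜虫", "PES"), ("为害叶片", "O")]
new_sentence = augment_by_synonym(sentence, wdp_dict, rng)
```

Keeping the entity label attached to the replaced text means the BIO tags can be regenerated from the segments after replacement, even when the synonym has a different length.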
2.2.2. ALBERT Layer
- Factorized word embeddings: In the ALBERT model, the one-hot vector was first mapped to a low-dimensional embedding space and then to the hidden layer. The resulting change in embedding parameter complexity from the BERT model to the ALBERT model is shown in Equation (1), where V is the vocabulary size, H is the hidden size, and E is the embedding size (E ≪ H): O(V × H) → O(V × E + E × H) (1)
- Cross-layer parameter sharing: In ALBERT, the parameters of both the fully connected layers and the attention layers were shared across layers, that is, all parameters in the Encoder were shared. This greatly reduced the number of model parameters and improved the training speed, but the reduced parameter count also degraded performance.
- Sentence order prediction: To compensate for the performance loss caused by the reduced number of parameters, ALBERT introduced the inter-sentence coherence task SOP (sentence order prediction). Unlike the NSP (next sentence prediction) task of the original BERT model, SOP removed the influence of topic prediction and kept only the prediction of inter-sentence coherence.
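The effect of the first two techniques on parameter count can be checked with simple arithmetic. The sketch below is illustrative only: the sizes V = 30000, E = 128, H = 768, the 12-layer depth, and the per-layer parameter figure are assumed values, not settings reported for AWdpCNER.

```python
def direct_embedding_params(vocab_size, hidden_size):
    """BERT-style embedding: one V x H matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
    """ALBERT-style embedding: a V x E matrix followed by an E x H projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(per_layer_params, num_layers, shared):
    """With cross-layer sharing, one layer's weights are reused by every layer."""
    return per_layer_params if shared else per_layer_params * num_layers


# Assumed sizes, for illustration only.
V, E, H = 30000, 128, 768
direct = direct_embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)
print(f"embedding parameters: {direct:,} -> {factorized:,}")

# Hypothetical per-layer parameter count and depth.
per_layer, layers = 7_000_000, 12
print(f"encoder parameters: {encoder_params(per_layer, layers, False):,} "
      f"-> {encoder_params(per_layer, layers, True):,}")
```

The saving from factorization grows with the gap between E and H, which is why the embedding size is chosen much smaller than the hidden size.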
2.2.3. BiLSTM Layer
2.2.4. CRF Layer
- The first word in a sentence always begins with the label “B-” or “O”, not “I-”.
- In the label sequence “B-label1 I-label2 I-label3 I-...”, label1, label2, and label3 should be the same entity type. For example, “B-DIS I-DIS” was a valid label sequence, while “B-DIS I-DRU” was an invalid label sequence.
- The first label of an entity should start with “B-”, not “I-”. For example, “O B-DIS” was a valid label sequence, while “O I-DIS” was an invalid label sequence.
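The three constraints above can be stated as a small validity check. This is a sketch for clarity only: a CRF layer typically learns such constraints as transition scores during training rather than applying a hard-coded checker, and the function name `is_valid_bio` is hypothetical.

```python
def is_valid_bio(labels):
    """Return True if a BIO label sequence satisfies the constraints above."""
    prev = "O"  # treat the position before the sentence as "O"
    for label in labels:
        if label.startswith("I-"):
            entity_type = label[2:]
            # "I-X" is legal only directly after "B-X" or "I-X" of the same type,
            # which also rules out "I-" at the start of a sentence or entity.
            if prev not in ("B-" + entity_type, "I-" + entity_type):
                return False
        prev = label
    return True


print(is_valid_bio(["B-DIS", "I-DIS"]))
print(is_valid_bio(["B-DIS", "I-DRU"]))
print(is_valid_bio(["O", "I-DIS"]))
```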
2.2.5. Rules Amendment
- For pest and disease entities, if a harmful crop entity appeared immediately before them, the two were labeled as a single entity. For symptom entities, if an organ entity appeared in the adjacent vocabulary, the whole was relabeled as an organ symptom entity. During rule correction, a sliding window of size 1 was centered on the keyword to search its context for entities. If the adjacent predicted labels were entity terms, the corresponding rule was looked up and the terms were merged into the related entity; otherwise, the original prediction was kept. The specific rules are shown in Table 2.
- Disease names often ended with the word “disease”. The last word of such an entity was concatenated with the immediately following word, and if the two formed a complete word, the result was taken as the whole prediction. Drug names usually consisted of a concentration followed by the drug name, so a regular expression was written to recognize the numbers, symbols, and Chinese characters as a whole. The specific rules are shown in Table 3.
- All predictions of the AWdpCNER model were checked and amended for invalid label sequences, such as “B-label1 I-label2” in which label1 and label2 are different entity types, and entities that begin with “I-label”.
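A minimal sketch of the merge rules and the drug pattern follows. It assumes predictions are available as character spans `(start, end, label)` with an exclusive end, and it approximates the size-1 sliding window by requiring the two spans to touch. The function name, the span format, and the Chinese example strings are hypothetical; the merge table and the regular expression are taken from Tables 2 and 3.

```python
import re

# Merge rules from Table 2: (left label, right label) -> merged label.
MERGE_RULES = {
    ("CROP", "DIS"): "DIS",
    ("CROP", "PES"): "PES",
    ("ORG", "SYM"): "OSYM",
}


def merge_adjacent_entities(spans):
    """Merge directly adjacent spans whose label pair matches a rule.

    `spans` must be sorted by start offset; spans that do not touch,
    or whose labels match no rule, are kept as predicted.
    """
    merged = []
    for start, end, label in spans:
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            target = MERGE_RULES.get((prev_label, label))
            if target is not None and prev_end == start:
                merged[-1] = (prev_start, end, target)
                continue
        merged.append((start, end, label))
    return merged


# Drug rule from Table 3: a concentration, an optional "name·" part,
# and a lookahead requiring a Chinese character to follow.
DRUG_PREFIX = re.compile(
    r"\d+(?:\.\d+)?%(?:[\u4e00-\u9fa5]+·)?(?=[\u4e00-\u9fa5])"
)

# Hypothetical example inputs.
sentence = "小麦白粉病是一种病害"          # "Wheat powdery mildew is a disease"
spans = [(0, 2, "CROP"), (2, 5, "DIS")]
merged = merge_adjacent_entities(spans)
prefixes = DRUG_PREFIX.findall("50%辛硫磷乳油和20%菊·马乳油")
```

Here `merged` covers the whole of “小麦白粉病” as one DIS span, and `prefixes` holds the concentration parts that the pattern attaches to the following drug name.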
3. Results
3.1. Experimental Parameters Setup
3.2. Experiment Results
3.2.1. Performance Comparison of Different Models
3.2.2. Comparison of Results of Different Data Augmentation Methods
3.2.3. Entity Recognition Results for AWdpCNER Model
3.2.4. Comparison of Entity Recognition Results before and after Rules Amendment
4. Discussion
5. Conclusions
- To address the problems of Chinese named entity recognition in the field of wheat diseases and pests, including the lack of training data, many proper nouns, diverse entity categories, and uneven entity distribution, the AWdpCNER model was proposed. The model combined two data augmentation methods to expand the semantic information of sentences, which improved its accuracy on entity categories with few samples and effectively addressed named entity recognition in the small-sample setting. The model reached a precision of 94.76%, a recall of 95.64%, and an F1-score of 95.29%.
- The dynamic word embedding vector was obtained based on the lightweight ALBERT model, which could capture the entity context features, enrich the semantic representation of wheat disease and pest text, effectively alleviate the problem of polysemy representation, and improve the model recognition performance.
- The specific rules were defined to modify the prediction results of the ALBERT-BiLSTM-CRF model. The experiment proved that the rule amendment alleviated the problems of fuzzy entity boundary and nesting among entities and, to a certain extent, optimized the model performance.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Algorithm A1 The pseudocode of wheat disease and pest named entity recognition task |
Input: wheat disease and pest sentence S, and its ground-truth labels Y. |
Output: the predicted entity labels, the best training weight of the model |
|
References
- Ren, Y.; Yu, H.; Yang, H.; Liu, J.; Yang, H.; Sun, Z.; Zhang, S.; Liu, M.; Sun, H. Recognition of quantitative indicator of fishery standard using attention mechanism and the BERT+BiLSTM+CRF model. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2021, 37, 135–141.
- Wang, Y.; Zhang, C.; Bai, F.; Wang, Z.; Ji, C. Review of Chinese Named Entity Recognition Research. J. Front. Comput. Sci. Technol. 2023, 17, 324–341.
- Liu, X.; Zhang, M.; Gu, Q.; Ren, Y.; He, D.; Gao, W. Named Entity Recognition of Fresh Egg Supply Chain Based on BERTCRF Architecture. Trans. Chin. Soc. Agric. Mach. 2021, 52, 519–525.
- Yang, Y.; Li, Y.; Zhong, X.; Xu, L. Named Entity Recognition of TCM Medical Records Based on BiLSTM-CRF. Inf. Tradit. Chin. Med. 2021, 38, 15–21.
- Xu, L.; Li, J. Biomedical named entity recognition based on BERT and BiLSTM-CRF. Comput. Eng. Sci. 2021, 43, 1873–1879.
- Shen, T.; Yu, L.; Jin, L.; Huang, F.L.; Xv, H.Q. Chinese entity recognition based on BERT-BiLSTM-CRF model. J. Qiqihar Univ. (Nat. Sci. Ed.) 2022, 38, 26–32.
- Malarkodi, C.S.; Lex, E.; Devi, S.L. Named Entity Recognition for the Agricultural Domain. Res. Comput. Sci. 2016, 117, 121–132.
- Guo, X.; Zhou, H.; Su, J.; Hao, X.; Tang, Z.; Diao, L.; Li, L. Chinese agricultural diseases and pests named entity recognition with multi-scale local context features and self-attention mechanism. Comput. Electron. Agric. 2020, 179, 105830.
- Yan, L. Automatic Question Answering System for Grape Diseases and Pests Based on Knowledge Graph. Master’s Thesis, College of Information Engineering Northwest A&F University, Yangling, China, 2021.
- Yu, H.; Shen, J.; Bi, C.; Liang, J.; Chen, H. Intelligent diagnostic system for rice diseases and pests based on knowledge graph. J. South China Agric. Univ. 2021, 42, 105–116.
- Li, Y. Research on the Construction of Knowledge Graph of Crop Diseases and Pests. Master’s Thesis, Agricultural Information Institute Graduate School, Anyang, China, 2021.
- Ren, N.; Bao, T.; Shen, G.; Guo, T. Fine-Grained Named Entity Recognition Based on Deep Learning: A Case Study of Tomato Diseases and Pests. Inf. Sci. 2021, 39, 96–102.
- Zheng, Y.; Wu, H.; Zhu, D.; Chen, B.; Li, W. Question and Answer System Based on the Knowledge Graphs of Litchi and Longan Diseases and Insect Pests. Comput. Digit. Eng. 2021, 49, 2618–2622.
- Wang, Q.; Liu, Y.; Yang, N. Identification and Control of Wheat Diseases and Pests; Ningxia People’s Publishing House: Ningxia, China, 2009.
- Shang, H.; Wang, F. Atlas of Diagnosis and Control of Wheat Diseases and Pests; Jindun Publishing House: Beijing, China, 2015.
- Shen, L.; Jiang, H.; Hu, B.; Xie, C. A study on joint entity recognition and relation extraction for rice diseases pests weeds and drugs. J. Nanjing Agric. Univ. 2020, 43, 1151–1161.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
- Pennington, J.; Socher, R.; Manning, C. Glove: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
- Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610.
- Lafferty, J.; McCallum, A.; Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th International Conference on Machine Learning 2001, Williamstown, MA, USA, 28 June–1 July 2001; pp. 282–289.
- Strubell, E.; Verga, P.; Belanger, D.; McCallum, A. Fast and accurate entity recognition with iterated dilated convolutions. arXiv 2017, arXiv:1702.02098.
- Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991.
- Dai, Z.; Wang, X.; Ni, P.; Li, Y.; Li, G.; Bai, X. Named entity recognition using BERT BiLSTM CRF for Chinese electronic health records. In Proceedings of the 2019 12th International Congress on Image and Signal Processing, Biomedical Engineering and Informatics (CISP-BMEI), Suzhou, China, 19–21 October 2019; IEEE: Toulouse, France, 2019; pp. 1–5.
ID | Entity Name | Entity Tags | Number of Entities | Examples |
---|---|---|---|---|
1 | Disease | DIS | 2539 | powdery mildew, scab |
2 | Disease class | DIS_CLA | 173 | Fungal disease, nematode disease |
3 | Pest | PES | 2090 | aphids, armyworm |
4 | Pest class | PES_CLA | 156 | Underground pests, leaf pests |
5 | Pests time cycle | PES_TIM | 1619 | adults, nymphs |
6 | Pathogeny | PAT | 407 | brucella gramineae |
7 | Pathogeny class | PAT_CLA | 311 | fungi, viruses |
8 | Wheat organ | ORG | 2363 | leaf, stem |
9 | Drugs (Termiticides) | DRU | 1218 | triadimefon powder |
10 | Agricultural control | CON | 439 | resistant varieties, watering |
11 | Wheat growth time | TIM | 601 | jointing stage, grain-filling stage |
12 | Wheat variety | WHE | 456 | yumai 18, jimai38 |
13 | Wheat area | ARE | 276 | northwest spring wheat area |
14 | Symptom | SYM | 1036 | dry death |
15 | Organ symptoms | OSYM | 1837 | yellowing of the leaf |
16 | Harmful crops | CROP | 1066 | Wheat, corn |
17 | Harmful region | REG | 701 | Henan, Zhengzhou |
18 | Genus | GEN | 203 | hemiptera, lepidoptera |
19 | Family | FAM | 201 | noctuidae, culicidae |
20 | Other name | OTHN | 240 | Wheat ear dry, oil worm |
21 | Enemy | ENE | 195 | Seven-star ladybird, hoverfly |
Rule Definition | Entity Examples |
---|---|
CROP + DIS = DIS | Wheat powdery mildew, gibberellic disease of corn |
CROP + PES = PES | Wheat tube thrips, corn aphids |
ORG + SYM = OSYM | Stem dry, leaves yellow |
Entity Label | Sentence | Rules | Results |
---|---|---|---|
DIS | Wheat blue dwarf disease virus belongs to virus; Stripe rust disease attacks wheat | disease [\u4e00-\u9fa5] | disease virus/n; disease attacks/empty |
DRU | 50% phoxim emulsion oil; 20% kiku·horse emulsion | \d+(?:\.\d+)?(?:%)(?:[\u4e00-\u9fa5]+·)?(?=[\u4e00-\u9fa5]) | 50%; 20% kiku· |
Model | P/% | R/% | F1-Score/% |
---|---|---|---|
Word2Vec-IDCNN-CRF | 85.49 | 87.29 | 86.38 |
Word2Vec-BiLSTM-CRF | 88.05 | 89.40 | 88.72 |
BERT-BiLSTM-CRF | 90.90 | 91.16 | 91.03 |
ALBERT-BiLSTM-CRF | 90.86 | 91.70 | 91.28 |
Model | P/% | R/% | F1-Score/% |
---|---|---|---|
ALBERT-BiLSTM-CRF | 90.86 | 91.70 | 91.28 |
DA1 + ALBERT-BiLSTM-CRF | 92.85 | 93.14 | 92.99 |
DA2 + ALBERT-BiLSTM-CRF | 91.46 | 92.31 | 91.88 |
DA1 + DA2 + ALBERT-BiLSTM-CRF | 93.88 | 95.20 | 94.54 |
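
The F1-scores in the two model comparison tables above can be reproduced from the reported precision and recall, since F1 is their harmonic mean. The short check below uses two rows copied from those tables; the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P/%, R/%) pairs taken from the tables above.
rows = {
    "ALBERT-BiLSTM-CRF": (90.86, 91.70),
    "DA1 + DA2 + ALBERT-BiLSTM-CRF": (93.88, 95.20),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```
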
ID | Entity | P/% | R/% | F1-Score/% |
---|---|---|---|---|
1 | Disease (DIS) | 95.24 | 93.33 | 94.28 |
2 | Disease class (DIS_CLA) | 88.89 | 100.00 | 94.12 |
3 | Pest (PES) | 92.77 | 95.06 | 93.90 |
4 | Pest class (PES_CLA) | 85.71 | 100.00 | 92.31 |
5 | Pest time cycle (PES_TIM) | 98.49 | 97.51 | 98.00 |
6 | Pathogeny (PAT) | 76.79 | 87.76 | 81.90 |
7 | Pathogeny class (PAT_CLA) | 87.50 | 91.30 | 89.36 |
8 | Wheat organ (ORG) | 91.32 | 93.08 | 92.19 |
9 | Drug (DRU) | 89.60 | 89.60 | 89.60 |
10 | Agricultural control (CON) | 79.01 | 82.05 | 80.50 |
11 | Wheat growth time (TIM) | 92.86 | 98.48 | 95.59 |
12 | Wheat variety (WHE) | 91.30 | 97.67 | 94.38 |
13 | Wheat area (ARE) | 87.50 | 100.00 | 93.33 |
14 | Symptom (SYM) | 85.31 | 93.13 | 89.05 |
15 | Organ symptom (OSYM) | 84.09 | 82.96 | 83.52 |
16 | Harmful crop (CROP) | 93.20 | 91.40 | 92.29 |
17 | Harmful region (REG) | 93.18 | 93.89 | 93.54 |
18 | Genus (GEN) | 100.00 | 100.00 | 100.00 |
19 | Family (FAM) | 100.00 | 100.00 | 100.00 |
20 | Other name (OTHN) | 89.47 | 85.00 | 87.18 |
21 | Enemy (ENE) | 76.47 | 89.66 | 82.54 |
ID | Entity Name | P/% | R/% | F1-Score/% |
---|---|---|---|---|
1 | Disease (DIS) | 96.71 | 93.54 | 95.10 |
3 | Pest (PES) | 94.71 | 95.06 | 94.88 |
9 | Drugs (DRU) | 92.05 | 89.60 | 90.81 |
15 | Organ symptoms (OSYM) | 86.18 | 83.86 | 85.00 |
Sentence | Before Rule Amendment | After Rule Amendment |
---|---|---|
Wheat powdery mildew is a type of … | powdery mildew | Wheat powdery mildew |
… leaves turned yellow … | leaves | leaves turned yellow |
…20% Chrysanthemum · Horse Emulsion… | Horse Emulsion | 20% Chrysanthemum · Horse Emulsion |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, D.; Zheng, G.; Liu, H.; Ma, X.; Xi, L. AWdpCNER: Automated Wdp Chinese Named Entity Recognition from Wheat Diseases and Pests Text. Agriculture 2023, 13, 1220. https://doi.org/10.3390/agriculture13061220