1. Introduction
As computer technology advances and material conditions improve, artificial intelligence (AI) technologies are evolving rapidly, and Natural Language Processing (NLP) is progressing alongside them. NLP encompasses all processes that use electronic devices to process natural language; its purpose is to enable computers to correctly perceive, process, and apply human language input, thereby supporting many complex functionalities. The NLP technical framework can be divided into three levels: small-scale, covering word-level techniques; medium-scale, covering syntactic-level techniques; and large-scale, covering discourse-level techniques. Named Entity Recognition (NER) is a small-scale branch of NLP, operating at the word level. Its main function is to identify and extract entity names from sentences or articles, forming the foundation for applications such as knowledge graphs, data mining, question-answering systems, and machine translation. Chinese NER tasks involve extracting the required entities from Chinese texts, and different recognition tasks focus on different entity types. For example, course entity recognition identifies entities such as course names, teacher names, and knowledge points, while news entity recognition identifies entities such as person names and place names.
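Concretely, NER is usually cast as sequence labeling over characters or words with a scheme such as BIO (Begin/Inside/Outside). The minimal sketch below shows how entities are recovered from BIO tags; the sentence and the tag name THM (theorem) are illustrative assumptions, not taken from a specific dataset:

```python
def extract_entities(chars, tags):
    """Collect (text, type) spans from a BIO-tagged character sequence."""
    entities, buf, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):                 # a new entity begins
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [ch], tag[2:]
        elif tag.startswith("I-") and buf and tag[2:] == etype:
            buf.append(ch)                       # continue the current entity
        else:                                    # "O" or an inconsistent tag
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:
        entities.append(("".join(buf), etype))
    return entities

# "由勾股定理可得" -- "from the Pythagorean theorem we obtain"
chars = list("由勾股定理可得")
tags = ["O", "B-THM", "I-THM", "I-THM", "I-THM", "O", "O"]
print(extract_entities(chars, tags))  # [('勾股定理', 'THM')]
```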
Mathematical knowledge entity recognition focuses on identifying concepts such as angles, lines, and planes; terms such as “right angle”, “sequence”, and “set”; and theorems and methods such as the “Pythagorean theorem” and “completing the square”. These recognized entities can be used to construct knowledge graphs that support tasks such as knowledge-based question answering and MOOC course recommendation. Taking course recommendation as an example, knowledge point entities are extracted from course descriptions, videos, and other materials to build the course’s knowledge graph. The knowledge a user needs is then identified from their learning activities (such as courses viewed, quizzes, and classroom discussions), and personalized recommendations are made from the knowledge graph, improving the user experience and learning outcomes. The accuracy of entity recognition therefore directly affects the effectiveness of these downstream tasks, making improved recognition accuracy a key research focus.
To address this, the latest pre-trained language model, LERT (Linguistically Motivated Bidirectional Encoder Representation from Transformers), is used to obtain semantically rich word vectors. By combining LERT’s powerful language representation capabilities with Bidirectional Gated Recurrent Units (BiGRUs), Iterated Dilated Convolutional Neural Networks (IDCNNs), and Conditional Random Fields (CRFs), the model’s ability to capture global contextual information is enhanced, thereby improving the accuracy of entity recognition.
2. Materials and Methods
Over time, Named Entity Recognition (NER) technology has evolved from dictionary-based rule techniques to traditional machine learning and deep learning methods.
Early methods relied primarily on rules and dictionaries; researchers such as Kim J. H., Riaz K., and Xiaoheng Zhang applied this technique to their respective tasks. However, rule-based recognition depends on hand-crafted rules, and dictionaries are rarely rich enough, leading to ambiguity between words. Constructing rules is complex and requires deep linguistic knowledge, and because different languages have different grammatical structures, language-specific rules must be developed. These rules often conflict with one another and must be managed carefully. The workload for researchers therefore grew significantly, as old word sets and rules had to be continually revised, and these methods were eventually replaced by machine learning techniques.
Traditional machine learning methods for NER mainly include the Hidden Markov Model (HMM), the Maximum Entropy Markov Model (MEMM), and the Conditional Random Field (CRF). These approaches are based on statistical probability and treat NER as a sequence labeling task: they require large corpora for training, from which the model learns to label the input. Zhao [1] applied an HMM to biomedical text recognition, achieving an accuracy of 62.98% with a word-similarity-based smoothing method. Wang and others applied an MEMM to address extraction, yielding significant improvements in both precision and recall. Lafferty et al. [2] proposed the CRF, a discriminative classifier that models decision boundaries between classes and can be used for classification after training. Chen [3] used a CRF for Chinese NER, achieving a score of 85.25 on the MSRA dataset. Khabsa M. [4] applied a CRF to chemical entity recognition and obtained an F1 score of 83.3%. However, these methods depend heavily on the corpus, requiring careful data selection and processing and the construction of effective features. The choice of features directly affects model performance, and this process demands considerable human effort and time. These methods also tend to converge slowly and require lengthy training.
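The decoding step shared by these sequence models can be made concrete with a toy Viterbi search, which finds the tag path maximizing the sum of emission and transition scores. The tag set and all scores below are invented for illustration; a real CRF learns them from a corpus:

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path under emission + transition scores."""
    delta = [{t: emissions[0][t] for t in tags}]   # best score ending in each tag
    back = []                                      # backpointers per step
    for em in emissions[1:]:
        scores, ptr = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: delta[-1][p] + transitions[(p, cur)])
            scores[cur] = delta[-1][prev] + transitions[(prev, cur)] + em[cur]
            ptr[cur] = prev
        delta.append(scores)
        back.append(ptr)
    last = max(tags, key=lambda t: delta[-1][t])
    path = [last]
    for ptr in reversed(back):                     # follow backpointers
        path.append(ptr[path[-1]])
    return path[::-1]

tags = ["O", "B", "I"]
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I")] = -10.0   # an I tag should not follow O
transitions[("B", "I")] = 1.0     # I naturally continues B
emissions = [
    {"O": 0.1, "B": 1.0, "I": 0.2},
    {"O": 0.4, "B": 0.1, "I": 0.5},
    {"O": 1.0, "B": 0.0, "I": 0.1},
]
print(viterbi(emissions, transitions, tags))  # ['B', 'I', 'O']
```

The transition scores are what let a CRF forbid invalid label sequences (such as an I tag directly after O), which per-token classifiers cannot express.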
With the continuous development of machine learning, a wide variety of models and algorithms have been introduced, and deep learning methods based on neural networks have gradually become dominant in NER tasks. Recurrent Neural Networks (RNNs) [5] are effective at sequence modeling, but they weight recent inputs more heavily and suffer from vanishing or exploding gradients on long sequences. To address this, Hochreiter et al. [6] proposed Long Short-Term Memory (LSTM), which selectively retains long-range sequence information through a gating mechanism (input, output, and forget gates), mitigating the problems of RNNs. Zeng D et al. [7] combined LSTM with a CRF for drug entity recognition. Building on LSTM, the Gated Recurrent Unit (GRU) keeps only two gates (an update gate and a reset gate), reducing the parameter count, lowering training cost, and reducing the risk of overfitting. Combining forward and backward LSTMs or GRUs yields Bidirectional LSTM (BiLSTM) and Bidirectional GRUs (BiGRUs), which capture both preceding and following context and thereby improve NER performance. Wu et al. [8] applied a BiLSTM-CRF with attention to Chinese electronic medical record NER. Quinta et al. [9] optimized the BiLSTM-CRF for Portuguese corpora, achieving high F1 results. Qiu Qinjun et al. [10] proposed an attention-based BiLSTM-CRF network that achieved an F1 score of 91.47% on geological NER. Convolutional Neural Networks (CNNs) are more commonly used for image modeling than RNNs. In text processing, a convolution captures only a local window of the input, and stacking more CNN layers to widen the receptive field sharply increases the number of parameters, raising training costs and inviting overfitting. Emma Strubell et al. [11] proposed the Iterated Dilated Convolutional Neural Network (IDCNN), based on the Dilated CNN (DCNN). As DCNN depth increases, the effective input width expands exponentially and quickly covers the entire input sequence, capturing rich local information that BiLSTM and BiGRU may overlook, while the parameter count grows only linearly with depth, avoiding parameter explosion. The IDCNN applies the same dilated convolution block iteratively without adding extra parameters, effectively mitigating the overfitting caused by simply increasing depth. Yu Bihui et al. [12] achieved an F1 score of over 94% in entity recognition using an IDCNN. Although these methods have achieved success in NLP, when processing phrases or sentences they often overlook the semantic relationships between words and their contexts, especially in Chinese, where the same word can have different meanings in different contexts (polysemy). This limits the models' ability to recognize entities accurately.
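The receptive-field claim can be checked with simple arithmetic: with kernel size 3 and stride 1, each layer of dilation d extends the receptive field by 2·d, so doubling dilations (1, 2, 4, 8, ...) gives near-exponential coverage while depth, and hence parameter count, grows only linearly. The layer settings below are illustrative, not the configuration used in the cited work:

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field of stacked 1-D dilated convolutions (stride 1)."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k-1) * dilation positions
    return rf

# Four dilated layers with doubling dilation vs. four ordinary layers (dilation 1):
print(receptive_field([1, 2, 4, 8]))  # 31 characters of context
print(receptive_field([1, 1, 1, 1]))  # only 9
```

Both stacks have the same number of parameters (four kernels of width 3), which is why dilation widens context without the parameter explosion of simply deepening a plain CNN.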
The BERT (Bidirectional Encoder Representation from Transformers) [13] model, introduced by Google AI in 2018, is a bidirectional encoder based on the Transformer architecture. BERT substantially strengthens the relational features between characters, words, and sentences, capturing information across different contexts, and the word vectors it generates have much stronger semantic representation capabilities. During training, BERT requires no manual feature engineering, and different functionalities can be implemented without major changes to the code framework, greatly reducing training costs. As a pre-trained language representation model, BERT's pre-training phase includes two tasks, a Masked Language Model (MLM) and Next-Sentence Prediction (NSP), which strengthen its learning of word- and sentence-level relationships. Gao et al. [14] applied BERT to sentiment analysis and achieved the best results compared with traditional models. Wu Jun et al. combined BERT embeddings with a BiLSTM and CRF for Chinese terminology extraction, clearly outperforming shallow machine learning models. Zhang Yi et al. combined BERT with a BiLSTM-IDCNN-CRF, achieving an F1 score of 93.91% for elementary mathematical entity recognition. Yang Chonglong et al. [15] used BERT to build word vectors, followed by a BiLSTM and an improved IDCNN with CRF decoding, producing an excellent COVID-19 entity recognition model. However, BERT has limitations: the tasks used during pre-training do not appear in downstream tasks, and this mismatch between pre-training and fine-tuning can hurt BERT's performance on downstream NLP tasks.
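The MLM corruption step can be sketched in a few lines. Following BERT's commonly reported 80/10/10 rule, a selected token becomes [MASK] 80% of the time, a random vocabulary token 10% of the time, and stays unchanged 10% of the time; the sample sentence and tiny vocabulary here are illustrative assumptions:

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels hold the original token
    at selected positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:              # select ~15% of positions
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)                   # not selected: nothing to predict
            masked.append(tok)
    return masked, labels

tokens = list("数学知识实体识别")  # "mathematical knowledge entity recognition"
masked, labels = mlm_mask(tokens, vocab=list("的一是在不了有和"))
print(masked, labels)
```

The 10% "keep unchanged" case exists precisely because [MASK] never appears downstream; it forces the model to build representations for real tokens too, partially easing the pre-training/fine-tuning mismatch noted above.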
Building on BERT, many other pre-trained language models have emerged. Kevin Clark et al. [16] observed that BERT's pre-training was relatively inefficient and proposed the ELECTRA language model, which modifies the MLM strategy by replacing original tokens with generator-produced tokens instead of masking them, increasing training speed and improving accuracy on downstream tasks. MacBERT [17] replaced the original MLM task with a corrected MLM task, in which words are replaced by similar-meaning words, and changed the NSP task to Sentence Order Prediction (SOP) to narrow the gap between pre-training and downstream tasks. PERT [18] uses the PerLM method, which selects words via Whole-Word Masking (WWM) and N-gram masking and shuffles the order of characters and words in sentences; the model's goal is to restore the original word order. Recently, Cui Yiming et al. [19] proposed the LERT pre-trained language model, which injects linguistic knowledge during pre-training. Specifically, it uses the LTP language analysis toolkit to generate three linguistic features, Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and Dependency Parsing (DEP), and combines them with the MLM task for multi-task pre-training. With these linguistic features, LERT has stronger language representation capabilities, aligns more closely with downstream tasks, and provides strong support for NLP tasks.
In this experiment, LERT is used to generate word vectors rich in semantic information; the BiGRU and IDCNN capture long-range dependencies and global contextual information; and the CRF decodes the entity labels by exploiting the dependencies between labels. Together, these four components form a mathematical knowledge entity recognition model that trains quickly and achieves high accuracy.
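As an overview of how data flows through this stack, the sketch below traces tensor shapes stage by stage. The hidden sizes are assumptions for a base-sized model (a base LERT encoder outputs 768-dimensional vectors); they are not the paper's actual hyperparameters:

```python
def pipeline_shapes(batch, seq_len, lert_dim=768, gru_hidden=128,
                    idcnn_filters=128, num_tags=9):
    """Trace tensor shapes through a LERT-BiGRU-IDCNN-CRF stack
    (all layer sizes here are illustrative assumptions)."""
    return {
        "lert_embeddings": (batch, seq_len, lert_dim),    # contextual character vectors
        "bigru_out": (batch, seq_len, 2 * gru_hidden),    # forward + backward states
        "idcnn_out": (batch, seq_len, idcnn_filters),     # dilated convolution features
        "emission_scores": (batch, seq_len, num_tags),    # per-tag scores fed to the CRF
        "crf_best_path": (batch, seq_len),                # decoded label ids per character
    }

for stage, shape in pipeline_shapes(batch=32, seq_len=128).items():
    print(f"{stage:16s} {shape}")
```

Note that every stage preserves the sequence length, so each input character receives exactly one label, matching the BIO sequence labeling formulation.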