1. Introduction
With the introduction of more smart devices in organizations, the volume of data is increasing dramatically, placing higher demands on efficient processing and privacy protection. The increased demand for big data analytics requires sophisticated analysis of historical and real-time data while ensuring privacy and security [1]. Edge computing applications alleviate the reliance on cloud services but require enhanced privacy protection on edge devices. In AI applications, privacy protection needs to be considered at the algorithm design stage to strike a balance between data utilization and privacy protection [2]. The development of digitalization and intelligent technologies brings new challenges to data management and privacy protection. Therefore, comprehensive technical and policy tools are needed to ensure the secure management of sensitive data and to maintain system security and user privacy [3].
Named entity recognition (NER) [4] is a pivotal task in Natural Language Processing (NLP). The objective of NER is to identify entities with particular meanings within a given text. These entities may include names of individuals, locations and organizations, as well as other expressions with specific meanings, such as dates, times, currencies and so forth [5]. Currently, NER methods are divided into three main categories: rule-based and dictionary-based methods, statistical learning-based methods, and deep learning-based methods. In early NER tasks, rule-based and dictionary-based methods were commonly used. These methods rely on manually developed rules, dictionaries, orthographic features and ontologies based on entity characteristics, without the need for annotated data. Rule templates depend on the establishment of knowledge bases and dictionaries, offering a straightforward and efficient approach to managing numerous entities within a text [6]. However, rule-based and dictionary-based methods usually rely on specific languages, domains, and knowledge bases, which limits their applicability and makes them expensive to maintain [7].
The traditional Recurrent Neural Network (RNN) model is inherently suited to processing sequential data. It processes words one by one in the order in which they appear in the sentence, and therefore naturally captures word-order information without additional processing [8]. The Transformer model, on the other hand, contains no traditional RNN or CNN structure. It feeds all the words of a sentence into the network at the same time, so there is no explicit information about the relative or absolute positions of the words during processing [9]. In order for the model to understand the position of each word in the sequence, positional encoding is introduced. This technique adds an additional encoding to each word to represent its specific position in the sequence, thus enabling the model to efficiently understand the relative order of the words in the sentence.
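As a minimal illustration of this idea, the sinusoidal positional encoding of the original Transformer can be sketched as follows (a NumPy sketch for illustration only; it is not necessarily the exact formulation used in our positional coding model, and the dimension values are placeholders):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                              # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions
    return encoding

# The encoding is simply added to the word embeddings before the encoder,
# e.g., embeddings + sinusoidal_positional_encoding(128, 100).
```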
In recent years, feature-based supervised learning methods have gradually become mainstream for NER tasks. These methods usually require a large amount of labeled data to train the models, but they have achieved quite good performance in some specific tasks and domains. Prototypical networks, on the other hand, are suited to few-shot learning problems, which is especially important in named entity recognition, since we often have very little labeled data [10].
While the existing methods show their strengths, they still face many challenges. Traditional machine learning-based entity extraction methods rely heavily on feature engineering by experts, and their generalization ability is poor [11]. For few-shot methods, the accuracy and stability of the prototypes may suffer, failing to capture subtle differences between entity types. Prototypical networks also still need to address challenges such as handling long sequences and unclear entity boundaries.
Based on the above observations, we propose a few-shot prototypical network-based sensitive data recognition method (FSPN-NER) for blockchain applications. With the prototypical network, the model can be trained effectively with few-shot labeled data, while category prototypes are used to enhance the generalization ability of the model. In addition, we use a BiLSTM to integrate contextual information so as to better capture the subtle differences between entities, and a boundary detection module to address the problem of unclear entity boundaries.
In this paper, Section 2 reviews the related work that preceded our research, focusing on named entity recognition techniques based on prototypical networks.
Section 3 provides a comprehensive overview of the FSPN-NER model, including its structure, principles and advantageous features.
Section 4 fully validates the superior performance and advantages of our proposed model over existing models through comparative experiments and ablation studies, contributing new insights and approaches to the development of this research area.
The primary contributions of this paper are outlined below.
An FSPN-NER model for named entity recognition is proposed for grid-sensitive data recognition.
In this paper, a positional coding model (PCM) is used instead of BERT in the pre-training of the feature extraction module, and whole-word masking and N-gram masking are applied to improve the performance of the PCM.
In the entity matching module, prototype vectors are computed based on the prototypical network, and the feature vectors of the text are matched against the prototype vectors to obtain the probability that each word belongs to each entity class (a minimal sketch of this step follows this list).
Extensive comparative experiments show that the FSPN-NER model outperforms LSTM, BiLSTM, CRF, Transformer and their combination models.
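To make the entity matching step above concrete, the following sketch shows one standard prototypical-network formulation: class prototypes are averaged from the support-set token features, and each query token is assigned a class distribution from its negative Euclidean distances to the prototypes. This is an illustrative sketch under our own naming and distance-metric assumptions, not the exact FSPN-NER implementation.

```python
import torch
import torch.nn.functional as F

def compute_prototypes(support_feats, support_labels, num_classes):
    """Average support-set token features per class to obtain prototypes.

    support_feats:  (n_tokens, hidden_dim) encoder features
    support_labels: (n_tokens,) entity-class ids; each class is assumed
                    to occur at least once in the support set.
    """
    return torch.stack([
        support_feats[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])                                                   # (num_classes, hidden_dim)

def match_tokens(query_feats, prototypes):
    """Return a (n_query, num_classes) probability that each token
    belongs to each entity class, from negative Euclidean distances."""
    dists = torch.cdist(query_feats, prototypes)         # (n_query, num_classes)
    return F.softmax(-dists, dim=-1)
```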
2. Related Work
Blockchain technology, through its decentralized, encrypted and tamper-proof nature, can help ensure the secure storage and transmission of sensitive data. In particular, in supply chain blockchains, using prototypical networks to identify and categorize sensitive products or transactions can ensure product compliance and security and detect problems in a timely manner. Analyzing user behavior on the blockchain and identifying abnormal behavioral patterns by learning from fewer samples can enhance the detection of fraud and other malicious activities. By storing hashes of sensitive data on the blockchain, data tampering and leakage can be prevented. Blockchain can also be used to establish and manage smart contracts for data access control, ensuring that only authorized users can access specific sensitive data, thus improving data security. Currently, identification techniques for sensitive data are mainly realized through named entity recognition.
Named entity recognition and classification (NERC) is a crucial task in natural language processing (NLP) for extracting information units, such as persons, dates and locations, from unstructured text [12]. Over the years, NER has evolved from rule-based methods to models that leverage deep learning techniques and pre-trained language models.
Early NER tasks depended on lexical features extracted from extensive datasets and named entity libraries [13]. Rule-based approaches made use of pre-defined patterns that were constructed using features such as keywords, syntactic-lexical patterns and statistical data to match strings. Kim [14] suggested the application of Brill’s rule-based inference approach for speech input, wherein rules are automatically generated based on Brill’s part-of-speech tagger. Quimbaya et al. [15] introduced a lexicon-driven methodology for named entity recognition within electronic health records. However, these methods were labor-intensive, costly and rigid.
To address the limitations of rule-based methods, researchers proposed treating NER as a sequence labeling challenge, which led to the development of various models, such as Hidden Markov Models (HMMs) [16], maximum-entropy Markov models (MEMMs) [17], support vector machines (SVMs) [18], and conditional random fields (CRFs) [19]. These models relied on manually crafted discrete features for sequence labeling. Stapelberg and Keara-Linn [20] introduced an automatic English speech recognition system combining the Hidden Markov Model and the conditional random field algorithm, aiming to improve the accuracy and stability of recognition and discussing its application in the computing field. Krishnan and Manning [21] introduced a dual-stage methodology utilizing two interconnected conditional random field (CRF) classifiers, in which the secondary CRF incorporates latent representations obtained from the output of the primary CRF.
The latest progress in deep learning has resulted in the utilization of neural networks for NER tasks. These models autonomously acquire characteristics from the input data, lessening the dependence on manually designed features. The use of unidirectional LSTMs, convolutional neural networks, bidirectional LSTMs with CRFs, and character-level CNNs has been explored in the NER literature. The rise of pre-trained language models such as BERT and ELMo has significantly enhanced NER performance through advanced feature extraction capabilities. Luo et al. [22] introduced an attention-based bidirectional Long Short-Term Memory with a conditional random field layer (Att-BiLSTM-CRF) neural network approach for document-level chemical named entity recognition. Liu et al. [23] presented a novel approach using BERT, BiLSTM and CRF for named entity recognition in citrus pests and diseases, aiming to extract specific entities from unstructured text data to facilitate the construction of a knowledge map for accurate prevention and control methods in agriculture. Yuan et al. [24] suggested incorporating adversarial training into the model training process as a regularization technique to reduce the impact of noise on the model. In addition, they introduced self-attention into the BiLSTM-CRF model in order to capture significant features that affect entity classification and enhance the accuracy of entity classification. Liang et al. [4] proposed a Chinese named entity recognition (NER) method and a Relation Extraction (RE) method in the field of Chinese literature. Their approach is built upon the self-attention mechanism, the BiLSTM neural network and the CRF model. Specifically, they introduced a BiLSTM-Self-Attention-CRF framework for NER and a BiLSTM-Multilevel-Attention framework for RE.
Recently, much research has been devoted to combining prototypical networks with named entity recognition. The advantages of prototypical networks are their adaptability to few-shot learning and their strong generalization ability, properties that are useful for NER tasks characterized by scarce labeled data and strong entity context dependency. Ji et al. [25] introduced an entity-level prototypical network (EP-Net) enhanced with dispersedly distributed prototypes, where text spans are treated as candidate entities, eliminating the need for labeling dependencies. Kumar et al. [26] introduced ProtoNER, an end-to-end KVP extraction model based on prototypical networks. This model enables the addition of new classes to a pre-existing model with minimal newly annotated training samples. Huang et al. [27] introduced COPNER, which uses a unique prompt consisting of class-specific words to provide supervision signals, enabling contrastive learning to optimize token representations and metric referents for distance-metric inference on test samples.
Overall, NER research has made great progress in recent years, moving from lexical and rule-based approaches to more complex models utilizing deep learning techniques and pre-trained language models. Sensitive data recognition currently suffers from problems of data scarcity and diversity. In this paper, prototypical networks are applied to named entity recognition, which not only effectively addresses the problems of data scarcity and diversity but also combines the advantages of deep learning to provide fast and accurate recognition of named entities in various textual data. Therefore, this paper proposes a prototypical network-based named entity recognition method combined with deep learning for sensitive data recognition.
4. Experimentation
In this section, we first describe the dataset, then the evaluation metrics, including precision (P), recall (R) and the F1 value, and finally the comparison and ablation experiments.
4.1. Data Sets
We collected grid traffic data, which include common information such as names, times and addresses. From these data we created a few-shot dataset of 48,821 sentences. We partitioned the dataset according to specific criteria, setting aside the last 10% of the data for testing purposes. The remaining data were then split into training and validation sets at a ratio of 9:1. The specific division results are shown in Table 1.
For the ablation study, we used the ChFinAnn dataset and the AdminPunish dataset. In the ChFinAnn dataset, we appropriately combined entities according to their actual meanings in order to simplify the problem to a certain extent. For instance, the terms “HighestTradingPrice” and “LowestTradingPrice” both denote prices; therefore, we consolidated them into a unified entity called “Price”. This integration strategy resulted in a total of 10 categories of entities for the ChFinAnn dataset. The seven entities in the AdminPunish dataset remained unchanged.
Table 2 presents the names of the entities and their respective proportions in the associated datasets.
4.2. Data Process
Traditional preprocessing approaches usually involve mapping the character IDs in a sentence to a multi-dimensional space and then processing these multi-dimensional data using a word2vec model. To enable end-to-end model building, a new approach is to use an embedding layer instead of a word2vec model for the transformation of multi-dimensional features. In this approach, the preprocessing stage simply converts the characters in the sentence to IDs and converts the entities to the corresponding category IDs, thus providing more efficient and accurate input data for the subsequent model. This approach not only simplifies the whole processing flow, but also improves the performance and efficiency of the model.
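A minimal sketch of this character-to-ID preprocessing followed by a trainable embedding layer is shown below (the vocabularies and dimensions are illustrative placeholders, assuming they are built from the training corpus):

```python
import torch
import torch.nn as nn

# Hypothetical vocabularies built from the training corpus.
char2id = {"<pad>": 0, "<unk>": 1}      # character -> id
label2id = {"O": 0}                      # entity tag -> id

def encode_sentence(sentence, tags):
    """Map characters and entity tags to integer ids."""
    char_ids = [char2id.get(ch, char2id["<unk>"]) for ch in sentence]
    tag_ids = [label2id[t] for t in tags]
    return torch.tensor(char_ids), torch.tensor(tag_ids)

# The embedding layer replaces word2vec and is trained end to end
# together with the rest of the model.
embedding = nn.Embedding(num_embeddings=len(char2id), embedding_dim=100,
                         padding_idx=char2id["<pad>"])
# features = embedding(char_ids)          # (seq_len, 100)
```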
4.3. Evaluation Indicators
In named entity recognition tasks, the evaluation criteria usually include precision, recall and the F1 value. The F1 score is a widely used metric for assessing the effectiveness of classification models and can be calculated for each category (e.g., address, book, company, etc.). Precision is the proportion of predicted positive samples that are actually positive, while recall is the proportion of actual positive samples that are correctly predicted. The F1 score is the harmonic mean of precision and recall, providing a more comprehensive evaluation of the model’s performance by incorporating both metrics.
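For reference, with TP, FP and FN denoting the numbers of true positives, false positives and false negatives at the entity level, the three metrics are computed as:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```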
4.4. Experimental Comparison
We replaced the feed-forward sublayer with a BiLSTM to optimize the Transformer structure. After conducting thorough comparative experiments, the findings indicate that the enhanced Transformer–BiLSTM–CRF model outperforms the LSTM, BiLSTM, CRF, Transformer and their individual combination models [28]. Based on the data in Table 3, we trained nine different models and obtained their precision, recall and F1 values for detailed comparison and analysis.
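A rough sketch of such a modified encoder layer, in which the position-wise feed-forward sublayer is replaced by a BiLSTM, is given below (an illustrative PyTorch-style reconstruction under our own assumptions about layer sizes, not the exact implementation of [28]):

```python
import torch.nn as nn

class TransformerBiLSTMLayer(nn.Module):
    """Encoder layer whose feed-forward sublayer is replaced by a BiLSTM."""
    def __init__(self, d_model=100, n_heads=4, hidden=128, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.bilstm = nn.LSTM(d_model, hidden, batch_first=True,
                              bidirectional=True)
        self.proj = nn.Linear(2 * hidden, d_model)    # project back to model width
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)                   # residual + layer norm
        lstm_out, _ = self.bilstm(x)
        x = self.norm2(x + self.proj(lstm_out))        # BiLSTM in place of the FFN
        return x
```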
Compared with simple neural networks, LSTM has obvious advantages in processing sequence data. LSTM is a type of recurrent neural network architecture designed to manage the flow of information using input, forget and output gates, enabling it to effectively capture long-term dependencies within sequential data [29]. Compared with traditional feed-forward deep neural networks, LSTM can better capture long-term features in sequence data and is suitable for tasks that require memorizing long-distance information.
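In standard notation, the three gates and the cell update of an LSTM at time step t can be written as:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```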
For tasks involving more intricate sequence data, a BiLSTM is better suited than a unidirectional LSTM. A BiLSTM combines forward and backward information flows around the current time step, so it can better capture the features and patterns in time-sequential data, improving the modeling of sequence data and making it suitable for tasks that need to consider contextual information [30].
Furthermore, the Transformer is a neural network structure that utilizes a self-attention mechanism, stacked with layers of feed-forward networks. It is adept at dealing with dependencies between positions in an input sequence, is better able to learn long-range dependencies, and is good at extracting underlying features from data embedded in a high-dimensional space. However, the Transformer does not perform well in tasks that require complex combinations of features, because its self-attention mechanism attends over the whole sequence and does not explicitly model local sequential patterns, so it cannot directly accomplish some tasks that require complex feature combinations.
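The self-attention operation referred to here is the standard scaled dot-product attention of the Transformer, where Q, K and V are the query, key and value matrices and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```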
Since traditional machine learning heavily depends on hand-crafted features and feature selection, CRF by itself does not show good performance. However, with the introduction of BiLSTM for feature extraction, CRF can improve its results by effectively combining features through its transition matrix. Superior results can be achieved by more intricate neural network structures. The newly proposed model integrates the strengths of multiple models and surpasses the performance of the other models under comparison. The introduction of the Transformer to extract the underlying features is also an important improvement. The improved Transformer uses a BiLSTM instead of the traditional feed-forward network and performs much better in extracting the underlying temporal features. The BiLSTM assists in more effectively extracting and integrating features. Ultimately, the probability distribution of the output is refined by training the transition matrix with the CRF.
On the test few-shot dataset, the new model achieves a precision of 84.5%, a recall of 85.5%, and an F1 value of 0.85, outperforming the other compared models. The results indicate that the proposed model has made substantial progress on this task, showcasing its effectiveness and superiority.
4.5. Ablation Experiment
4.5.1. Impact of the Size of the Training Sample
Additionally, the performance of the FSPN-NER model was evaluated with varying training sample sizes. As illustrated in Table 4 and Table 5, the F1 scores of the FSPN-NER models trained with distinct sample sizes exhibit notable discrepancies on the test samples. The F1 scores of the FSPN-NER model improve gradually as the training sample size increases, and in comparison with the BiLSTM-CRF model, the FSPN-NER model performs more robustly, further substantiating its superiority.
Figure 2 presents the F1 scores obtained from the ChFinAnn and AdminPunish datasets, respectively. A comparison of the two datasets reveals that the F1 scores from the AdminPunish dataset are generally higher than those from the ChFinAnn dataset. This is mainly due to the entity sparsity phenomenon observed in the ChFinAnn dataset.
4.5.2. Impact of Embedded Dimensions
In our study, we examined the effect of word vector dimensionality on model performance. We utilized four distinct dimensions of pre-trained word vectors: 50, 100, 200 and 300. According to the data in Table 6, when the word vector dimension is 100, precision and recall are higher, and the F1-score reaches its highest value of 73.50. Hence, we set the dimension of the pre-trained word vectors to 100 in order to achieve optimal model performance. However, when the dimensionality of the word vectors is increased to 200 or 300, the F1-score starts to decrease, which indicates that a larger word vector dimension is not necessarily better.
According to Figure 3, this phenomenon can be explained by the fact that although increasing the dimensionality of the embedded word vectors may capture more information and features, it also adds to the complexity of the model and the use of computational resources, and may even lead to overfitting. Furthermore, increasing the dimensionality of embedded word vectors also increases the training and inference time of the model, which is undesirable, especially when resources are limited. Therefore, in practical applications, we need to weigh the impact of word vector dimensionality on model performance and avoid blindly pursuing higher dimensionality while neglecting the balance between performance and efficiency.