1. Introduction
With the introduction of more smart devices in organizations, the volume of data is increasing dramatically, placing higher demands on efficient processing and privacy protection. The increased demand for big data analytics requires sophisticated analysis of historical and real-time data while ensuring privacy and security [1]. Edge computing applications alleviate the reliance on cloud services but require enhanced privacy protection on edge devices. In AI applications, privacy protection needs to be considered at the algorithm design stage to strike a balance between data utilization and privacy protection [2]. The development of digitalization and intelligent technologies brings new challenges to data management and privacy protection. Therefore, comprehensive technical and policy tools are needed to ensure the secure management of sensitive data and to maintain system security and user privacy [3].
Named entity recognition (NER) [4] is a pivotal task in Natural Language Processing (NLP). The objective of NER is to identify entities with particular meanings within a given text. These entities may include names of individuals, locations and organizations, as well as other expressions with specific meanings, such as dates, times, currencies and so forth [5]. Currently, NER methods are divided into three main categories: rule-based and dictionary-based methods, statistical learning-based methods, and deep learning-based methods. In early NER tasks, rule-based and dictionary-based methods were commonly used. These methods rely on manually developed rules, dictionaries, orthographic features and ontologies based on entity characteristics, without the need for annotated data. Rule templates depend on the establishment of knowledge bases and dictionaries, offering a straightforward and efficient approach to managing numerous entities within a text [6]. However, rule-based and dictionary-based methods usually rely on specific languages, domains, and knowledge bases, which limits their applicability and makes them expensive to maintain [7].
The traditional Recurrent Neural Network (RNN) model is inherently suited to processing sequential data. It processes words one by one in the order in which they appear in the sentence, and therefore naturally captures word-order information without additional processing [8]. The Transformer model, on the other hand, contains no traditional RNN or CNN structure. It feeds all the words of a sentence into the network at the same time, so there is no explicit information about the relative or absolute positions of the words during processing [9]. In order for the model to understand the position of each word in the sequence, positional encoding is introduced. This technique adds an additional encoding to each word to represent its specific position in the sequence, thus enabling the model to efficiently understand the relative order of the words in the sentence.
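As a minimal illustration of this idea, the sinusoidal positional encoding of the original Transformer can be sketched as follows (a NumPy sketch for illustration only; it is not necessarily the exact formulation used in our positional coding model, and the dimension values are placeholders):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                              # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions
    return encoding

# The encoding is simply added to the word embeddings before the encoder,
# e.g., embeddings + sinusoidal_positional_encoding(128, 100).
```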
In recent years, feature-based supervised learning methods have gradually become mainstream for NER tasks. These methods usually require a large amount of labeled data to train the models, but they have achieved quite good performance in some specific tasks and domains. Prototypical networks, on the other hand, are suited to few-shot learning problems, which is especially important in named entity recognition, since we often have very little labeled data [10].
While the existing methods show their strengths, they still face many challenges. Traditional machine learning-based entity extraction methods rely heavily on feature engineering by experts, and their generalization ability is poor [11]. For few-shot methods, the accuracy and stability of the prototypes may suffer, failing to capture subtle differences between entity types. Prototypical networks also still need to address challenges such as handling long sequences and unclear entity boundaries.
Based on the above observations, we propose a few-shot prototypical network-based sensitive data recognition method (FSPN-NER) for blockchain applications. With the prototypical network, the model can be trained effectively with few-shot labeled data, while category prototypes are used to enhance the generalization ability of the model. In addition, we use a BiLSTM to integrate contextual information so as to better capture the subtle differences between entities, and a boundary detection module to address the problem of unclear entity boundaries.
In this paper, Section 2 reviews the related work that preceded our research, focusing on named entity recognition techniques based on prototypical networks.
Section 3 provides a comprehensive overview of the FSPN-NER model, including its structure, principles and advantageous features.
Section 4 fully validates the superior performance and advantages of our proposed model over existing models through comparative experiments and ablation studies, contributing new insights and approaches to the development of this research area.
The primary contributions of this paper are outlined below.
An FSPN-NER model for named entity recognition is proposed for grid-sensitive data recognition.
In this paper, a positional coding model (PCM) is used instead of BERT in the pre-training of the feature extraction module, and whole-word masking and N-gram masking are applied to improve the performance of the PCM.
In the entity matching module, prototype vectors are computed based on the prototypical network, and the feature vectors of the text are matched against the prototype vectors to obtain the probability that each word belongs to each entity class (a minimal sketch of this step follows this list).
Extensive comparative experiments show that the FSPN-NER model outperforms LSTM, BiLSTM, CRF, Transformer and their combination models.
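To make the entity matching step above concrete, the following sketch shows one standard prototypical-network formulation: class prototypes are averaged from the support-set token features, and each query token is assigned a class distribution from its negative Euclidean distances to the prototypes. This is an illustrative sketch under our own naming and distance-metric assumptions, not the exact FSPN-NER implementation.

```python
import torch
import torch.nn.functional as F

def compute_prototypes(support_feats, support_labels, num_classes):
    """Average support-set token features per class to obtain prototypes.

    support_feats:  (n_tokens, hidden_dim) encoder features
    support_labels: (n_tokens,) entity-class ids; each class is assumed
                    to occur at least once in the support set.
    """
    return torch.stack([
        support_feats[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])                                                   # (num_classes, hidden_dim)

def match_tokens(query_feats, prototypes):
    """Return a (n_query, num_classes) probability that each token
    belongs to each entity class, from negative Euclidean distances."""
    dists = torch.cdist(query_feats, prototypes)         # (n_query, num_classes)
    return F.softmax(-dists, dim=-1)
```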
2. Related Work
Blockchain technology, through its decentralized, encrypted and tamper-proof nature, can help ensure the secure storage and transmission of sensitive data. In particular, in supply chain blockchains, using prototypical networks to identify and categorize sensitive products or transactions can ensure product compliance and security and detect problems in a timely manner. Analyzing user behavior on the blockchain and identifying abnormal behavioral patterns by learning from fewer samples can enhance the detection of fraud and other malicious activities. By storing hashes of sensitive data on the blockchain, data tampering and leakage can be prevented. Blockchain can also be used to establish and manage smart contracts for data access control, ensuring that only authorized users can access specific sensitive data, thus improving data security. Currently, identification techniques for sensitive data are mainly realized through named entity recognition.
Named entity recognition and classification (NERC) is a crucial task in natural language processing (NLP) for extracting information units, such as persons, dates and locations, from unstructured text [12]. Over the years, NER has evolved from rule-based methods to models that leverage deep learning techniques and pre-trained language models.
Early NER tasks depended on lexical features extracted from extensive datasets and named entity libraries [13]. Rule-based approaches made use of pre-defined patterns that were constructed using features such as keywords, syntactic-lexical patterns and statistical data to match strings. Kim [14] suggested the application of Brill’s rule-based inference approach for speech input, wherein rules are automatically generated based on Brill’s part-of-speech tagger. Quimbaya et al. [15] introduced a lexicon-driven methodology for named entity recognition within electronic health records. However, these methods were labor-intensive, costly and rigid.
To address the limitations of rule-based methods, researchers proposed treating NER as a sequence labeling challenge, which led to the development of various models, such as Hidden Markov Models (HMMs) [16], maximum-entropy Markov models (MEMMs) [17], support vector machines (SVMs) [18], and conditional random fields (CRFs) [19]. These models relied on manually crafted discrete features for sequence labeling. Stapelberg and Keara-Linn [20] introduced an automatic English speech recognition system combining the Hidden Markov Model and the conditional random field algorithm, aiming to improve the accuracy and stability of recognition and discussing its application in the computing field. Krishnan and Manning [21] introduced a dual-stage methodology utilizing two interconnected conditional random field (CRF) classifiers, in which the secondary CRF incorporates latent representations obtained from the output of the primary CRF.
The latest progress in deep learning has resulted in the utilization of neural networks for NER tasks. These models autonomously acquire characteristics from the input data, lessening the dependence on manually designed features. The use of unidirectional LSTMs, convolutional neural networks, bidirectional LSTMs with CRFs, and character-level CNNs has been explored in the NER literature. The rise of pre-trained language models such as BERT and ELMo has significantly enhanced NER performance through advanced feature extraction capabilities. Luo et al. [22] introduced an attention-based bidirectional Long Short-Term Memory with a conditional random field layer (Att-BiLSTM-CRF) neural network approach for document-level chemical named entity recognition. Liu et al. [23] presented a novel approach using BERT, BiLSTM and CRF for named entity recognition in citrus pests and diseases, aiming to extract specific entities from unstructured text data to facilitate the construction of a knowledge map for accurate prevention and control methods in agriculture. Yuan et al. [24] suggested incorporating adversarial training into the model training process as a regularization technique to reduce the impact of noise on the model. In addition, they introduced self-attention into the BiLSTM-CRF model in order to capture significant features that affect entity classification and enhance the accuracy of entity classification. Liang et al. [4] proposed a Chinese named entity recognition (NER) method and a Relation Extraction (RE) method in the field of Chinese literature. Their approach is built upon the self-attention mechanism, the BiLSTM neural network and the CRF model. Specifically, they introduced a BiLSTM-Self-Attention-CRF framework for NER and a BiLSTM-Multilevel-Attention framework for RE.
Recently, much research has been devoted to combining prototypical networks with named entity recognition. The advantages of prototypical networks are their adaptability to few-shot learning and their strong generalization ability, properties that are useful for NER tasks characterized by scarce labeled data and strong entity context dependency. Ji et al. [25] introduced an entity-level prototypical network (EP-Net) enhanced with dispersedly distributed prototypes, where text spans are treated as candidate entities, eliminating the need for labeling dependencies. Kumar et al. [26] introduced ProtoNER, an end-to-end KVP extraction model based on prototypical networks. This model enables the addition of new classes to a pre-existing model with minimal newly annotated training samples. Huang et al. [27] introduced COPNER, which uses a unique prompt consisting of class-specific words to provide supervision signals, enabling contrastive learning to optimize token representations and metric referents for distance-metric inference on test samples.
Overall, NER research has made great progress in recent years, moving from lexical and rule-based approaches to more complex models utilizing deep learning techniques and pre-trained language models. Sensitive data recognition currently suffers from problems of data scarcity and diversity. In this paper, prototypical networks are applied to named entity recognition, which not only effectively addresses the problems of data scarcity and diversity but also combines the advantages of deep learning to provide fast and accurate recognition of named entities in various textual data. Therefore, this paper proposes a prototypical network-based named entity recognition method combined with deep learning for sensitive data recognition.
4. Experimentation
In this section, we first describe the dataset, then the evaluation metrics, including precision (P), recall (R) and the F1 value, and finally the comparison and ablation experiments.
4.1. Data Sets
We collected grid traffic data, which include common information such as names, times and addresses. From these data we created a few-shot dataset of 48,821 sentences. We partitioned the dataset according to specific criteria, setting aside the last 10% of the data for testing purposes. The remaining data were then split into training and validation sets at a ratio of 9:1. The specific division results are shown in Table 1.
For the ablation study, we used the ChFinAnn dataset and the AdminPunish dataset. In the ChFinAnn dataset, we appropriately combined entities according to their actual meanings in order to simplify the problem to a certain extent. For instance, the terms “HighestTradingPrice” and “LowestTradingPrice” both denote prices; therefore, we consolidated them into a unified entity called “Price”. This integration strategy resulted in a total of 10 categories of entities for the ChFinAnn dataset. The seven entities in the AdminPunish dataset remained unchanged.
Table 2 presents the names of the entities and their respective proportions in the associated datasets.
4.2. Data Process
Traditional preprocessing approaches usually involve mapping the character IDs in a sentence to a multi-dimensional space and then processing these multi-dimensional data using a word2vec model. To enable end-to-end model building, a new approach is to use an embedding layer instead of a word2vec model for the transformation of multi-dimensional features. In this approach, the preprocessing stage simply converts the characters in the sentence to IDs and converts the entities to the corresponding category IDs, thus providing more efficient and accurate input data for the subsequent model. This approach not only simplifies the whole processing flow, but also improves the performance and efficiency of the model.
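A minimal sketch of this character-to-ID preprocessing followed by a trainable embedding layer is shown below (the vocabularies and dimensions are illustrative placeholders, assuming they are built from the training corpus):

```python
import torch
import torch.nn as nn

# Hypothetical vocabularies built from the training corpus.
char2id = {"<pad>": 0, "<unk>": 1}      # character -> id
label2id = {"O": 0}                      # entity tag -> id

def encode_sentence(sentence, tags):
    """Map characters and entity tags to integer ids."""
    char_ids = [char2id.get(ch, char2id["<unk>"]) for ch in sentence]
    tag_ids = [label2id[t] for t in tags]
    return torch.tensor(char_ids), torch.tensor(tag_ids)

# The embedding layer replaces word2vec and is trained end to end
# together with the rest of the model.
embedding = nn.Embedding(num_embeddings=len(char2id), embedding_dim=100,
                         padding_idx=char2id["<pad>"])
# features = embedding(char_ids)          # (seq_len, 100)
```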
4.3. Evaluation Indicators
In named entity recognition tasks, the evaluation criteria usually include precision, recall and the F1 value. The F1 score is a widely used metric for assessing the effectiveness of classification models and can be calculated for each category (e.g., address, book, company, etc.). Precision is the proportion of predicted positive samples that are actually positive, while recall is the proportion of actual positive samples that are correctly predicted. The F1 score is the harmonic mean of precision and recall, providing a more comprehensive evaluation of the model’s performance by incorporating both metrics.
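For reference, with TP, FP and FN denoting the numbers of true positives, false positives and false negatives at the entity level, the three metrics are computed as:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```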
4.4. Experimental Comparison
We replaced the feed-forward sublayer with a BiLSTM to optimize the Transformer structure. After conducting thorough comparative experiments, the findings indicate that the enhanced Transformer–BiLSTM–CRF model outperforms the LSTM, BiLSTM, CRF, Transformer and their individual combination models [28]. Based on the data in Table 3, we trained nine different models and obtained their precision, recall and F1 values for detailed comparison and analysis.
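A rough sketch of such a modified encoder layer, in which the position-wise feed-forward sublayer is replaced by a BiLSTM, is given below (an illustrative PyTorch-style reconstruction under our own assumptions about layer sizes, not the exact implementation of [28]):

```python
import torch.nn as nn

class TransformerBiLSTMLayer(nn.Module):
    """Encoder layer whose feed-forward sublayer is replaced by a BiLSTM."""
    def __init__(self, d_model=100, n_heads=4, hidden=128, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.bilstm = nn.LSTM(d_model, hidden, batch_first=True,
                              bidirectional=True)
        self.proj = nn.Linear(2 * hidden, d_model)    # project back to model width
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)                   # residual + layer norm
        lstm_out, _ = self.bilstm(x)
        x = self.norm2(x + self.proj(lstm_out))        # BiLSTM in place of the FFN
        return x
```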
Compared with simple neural networks, LSTM has obvious advantages in processing sequence data. LSTM is a type of recurrent neural network architecture designed to manage the flow of information using input, forget and output gates, enabling it to effectively capture long-term dependencies within sequential data [29]. Compared with traditional feed-forward deep neural networks, LSTM can better capture long-term features in sequence data and is suitable for tasks that require memorizing long-distance information.
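In standard notation, the three gates and the cell update of an LSTM at time step t can be written as:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```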
For tasks involving more intricate sequence data, a BiLSTM is better suited than a unidirectional LSTM. A BiLSTM combines forward and backward information flows around the current time step, so it can better capture the features and patterns in time-sequential data, improving the modeling of sequence data and making it suitable for tasks that need to consider contextual information [30].
Furthermore, the Transformer is a neural network structure that utilizes a self-attention mechanism, stacked with layers of feed-forward networks. It is adept at dealing with dependencies between positions in an input sequence, is better able to learn long-range dependencies, and is good at extracting underlying features from data embedded in a high-dimensional space. However, the Transformer does not perform well in tasks that require complex combinations of features, because its self-attention mechanism attends over the whole sequence and does not explicitly model local sequential patterns, so it cannot directly accomplish some tasks that require complex feature combinations.
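The self-attention operation referred to here is the standard scaled dot-product attention of the Transformer, where Q, K and V are the query, key and value matrices and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```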
Since traditional machine learning heavily depends on hand-crafted features and feature selection, CRF by itself does not show good performance. However, with the introduction of BiLSTM for feature extraction, CRF can improve its results by effectively combining features through its transition matrix. Superior results can be achieved by more intricate neural network structures. The newly proposed model integrates the strengths of multiple models and surpasses the performance of the other models under comparison. The introduction of the Transformer to extract the underlying features is also an important improvement. The improved Transformer uses a BiLSTM instead of the traditional feed-forward network and performs much better in extracting the underlying temporal features. The BiLSTM assists in more effectively extracting and integrating features. Ultimately, the probability distribution of the output is refined by training the transition matrix with the CRF.
On the test few-shot dataset, the new model achieves a precision of 84.5%, a recall of 85.5%, and an F1 value of 0.85, outperforming the other compared models. The results indicate that the proposed model has made substantial progress on this task, showcasing its effectiveness and superiority.
4.5. Ablation Experiment
4.5.1. Impact of the Size of the Training Sample
Additionally, the performance of the FSPN-NER model was evaluated with varying training sample sizes. As illustrated in Table 4 and Table 5, the F1 scores of the FSPN-NER models trained with distinct sample sizes exhibit notable discrepancies on the test samples. The F1 scores of the FSPN-NER model improve gradually as the training sample size increases, and in comparison with the BiLSTM-CRF model, the FSPN-NER model performs more robustly, further substantiating its superiority.
Figure 2 presents the F1 scores obtained from the ChFinAnn and AdminPunish datasets, respectively. A comparison of the two datasets reveals that the F1 scores from the AdminPunish dataset are generally higher than those from the ChFinAnn dataset. This is mainly due to the entity sparsity phenomenon observed in the ChFinAnn dataset.
4.5.2. Impact of Embedded Dimensions
In our study, we examined the effect of word vector dimensionality on model performance. We utilized four distinct dimensions of pre-trained word vectors: 50, 100, 200 and 300. According to the data in Table 6, when the word vector dimension is 100, precision and recall are higher, and the F1-score reaches its highest value of 73.50. Hence, we set the dimension of the pre-trained word vectors to 100 in order to achieve optimal model performance. However, when the dimensionality of the word vectors is increased to 200 or 300, the F1-score starts to decrease, which indicates that a larger word vector dimension is not necessarily better.
According to Figure 3, this phenomenon can be explained by the fact that although increasing the dimensionality of the embedded word vectors may capture more information and features, it also adds to the complexity of the model and the use of computational resources, and may even lead to overfitting. Furthermore, increasing the dimensionality of embedded word vectors also increases the training and inference time of the model, which is undesirable, especially when resources are limited. Therefore, in practical applications, we need to weigh the impact of word vector dimensionality on model performance and avoid blindly pursuing higher dimensionality while neglecting the balance between performance and efficiency.