Article

Enhancing Privacy While Preserving Context in Text Transformations by Large Language Models

by Tymon Lesław Żarski * and Artur Janicki *
Faculty of Electronics and Information Technology, Warsaw University of Technology, 00-665 Warsaw, Poland
*
Authors to whom correspondence should be addressed.
Information 2025, 16(1), 49; https://doi.org/10.3390/info16010049
Submission received: 21 October 2024 / Revised: 3 January 2025 / Accepted: 11 January 2025 / Published: 14 January 2025

Abstract

Data security is a critical concern for Internet users, primarily as more people rely on social networks and online tools daily. Despite the convenience, many users are unaware of the risks posed to their sensitive and personal data. This study addresses this issue by presenting a comprehensive solution to prevent personal data leakage using online tools. We developed a conceptual solution that enhances user privacy by identifying and anonymizing named entity classes representing sensitive data while maintaining the original context by swapping source entities for functional data. Our approach utilizes natural language processing methods, combining machine learning tools such as MITIE and spaCy with rule-based text analysis. We employed regular expressions and large language models to anonymize text, preserving its context for further processing or enabling restoration to the original form after transformations. The results demonstrate the effectiveness of our custom-trained models, achieving an F1 score of 0.8292. Additionally, the proposed algorithms successfully preserved context in approximately 93.23% of test cases, indicating a promising solution for secure data handling in online environments.

1. Introduction

The exponential growth of online activities has intensified concerns over user data security and privacy. The vast volume of data generated and collected, combined with technological advancements, brings new attack vectors and technical oversights. The emergence of the latest artificial intelligence (AI) tools has increased the temptation to process data or solve problems rapidly, and this urgency often leads users to overlook the risk of leaking sensitive personal data in exchange for time savings. To address these risks, measures such as limiting data retention have been introduced to prevent AI systems from unintentionally learning sensitive information; this remains a concern even under OpenAI’s Zero Data Retention policy. While OpenAI, in some cases, does not use these data to train models, it may retain them briefly to ensure user interactions align with the intended use and are not harmful [1]. In 2023, 74.4% of the global population (5.16 billion people) were active Internet users, marking a 1.9% increase from 2022 [2]. An estimated 149 zettabytes of data were generated [3], underscoring the immense scale of user-generated content (UGC) [4], whose main sources were the following:
  • Email services: 333.2B emails sent daily;
  • Instant messaging: 41.7M WhatsApp messages and 23.04B SMS texts sent daily;
  • Online social networks (OSN): X (650M daily tweets), Instagram (67.3M daily captions), and Facebook (734.4M daily comments and 421.9M daily status updates);
  • AI tools and services [5]: OpenAI’s ChatGPT (1.5B monthly visits) and CharacterAI (318.8M unique monthly visitors).
Previously mentioned textual data sources, among others, are central to current AI and digitization trends, showcasing versatility in applications. These range from commercial companies leveraging email content in prompts for large language model (LLM) and Retrieval Augmented Generation frameworks to private users transforming or extracting insights from their OSN messages or posts. However, as the reliance on textual data grows, so do concerns about personal data privacy. This has led to the development of various anonymization tools and techniques to protect sensitive information in textual data. Existing solutions are based on frameworks like Stanford’s CoreNLP [6] or spaCy [7] that primarily utilize machine learning (ML) named entity recognition (NER) approaches and traditional natural language processing (NLP) techniques to detect personally identifiable information (PII). Beyond NER techniques, alternative methods like differential privacy [8] or homomorphic encryption [9] have gained traction in recent research. Techniques such as data masking, k anonymity [10], l diversity [11], and fictional LLM data generation [12] continue to evolve, offering additional privacy protection by altering or generalizing data to prevent re-identification.
In this paper, we rely on the definitions of personal and sensitive data established by the European Parliament and the Council of the European Union, particularly in the General Data Protection Regulation (GDPR) [13]. In essence, any set of information that, when combined, can identify an individual is considered personal or sensitive data. It is important to note that even if personal information has been de-identified, encrypted, or pseudonymized, it remains protected under the GDPR if it can be used to re-identify a living person.
This research addresses these challenges by developing an efficient and adaptable conceptual solution to achieve higher detection accuracy while preserving the semantic integrity of the data. Unlike traditional solutions, CleanText leverages bespoke algorithms and uses LLMs to handle unstructured data effectively, ensuring that the anonymized text remains valuable for downstream applications. We envision the ideal application of our work as either a solution tailored for the end user or a highly customizable, independent platform. In this setup, the algorithms operate in the back end as a proxy for data processing. This approach is advantageous when using online services that lack on-premise deployment options, have terms of service that do not guarantee data security, or are prohibitively expensive for enterprise use. By incorporating a comprehensive suite of privacy-preserving techniques, CleanText addresses these limitations and offers a solution for data de-anonymization. The main contributions of this paper can be summarized as follows:
  • Training and tuning of models from the chosen toolkit libraries (alongside publishing of the Polish MITIE embedding model);
  • Evaluation of created models;
  • Design and implementation of algorithms for data (de-)anonymization;
  • Testing of the effectiveness of the complete system with a selected ML model in a possible real-life use case.

1.1. Structure

The rest of this paper is organized as follows: Section 1.2 contains a brief review of related work in the field of sensitive data detection and anonymization. Section 2.1 describes the construction of the proposed sensitive data detection system in detail, and Section 2.2 presents the algorithms for anonymizing and de-anonymizing the processed text. Section 3 covers the initial experiments using the created NER models and the finished system and shows their results. The article ends with a discussion of the results (Section 4) and conclusions (Section 5).

1.2. Related Work

To address the challenges of NER, prior work has explored diverse approaches, including rule-based systems [14,15], ML algorithms [16,17,18], and hybrid models that combine multiple techniques for enhanced performance. Sensitive data detection is most often tackled with NER techniques, where hybrid systems [19,20] that combine ML and rule-based NER methods yield the best results.
Several studies have focused on extracting and classifying sensitive data to aid organizations in legal and security compliance. For instance, ref. [21] developed a component for extracting and classifying sensitive data from unstructured text in European Portuguese. They utilized a hybrid approach for NER that combines rule-based methods, ML, and neural network techniques. Their bidirectional long short-term memory model achieved the highest performance, with an F1 score of 83.01%, significantly outperforming conditional random fields, which scored 65.50%. A common challenge identified in such studies is the lack of publicly available labeled datasets, particularly in the financial sector. Another study [22] demonstrated the potential of LLMs, such as fine-tuned KB-BERT, for detecting PII in Swedish learner essays, achieving up to 90.17% recall. Their study highlights the ability of LLMs to approximate human intuition in identifying sensitive information in unstructured and varied text. The authors of [23] developed a novel hybrid system called CASSED, leveraging the FLAIR [24] framework in PyTorch [25] to detect structured sensitive data. Their method creates a column context by combining metadata and cell values, which is processed as a single input for a natural language embedding model, specifically using BERT for classification. This system, enhanced by rule-based methods, outperforms existing models in precision, recall, and F1 score on the DeSSI [26] dataset and significantly improves results on the WikiTables dataset.
Moreover, the following solutions are fundamentally relevant to our research, demonstrating advanced techniques in sensitive data detection and de-identification, which are critical for ensuring data privacy and security in real-world applications. In another study [12], the authors developed a robust and flexible system for detecting and anonymizing sensitive information in text using open-source LLMs integrated via LangChain. The system achieved a 99% success rate across 100 test files, significantly outperforming standalone LLMs while ensuring compliance with the GDPR. Its adaptable design supports the integration of various LLMs, enabling applications across industries such as healthcare, legal, and customer service, and offers potential for further optimization and model enhancements. Similarly, ref. [27] proposed a method for de-identifying personal and sensitive information by detecting named entities using spaCy’s NER based on a convolution neural network (CNN) and replacing them with class names to protect privacy in legal documents. This approach underscores the importance of masking confidential information to develop ML models without compromising data privacy. Furthermore, ref. [28] introduced DePrompt, a framework designed to enhance the privacy and effectiveness of prompts used with LLMs. By employing fine-tuning techniques and contextual attribute analysis, DePrompt achieves high-precision identification of PII while preserving semantic content through adversarial generative desensitization methods. Experimental evaluations demonstrate that DePrompt outperforms benchmarks, providing superior privacy protection and maintaining high-quality model inference results, making it adaptable to diverse text usability scenarios.
Our literature review summary (see Table 1) shows that existing solutions mainly focus on sensitive data detection. Recent work also addresses entity anonymization, but rarely bidirectional privatization behavior that allows for data retrieval. The proposed solution follows state-of-the-art standards by employing detection based on a hybrid NER system, combined with novel algorithms for data anonymization and de-anonymization utilizing an LLM, allowing the context to be preserved after text transformation.

2. Materials and Methods

The main aim of our experiment was to understand if it is possible to utilize LLM and NER technologies to anonymize text successfully while preserving its context, with the ability to de-anonymize sensitive and private data later. For testing, we established a system processing flow (illustrated in Section 3.2) consisting of four main modules, seamlessly integrated with OpenAI services, as depicted in Figure 1. This section delves into the sensitive data detection microservice, focusing on the NLP pipeline, NER model preparation, and the design of (de-)anonymization algorithms.

2.1. Creating a Sensitive Data Detection System

Beyond the pipeline implementation, we need to discuss the NER tools used within the system and the dataset analysis needed to determine the final corpus. Based on the differences in the implementation, we decided to split the detected categories into ML (detected by model) and rule-based (extracted by regular expressions) approaches. Finally, the following subset of categories was proposed after setting the limitation for the experiments and understanding the definitions of sensitive and personal data:
  • ML approach:
    -
    Name and surname—PERSON;
    -
    Addresses—LOCATION;
    -
    Public companies and corporate organizations—ORGANIZATION;
    -
    Racial or ethnic origin—ORIGIN.
  • Rule-based approach:
    -
    Email addresses—EMAIL;
    -
    Website URLs—LINK.
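The rule-based categories lend themselves to a compact regex sketch. The patterns below are illustrative assumptions, not the exact expressions used in the system, which must also handle predicate match validation and more edge cases:

```python
import re
from typing import Dict, List

# Illustrative patterns only, not the exact expressions used in the system;
# production patterns must cover more edge cases (internationalized domains,
# quoted local parts, trailing punctuation, etc.).
RULE_PATTERNS: Dict[str, re.Pattern] = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "LINK": re.compile(r"https?://[^\s<>\"]+|www\.[^\s<>\"]+"),
}

def detect_rule_based(text: str) -> List[dict]:
    """Return entities found by the regex-based detectors, with character
    offsets so they can be merged with ML-detected entities."""
    entities = []
    for label, pattern in RULE_PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append({"label": label, "text": match.group(),
                             "start": match.start(), "end": match.end()})
    return entities
```

Returning character offsets alongside labels keeps the rule-based output interchangeable with the ML detector’s output format.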
We observed the available spectrum of inside–outside–beginning-labeled (IOB) datasets to fulfill the above-mentioned category requirements. However, none of the datasets individually met all the necessary criteria. To this end, we analyzed several datasets, including MultiNERD [29], WikiNEuRal [30], WikiNER [31], CEN [32], NKJP [33], and KPWR [34]. All datasets were evaluated after being remapped to a unified set of labels. The label density was then analyzed to ensure the combined dataset maintained a balanced label distribution without disproportionately under-representing any classes. The most complete and balanced combination is a mix of the CEN and KPWR datasets. Together, they cover all categories of requirements. The final dataset entity counts are presented in Table 2.
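The remapping of the source tag sets onto the unified labels can be sketched as follows. The mapping table here is hypothetical; the actual correspondence between the CEN/KPWR tag sets and the unified labels is defined by the combined corpus:

```python
from typing import List

# Hypothetical source-to-unified label mapping; the actual correspondence
# between the CEN/KPWR tag sets and the unified labels is corpus-specific.
LABEL_MAP = {
    "nam_liv_person": "PERSON",
    "nam_loc": "LOCATION",
    "nam_org": "ORGANIZATION",
    "nam_adj_country": "ORIGIN",
}

def remap_iob(tags: List[str]) -> List[str]:
    """Remap IOB tags such as 'B-nam_loc' onto the unified label set;
    tags without a mapping fall back to the outside tag 'O'."""
    remapped = []
    for tag in tags:
        if tag == "O":
            remapped.append("O")
            continue
        prefix, _, label = tag.partition("-")
        target = LABEL_MAP.get(label)
        remapped.append(f"{prefix}-{target}" if target else "O")
    return remapped
```

Dropping unmapped labels to `O` rather than keeping them avoids introducing classes outside the chosen subset, at the cost of discarding those annotations.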
Among all the available tools, experiments were conducted on two toolkits: spaCy and MIT Information Extraction (MITIE) [35]. These two were selected as candidates because their NER model training hinges on distinctly different approaches.
SpaCy’s NER system employs CNN-based deep learning models to identify named entities. It starts by tokenizing text and converting tokens into numerical representations using Word2Vec embeddings, capturing their meaning and context. Features are extracted through CNN filters and classified using a linear layer with a softmax function. Post-processing ensures accurate and consistent entity identification, with options for fine tuning to improve accuracy. Conversely, MITIE uses a support vector machine (SVM) model with embeddings generated by the wordrep tool via two-step canonical correlation analysis [36]. MITIE’s training involves segmenter and classifier phases on pre-trained embeddings and linguistic features, with non-tunable hyperparameters determined by the trainer.
Unfortunately, at the time of our experiments, the MITIE library did not provide an embedding model for the Polish language. As a by-product of our work, we trained a Polish embedding model [37] on 45 GB of data from the Universal Dependencies [38] dataset (the model can be accessed at https://huggingface.co/tymzar/mitie-polish (accessed on 2 January 2025)).
The core of the microservice designed for the tests is the NLP pipeline, which combines ML and rule-based NER approaches. This part of the system needs to be flexible and modular, rather than hard-coded, to prepare the system for future work and improvements. To make the pipeline as non-restrictive as possible, we propose that the class constructor take three parameters: a list of steps, the name of the available NER model, and an input. All parameters should have predefined type aliases for a developer-friendly implementation, and the key feature of the pipeline is the ability to process steps asynchronously.
The pipeline runner first aggregates consecutive steps whose definitions set a boolean aggregation flag to true. Declared steps are then processed individually, or in parallel when an aggregated block is encountered. Finally, each step's execution returns a dictionary containing only newly detected entities, rather than the next pipeline state, to avoid overriding data during multiprocessing. This is enormously beneficial for the rule-based part of the system, because all of its steps are processed simultaneously and the computational overhead of regex processing and predicate match validation is significantly reduced.
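A minimal sketch of such a runner, assuming thread-based concurrency for the aggregated blocks (the concrete concurrency mechanism and step signature are not specified in the paper):

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class PipelineStep:
    name: str
    run: Callable[[str], Dict]      # returns only newly detected entities
    aggregate: bool = False         # True: may run concurrently with neighbors

def run_pipeline(steps: List[PipelineStep], text: str) -> Dict:
    """Execute steps in order; consecutive steps flagged `aggregate` form
    a block that is executed concurrently. Each step contributes only a
    dictionary of new entities, so no step overrides the pipeline state."""
    results: Dict = {}
    block: List[PipelineStep] = []

    def flush(block: List[PipelineStep]) -> None:
        if len(block) == 1:
            results.update(block[0].run(text))
        elif len(block) > 1:
            # Aggregated block: run all steps simultaneously and merge.
            with ThreadPoolExecutor() as pool:
                for partial in pool.map(lambda s: s.run(text), block):
                    results.update(partial)

    for step in steps:
        if step.aggregate:
            block.append(step)
        else:
            flush(block)
            block = []
            results.update(step.run(text))
    flush(block)
    return results
```

Because every step returns only its own findings, merging is a plain dictionary update and no locking of shared pipeline state is needed.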

2.2. Designing Algorithms for Sensitive Data Privatization

The most significant contribution of our work is a set of generic algorithms designed to enhance the privacy of processed unstructured data. These algorithms leverage NER models to detect sensitive and private data and replace them with fictional information while maintaining the original context. To achieve this, we developed encoding and decoding methods. We present these algorithms in pseudo-code to emphasize their agnostic nature regarding PII detection methods, which may benefit future researchers. The solution holds potential value for back-end servers of browser extensions designed to oversee the prompt input fields of various LLM chatbot services. Additionally, it could be integrated into more complex systems that utilize LLM APIs for tasks such as text transformation. The following walk-through is based on the pipeline and NER model introduced in Section 3.
The encoding algorithm protects personal data by substituting identifiable information with imaginary data. The process described by Algorithm 1 begins by analyzing the input text using an NLP pipeline that identifies and tags named entities. Each detected entity is then replaced with a unique “wild tag”, which is formatted as <NER_LABEL "HASH">, where NER_LABEL is the type of the named entity and HASH is a cryptographic hash derived from the entity’s content. This ensures that the original information cannot be easily retrieved or guessed without access to the corresponding decryption key, which introduces an abstract layer for data manipulation. These mappings between the original entities and their respective wild tags are stored locally in a hashmap for later use.
Algorithm 1: Anonymization of sensitive data
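The wild-tag substitution described in Algorithm 1 can be sketched as follows, assuming a truncated SHA256 digest as the HASH component (the exact hash function and digest length are implementation choices not fixed by the paper):

```python
import hashlib
from typing import Dict, List, Tuple

def anonymize(text: str, entities: List[dict]) -> Tuple[str, Dict[str, str]]:
    """Replace each detected entity with a wild tag <LABEL "HASH"> and
    return the anonymized text together with the local restore map."""
    tag_map: Dict[str, str] = {}
    # Replace from the end of the text so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        # Assumption: an 8-hex-character SHA256 prefix as the tag's HASH part.
        digest = hashlib.sha256(ent["text"].encode("utf-8")).hexdigest()[:8]
        tag = f'<{ent["label"]} "{digest}">'
        tag_map[tag] = ent["text"]
        text = text[:ent["start"]] + tag + text[ent["end"]:]
    return text, tag_map
```

The returned map never leaves the local system; only the tagged text is sent onward, which is what introduces the abstract layer for data manipulation.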
Once all named entities have been replaced with wild tags, the text is fed into an LLM. We used OpenAI’s gpt-4o-mini via the published batch application programming interface (API) [39] for our experiments. The LLM uses a custom prompt with instructions to replace each wild tag with a random piece of imaginary information, further obfuscating the original data, and generates a response containing the modified text along with a hash map documenting the substitutions. A new map is then generated and saved, linking each wild tag to the random information provided by the LLM. This map is crucial for the reverse process of decoding, where the original data must be reconstructed from the obscured text.
The decoding algorithm is the counterpart to the encoding process, enabling the retrieval of the original personal data from the encoded text. The decoding flow shown in Algorithm 2 begins by retrieving the hashmap that the LLM produced during the encoding phase, which contains mappings from wild tags to randomly generated imaginary information. This hashmap is passed back to the LLM within the prompt, instructing the replacement of the imaginary information in the encoded text with the corresponding wild tags. To address any context issues, the template for the prompt explicitly contains instructions for the LLM to look for the fictional entities available in the map and ensure that the given label inside the wild tag matches the replaced predicate context in the anonymized text. Additionally, we instruct the LLM to output only the portions of text containing the wild tags, ensuring precise and contextually accurate responses.
Algorithm 2: De-anonymization of sensitive data
Following this, a script processes the partially reconstructed text, now containing wild tags instead of random information. The script accesses a previously saved map with associations between wild tags and the original named entities. By iterating through the available wild tag keys, the text is gradually restored to its original form by substituting each wild tag with its corresponding original entity from the map.
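The two-stage restoration can be sketched as follows. In the actual system, the first substitution is performed by the LLM, which can also match inflected or contextually altered forms; plain string replacement is a simplification for illustration:

```python
from typing import Dict

def restore(text: str, llm_map: Dict[str, str], tag_map: Dict[str, str]) -> str:
    """Stage 1: put wild tags back in place of the fictional entities
    (done by the LLM in the real system, which tolerates inflection and
    context shifts). Stage 2: substitute each wild tag with the original
    entity from the locally saved map."""
    for tag, fictional in llm_map.items():
        text = text.replace(fictional, tag)
    for tag, original in tag_map.items():
        text = text.replace(tag, original)
    return text
```

Keeping the two maps separate means the LLM only ever sees wild tags and fictional values, never the original entities.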

3. Experiments and Results

As discussed in the previous section, the proposed methods require an NLP pipeline with an NER model to be plugged into the described algorithms. We use the created IOB-labeled dataset to train both the spaCy and MITIE models. Once the preferred ML model is selected, we incorporate it into the processing pipeline utilized by the previously mentioned algorithms. The processes of obtaining and evaluating optimal models, preparing data for algorithm evaluations, performing final tests, and interpreting the obtained results are discussed in the following sections.

3.1. Machine Learning Model Selection

This subsection presents an analysis and comparison of models from the already mentioned toolkits, spaCy and MITIE, to select the most efficient detection tool for further development. To evaluate the performance of the NER models, we employed a range of metrics, including F1 score, accuracy, precision, and recall, along with confusion matrices for visual presentation (Equation (1)).
$$\text{Confusion Matrix} = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1j} \\ C_{21} & C_{22} & \cdots & C_{2j} \\ \vdots & \vdots & \ddots & \vdots \\ C_{i1} & C_{i2} & \cdots & C_{ij} \end{pmatrix} \quad (1)$$
where $C_{ij}$ denotes the number of instances of actual class $i$ predicted as class $j$.
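As an illustration of how the reported per-class metrics follow from Equation (1), precision, recall, and F1 can be computed directly from such a matrix (a sketch, not the evaluation code used in the study):

```python
import numpy as np

def per_class_f1(cm: np.ndarray) -> list:
    """Per-class F1 from a confusion matrix where cm[i, j] counts
    instances of actual class i predicted as class j."""
    scores = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]                      # diagonal: correct predictions
        col, row = cm[:, k].sum(), cm[k, :].sum()
        precision = tp / col if col else 0.0   # of everything predicted as k
        recall = tp / row if row else 0.0      # of everything actually k
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(float(f1))
    return scores
```

Off-diagonal mass in a row or column immediately lowers recall or precision for that class, which is why the confusion matrices in Figures 5 and 6 are read alongside the aggregate scores in Table 3.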
Each model was trained on the training (19,750 sentences) and validation (3041 sentences) dataset splits. In addition, the CNN model was fine-tuned based on version two of spaCy’s TransitionBasedParser [40] architecture. To optimize performance, we experimented with combinations of parameters such as batch size, learning rate, hidden-layer width, and maxout pieces with the help of the Weights & Biases (wandb) [41] platform. Using the Bayes search method, the best-tuned network used a batch size of 256, a learning rate of 0.0006121, a hidden-layer width of 64, a dropout of 0.08256, and 3 maxout pieces, which resulted in an F1 score of 0.8292. Figure 2 presents the runs with the selected parameters plotted against the obtained F1-score metric.
We conducted a hyperparameter importance analysis [42] using wandb to refine our optimization process further. This analysis uses a tree-based model to assess each parameter’s impact on the chosen metric; it identified dropout (importance: 0.461; correlation: −0.842) and maxout pieces (importance: 0.432; correlation: 0.740) as the most influential parameters. The results of this analysis helped us understand how to tune the most impactful parameters, improving the efficiency of our optimization.
Finally, the classification models were evaluated using a test dataset (3066 sentences), and results were collected in Table 3. Preliminary performance analysis extracted from the training data showed that MITIE, in contrast to spaCy, obtained noticeably better results in classifying categories containing fewer samples.

3.2. Anonymization of Sensitive Data Using CleanText Pipeline Integration

To determine how the proposed flow and algorithms handle text with sensitive data, we prepared an anonymizer tool to securely and quickly handle users’ data. Therefore, a module in the dashboard application (Figure 1) was developed to process the data in the controlled environment.
The web tool was developed using React and Next.js, which provided a robust foundation for building a high-performance and dynamic user interface. The tool features a block editor implemented with the Editor.js library, enabling rich text and content editing capabilities. A WebSocket-based web service was integrated to optimize latency by loading models dynamically during connection initialization, ensuring efficient performance. The user interface was designed with Tailwind CSS and NextUI, leveraging their customizable design systems and prebuilt components to enhance flexibility and maintain a cohesive visual style.
To provide the best possible user experience, the web page has an intuitive interface (Figure 3) that includes two text editors: one for text containing sensitive data and the other for anonymized or processed text. The interface also includes a statistics counter tile with a drop-down list for selecting the model used for detection, as well as controls for performing (de-)anonymization actions. The entire process of utilizing the proposed flow can be described in the following steps:
  • User initializes the session by signing in to the dashboard using credentials;
  • User navigates to the detection module by clicking the detection link (Figure 3);
  • User pastes or types in the text containing sensitive information to be anonymized;
  • Text is automatically annotated by the NER pipeline 1500 milliseconds after the last input (Figure 4b);
  • Annotated text is anonymized by pressing the anonymize text button (Figure 4a);
  • User performs operations on the anonymized text, e.g., uses an LLM to summarize the text;
  • After modification, text is pasted back into the anonymized text editor (Figure 4c);
  • Text is de-anonymized, and real information is presented after pressing the revert anonymized text button (Figure 4c).
Figure 3 and Figure 4 depict the tool interface in particular states: the default module state (Figure 3), the statistics counter after annotation (Figure 4a), the editor with annotated text (Figure 4b), and the (de-)anonymization results obtained by the algorithm (Figure 4c,d).

3.3. Tests Performed on the CleanText Pipeline

Although the proposed interface achieves promising results in data anonymization, more extensive tests must be conducted on the mentioned process to assess its performance and reliability. The scheme for the final evaluation of the CleanText pipeline is based on processing 10,000 news headlines from the Polish-news [43] dataset, which contains almost 250,000 articles. The goal is first to detect all entities in a given headline and then to swap them for fictional examples using the anonymization algorithm. The text is then summarized using the LLM. Finally, the original data are restored using the de-anonymization algorithm, and the results are assessed based on the detected issues and preserved context. The results of the performed tests are presented in Section 4. We used the OpenAI Batch API at every step of the process to reduce costs and increase efficiency. The system tests are outlined as follows:
  • A total sample of 10,000 news articles was generated using the same seed as for the dataset splits.
  • All the chosen articles were processed in bulk with the sensitive data detection pipeline, saving the result as an OpenAI batch file.
  • All sensitive entities were anonymized by converting them to hashes and replacing them with fictional data using an LLM for every article at once.
  • The results were parsed, saving hash maps for corresponding headlines.
  • The batch was prepared for content summarization.
  • Summarized files were collected and combined with previously saved maps for data de-anonymization.
  • The final summarized texts with retrieved data were extracted, with the restored data compared with the original data using NLP and static comparison techniques.

3.3.1. Anonymization of the Detected Entities

The testing process started with sampling the Polish-news dataset and selecting headlines with over 35 words containing at least five sensitive entities. The algorithm later processed all the samples, formed them into a batch, and dispatched them to the OpenAI API endpoint. The process finished after 1 h 45 min, yielding an output with all of the responses from the processed batch. Next, the results were carefully evaluated based on three main factors: whether the results were parsable, whether the format was in accordance with the schema, and whether the replaced entities were correctly mapped to the corresponding wild tags. Table 4 contains the findings of the evaluation. We can see that the first step of our flow was successful in 94.58% of the tested cases, despite the use of a smaller model and the non-deterministic behavior of the LLM. The leading cause of failures was incorrect entity map generation, which, in the failed cases, missed some of the entities or produced JavaScript Object Notation (JSON) that was impossible to parse due to syntax errors or incorrect object keys.
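The three evaluation factors can be sketched as a simple classifier over raw batch responses. The response keys ("text", "map") are assumptions for illustration, since the exact response schema is not published:

```python
import json
from typing import List

def evaluate_batch_line(raw_line: str, expected_tags: List[str]) -> str:
    """Classify one batch response as 'ok' or one of the failure modes
    counted in the evaluation: non-parsable output, schema mismatch,
    or an entity map missing some of the expected wild tags.
    The keys "text" and "map" are hypothetical schema assumptions."""
    try:
        payload = json.loads(raw_line)
    except json.JSONDecodeError:
        return "non_parsable"
    if not isinstance(payload, dict) or "text" not in payload or "map" not in payload:
        return "schema_mismatch"
    if any(tag not in payload["map"] for tag in expected_tags):
        return "incomplete_map"
    return "ok"
```

Tallying these categories over all responses yields success-rate tables of the kind reported in Table 4.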

3.3.2. LLM-Anonymized News Content Transformation

For this case scenario, we used a simple prompt—“My goal is to paraphrase the text in Polish. If you input the text in Polish, I will provide the output in the form of paraphrased text in Polish”—to paraphrase all of the correctly anonymized news headlines. The outcome of this step was assessed by comparing the SHA256 [44] hash of the whitespace-trimmed headline before and after paraphrasing. The results in tabular form can be found in Table 5.
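The hash comparison can be sketched as follows; we assume “whitespace-trimmed” means stripping the surrounding whitespace before hashing, which is a reading of the paper rather than a documented detail:

```python
import hashlib

def was_paraphrased(before: str, after: str) -> bool:
    """Compare SHA256 digests of the whitespace-trimmed texts; differing
    digests indicate the LLM actually changed the headline."""
    def digest(text: str) -> str:
        return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
    return digest(before) != digest(after)
```

Hashing makes the equality check cheap and order-independent when comparing thousands of headline pairs in bulk.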

3.3.3. De-Anonymization of the Detected Entities

Following the paraphrasing of the anonymized news headlines, the next step involved de-anonymizing the previously masked entities. This process aimed to restore the original named entities using an entity map that had been stored prior to anonymization. This entity map was served in the prompt as a reference for the LLM to reinsert the correct names or terms into the paraphrased text. The results of this process are summarized in Table 6. Most headlines were correctly de-anonymized, with entities matching the entity map. However, 127 headlines had missing or altered entities due to paraphrasing, and 7 cases failed as a result of an invalid format, such as non-parsable JSON or refusal to generate the response.

4. Discussion

With the advancement of AI and digitalization, users’ data are increasingly exposed to significant risks, prompting the development of enhanced privacy protection solutions, such as the DuckDuckGo browser. Evidence from the adoption of existing tools demonstrates a growing demand for such measures, as users actively seek technologies that provide a greater sense of security.
This study proposes a complex end-to-end solution for adding a layer of privacy to users’ textual data. CleanText analyzes text using a hybrid approach based on ML and rule-based NER, allowing for sensitive data removal by anonymization and, in the end, complete restoration of the previously masked entities, functioning as a proxy. The solution emphasizes essential features that create a flexible and generic framework.
Figure 5 and Figure 6 present the confusion matrices of both NER models. The analysis of the figures shows high precision in classifying rule-based entities, which was expected due to the predictable nature of the classes. Regarding the ML predictions, we notice the most consistent results for the ORIGIN category across the two models, likely due to its low correlation with other categories. However, some contextual overlaps created small challenges between ORIGIN and LOCATION classification. Additionally, the ORGANIZATION and PERSON classes are sources of clashes when company names include personal names. Similarly, ORGANIZATION is misclassified as LOCATION or PERSON when organizations are named after cities or their founders, such as “Sejm Warszawski”. To address issues with overlapping classes, we propose fine tuning of the NER model with an enriched training dataset that includes more examples of ambiguous entities in the given class. This thorough approach ensures that the model is well-equipped to handle a wide range of scenarios. Alternatively, we can use post-processing rules to ensure the labels of annotated entities are correct. Combining this information with the evaluation results from Table 3 (F1-score and accuracy metrics) and a complete reading of the confusion matrices, we concluded that MITIE is the best candidate for the default system model and tests. MITIE’s advantages in the F1-score and overall accuracy metrics (Table 3), as well as its less ambiguous and more consistent behavior (Figure 5), make it a compelling choice. In addition to these metrics, MITIE and the wordrep tool enable the expansion of multilingual support for morphological languages, requiring only large amounts of unstructured text in the target language.
Table 7 collects the results of the CleanText pipeline performance tests. The analyzed texts differed in complexity and length; the minimal requirement was 35 words and at least five entities present. The designed algorithms performed almost flawlessly, reaching an overall success rate of 93.23% on the chosen dataset and showing great potential in anonymization, with a correctness score of 94.58% and a de-anonymization level of 98.58% across various scenarios. On the other hand, more complex examples and the use of a smaller LLM could affect the results of the anonymization step, which may explain the roughly five percentage points lost in the overall success rate. After manually reviewing the failed de-anonymization examples, we concluded that samples containing long or adjacent entities of the same type were often incorrectly retrieved, primarily because the changed context later influenced the LLM’s text summarization. The main reason for these failures is the lack of clear boundaries between instances tagged with the inner and outer tags when the NER model detects entities. Hence, we again note that it is essential to optimize the NER model on larger and more diverse IOB-labeled datasets to achieve more precise detection and better instruct the LLM on replacing sensitive information.
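The boundary problem with adjacent same-type entities can be seen in a standard IOB-to-span conversion, sketched below (this is the generic IOB scheme, not the paper's exact decoding code): when the second of two adjacent entities is mistagged with an I- tag instead of a B- tag, the two entities fuse into a single span, which then corrupts the entity map used for restoration.

```python
def iob_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Collect (text, label) spans from IOB tags."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous span
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)  # continuation of the open span
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Anna", "Nowak", "Jan", "Kowalski"]
# A correct B- tag keeps the two adjacent names apart...
good = iob_to_spans(tokens, ["B-PERSON", "I-PERSON", "B-PERSON", "I-PERSON"])
# ...while a missing B- tag fuses them into one entity.
bad = iob_to_spans(tokens, ["B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON"])
```

Here `good` yields two PERSON spans while `bad` yields a single merged one, which illustrates why richer IOB training data directly improves downstream anonymization.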
Nevertheless, this study’s primary objective was to introduce ways to increase privacy while preserving the context of the text, and we can say that this objective was fulfilled. Although the approach may not be as secure as an on-premise solution provided by some service providers, it is an excellent alternative in situations where the service or tool provider does not offer such an option.

5. Conclusions

In this article, we presented the results of our work on a sensitive data anonymization system designed to increase privacy. Our case study focused on designing the system’s component algorithms to form a solution capable of increasing privacy by removing and later reinstating sensitive data, even after further processing, a feature missing from the most recent research. When implemented as a global service, the results of our work allow everyday and commercial users to use online LLM tools without risking their sensitive data, even when working with services that do not provide on-premise solutions guaranteeing data security.
During the research, we conducted a comparative analysis of NLP technologies for NER, experimenting with spaCy, an industry-standard framework, and a newly introduced MITIE model for the Polish language, ultimately selecting MITIE as the most suitable one. We also proposed and implemented algorithms for anonymization and de-anonymization that add a layer of privacy. An exemplary dashboard panel was proposed to visualize the processing flow and demonstrate the system’s capabilities as a sensitive data removal tool for on-demand text processing. Finally, the NER model and detection pipeline were evaluated, showing that the EMAIL, LINK, and ORIGIN categories had the highest F1 scores. The confusion between the ORGANIZATION, LOCATION, and PERSON categories is acceptable, given that some company names contain founders’ names or the names of locations such as cities or countries. The system also successfully anonymized and de-anonymized detected entities without changing the context in 93.23% of the test cases. From the performed tests, we discovered that the system struggled to recognize the boundaries between neighboring entities and to keep the correct response format.
Several improvements could enhance the NER system in future work. Starting with the limitations discovered in our research, category overlap can be resolved by introducing an additional entity class or supplying even more data. Further beneficial extensions include expanding the system’s coverage to more languages by parameterizing the pipeline, creating a balanced, manually annotated dataset, and experimenting with advanced models such as BERT or Llama. Moreover, training or fine-tuning open-source LLMs for sensitive data substitution and implementing a scoring system to evaluate anonymization accuracy are necessary. Additionally, increasing user accessibility through a browser extension for easier data anonymization in third-party web applications, or integrating the system into voice recognition pipelines for real-time entity removal from speech-to-text output during virtual meetings with AI assistants, could provide significant advantages.

Author Contributions

Conceptualization, A.J. and T.L.Ż.; methodology, T.L.Ż.; software, T.L.Ż.; validation, T.L.Ż.; formal analysis, T.L.Ż.; investigation, T.L.Ż.; resources, T.L.Ż.; data curation, T.L.Ż.; writing—original draft preparation, T.L.Ż.; writing—review and editing, A.J. and T.L.Ż.; visualization, T.L.Ż.; supervision, A.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset used in this study combines the CEN [32] and KPWR [34] datasets, which were selected for their comprehensive and balanced coverage of the required categories. Both datasets were remapped to a unified label set, and label density was evaluated to ensure a well-distributed representation across all entity classes. A Jupyter notebook with a simplified implementation of the algorithms used in this study is available upon request. The trained model used during our research experiments has been published and is available at https://huggingface.co/tymzar/mitie-polish (accessed on 1 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
API: Application Programming Interface
CNN: Convolutional Neural Network
GDPR: General Data Protection Regulation
IOB: Inside–Outside–Beginning
JSON: JavaScript Object Notation
LLM: Large Language Model
NER: Named Entity Recognition
NLP: Natural Language Processing
OSN: Online Social Network
PII: Personally Identifiable Information
RAG: Retrieval-Augmented Generation
SVM: Support Vector Machine
UGC: User-Generated Content
wandb: Weights and Biases

References

  1. Thakur, K. OpenAI Data Policies: Data Usage and Retention Period. DreamInForce. 2024. Available online: https://www.dreaminforce.com/openai-data-policies-data-usage-retention/ (accessed on 6 December 2024).
  2. Meltwater, W.A.S. Digital 2023 Global Overview Report. Available online: https://datareportal.com/reports/digital-2023-global-overview-report (accessed on 15 May 2023).
  3. Berisha, B.; Mëziu, E.; Shabani, I. Big data analytics in Cloud computing: An overview. J. Cloud Comp. 2022, 11, 24. [Google Scholar] [CrossRef]
  4. Rayaprolu, A. How Much Data Is Created Every Day in 2023. Available online: https://techjury.net/blog/how-much-data-is-created-every-day (accessed on 16 November 2023).
  5. Sarkar, S. AI Industry Analysis: 50 Most Visited AI Tools and Their 24B+ Traffic Behavior. Available online: https://writerbuddy.ai/blog/ai-industry-analysis (accessed on 9 February 2024).
  6. Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014; pp. 55–60. [Google Scholar]
  7. Honnibal, M.; Montani, I.; Van Landeghem, S.; Boyd, A. spaCy: Industrial-Strength Natural Language Processing in Python. 2020. Available online: https://zenodo.org/records/10009823 (accessed on 2 October 2024).
  8. Dwork, C.; Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar] [CrossRef]
  9. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Advances in Cryptology—ASIACRYPT 2017; Springer: Cham, Switzerland, 2017; pp. 409–437. [Google Scholar] [CrossRef]
  10. El Emam, K.; Dankar, F.K. Protecting privacy using k-anonymity. J. Am. Med. Inform. Assoc. 2008, 15, 627–637. [Google Scholar] [CrossRef]
  11. Li, T.; Li, N.; Cao, J.; Yang, W. Slicing: A New Approach for Privacy Preserving Data Publishing. IEEE Trans. Knowl. Data Eng. 2018, 24, 561–574. [Google Scholar] [CrossRef]
  12. Böhlin, F. Detection & Anonymization of Sensitive Information in Text: AI-Driven Solution for Anonymization. Bachelor’s Thesis, Linnaeus University, Sweden, 2024. Available online: https://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-131672 (accessed on 14 November 2024).
  13. European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council. Off. J. Eur. Union 2016, 119, 1–88. Available online: https://data.europa.eu/eli/reg/2016/679/oj (accessed on 6 December 2024).
  14. Madan, A.; George, A.M.; Singh, A.; Bhatia, M.P.S. Redaction of Protected Health Information in EHRs using CRFs and Bi-directional LSTMs. In Proceedings of the 7th International Conference on Reliability, Infocom Technologies and Optimization, Noida, India, 29–31 August 2018. [Google Scholar] [CrossRef]
  15. Neamatullah, I.; Douglass, M.M.; Lehman, L.W.; Reisner, A.; Villarroel, M.; Long, W.J.; Szolovits, P.; Moody, G.B.; Mark, R.G.; Clifford, G.D. Automated de-identification of free-text medical records. BMC Med. Inform. Decis. Mak. 2008, 8, 32. [Google Scholar] [CrossRef] [PubMed]
  16. Chen, T.; Cullen, R.M.; Godwin, M. Hidden Markov model using Dirichlet process for de-identification. J. Biomed. Inform. 2015, 58, S60–S66. [Google Scholar] [CrossRef]
  17. Shweta; Kumar, A.; Ekbal, A.; Saha, S.; Bhattacharyya, P. A Recurrent Neural Network Architecture for De-identifying Clinical Records. In Proceedings of the 13th International Conference on Natural Language Processing, Varanasi, India, 17–20 December 2016; Available online: https://aclanthology.org/W16-6325 (accessed on 30 October 2024).
  18. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  19. Holla, A.; Gaind, B.; Katta, V.R.; Kundu, A.; Kamalesh, S. Hybrid NER System for Multi-Source Offer Feeds. arXiv 2019, arXiv:1901.08406. [Google Scholar] [CrossRef]
  20. Meselhi, M.A.; Abo Bakr, H.M.; Ziedan, I.; Shaalan, K. Hybrid Named Entity Recognition—Application to Arabic Language. In Proceedings of the 9th International Conference on Computer Engineering & Systems (ICCES), Cairo, Egypt, 22–23 December 2014; pp. 80–85. [Google Scholar] [CrossRef]
  21. Dias, M.; Boné, J.; Ferreira, J.C.; Ribeiro, R.; Maia, R. Named Entity Recognition for Sensitive Data Discovery in Portuguese. Appl. Sci. 2020, 10, 2303. [Google Scholar] [CrossRef]
  22. Szawerna, M.; Dobnik, S.; Sánchez, R.; Tiedemann, T.; Volodina, E. Detecting Personal Identifiable Information in Swedish Learner Essays. In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization, St. Julian’s, Malta, 17–22 March 2024; pp. 54–63. Available online: https://aclanthology.org/2024.caldpseudo-1.7 (accessed on 6 December 2024).
  23. Kužina, V.; Petric, A.; Barišić, M.; Jović, A. CASSED: Context-based Approach for Structured Sensitive Data Detection. Expert Syst. Appl. 2023, 223, 119924. [Google Scholar] [CrossRef]
  24. Akbik, A.; Bergmann, T.; Blythe, D.; Rasul, K.; Schweter, S.; Vollgraf, R. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 54–59. [Google Scholar]
  25. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 8–14 December 2019; pp. 8026–8037. [Google Scholar]
  26. Kaggle; Kužina, V.; Petric, A.; Barišić, M.; Jović, A. DeSSI Dataset for Structured Sensitive Information. Available online: https://www.kaggle.com/datasets/sensitivedetection/dessi-dataset-for-structured-sensitive-information (accessed on 12 December 2023).
  27. Kutbi, M. Named Entity Recognition Utilized to Enhance Text Classification While Preserving Privacy. IEEE Access 2023, 11, 117576–117581. [Google Scholar] [CrossRef]
  28. Sun, X.; Liu, G.; He, Z.; Li, H.; Li, X. DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts. arXiv 2024, arXiv:2408.08930. Available online: https://arxiv.org/abs/2408.08930 (accessed on 17 November 2024).
  29. Tedeschi, S.; Navigli, R. MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation). In Proceedings of the Findings of the Association for Computational Linguistics: NAACL, Seattle, WA, USA, 10–15 July 2022; pp. 801–812. [Google Scholar] [CrossRef]
  30. Tedeschi, S.; Maiorca, V.; Campolungo, N.; Cecconi, F.; Navigli, R. WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 2521–2533. [Google Scholar] [CrossRef]
  31. Ikhwantri, F. Cross-Lingual Transfer for Distantly Supervised and Low-resources Indonesian NER. arXiv 2019, arXiv:1907.11158. [Google Scholar] [CrossRef]
  32. Marcińczuk, M. CLARIN-PL Digital Repository—CEN. 2007. Available online: http://hdl.handle.net/11321/6 (accessed on 13 June 2024).
  33. Degórski, Ł.; Przepiórkowski, A. Recznie znakowany milionowy podkorpus NKJP. In Narodowy Korpus Języka Polskiego; Cierkonska, J., Ed.; Naukowe PWN: Warsaw, Poland, 2012; pp. 51–57. [Google Scholar]
  34. Broda, B.; Marcińczuk, M.; Maziarz, M.; Radziszewski, A.; Wardyński, A. KPWr: Towards a Free Corpus of Polish. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, Istanbul, Turkey, 23–25 May 2012; pp. 3218–3222. [Google Scholar]
  35. King, D.E. Dlib-ml: A Machine Learning Toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758. [Google Scholar]
  36. Dhillon, P.; Rodu, J. Two Step CCA: A New Spectral Method for Estimating Vector Models of Words. 2012. Available online: https://arxiv.org/abs/1206.6403 (accessed on 3 September 2024).
  37. Żarski, T. Trained MITIE Model for the Polish Language Published on Hugging Face. Available online: https://huggingface.co/tymzar/mitie-polish (accessed on 1 October 2024).
  38. Nivre, J.; Zeman, D.; Ginter, F.; Tyers, F. Universal Dependencies. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts, Valencia, Spain, 3–7 April 2017. [Google Scholar]
  39. Chat Completion API Reference. Available online: https://platform.openai.com/docs/api-reference/chat/create (accessed on 5 May 2024).
  40. spaCy API Documentation; Explosion AI. Parser Architectures. Available online: https://spacy.io/api/architectures#parser (accessed on 8 December 2024).
  41. Biewald, L. Experiment Tracking with Weights and Biases. 2020. Available online: https://www.wandb.com/ (accessed on 12 September 2024).
  42. Parameter Importance Documentation. Available online: https://docs.wandb.ai/guides/app/features/panels/parameter-importance (accessed on 15 November 2024).
  43. Hugging Face—WiktorS/Polish-News Dataset. Available online: https://huggingface.co/datasets/WiktorS/polish-news (accessed on 5 September 2024).
  44. National Institute of Standards and Technology. Secure Hash Standard (SHS); Federal Information Processing Standards Publication 180-4; U.S. Department of Commerce: Gaithersburg, MD, USA, 2015. Available online: https://csrc.nist.gov/pubs/fips/180-4/upd1/final (accessed on 22 June 2024).
Figure 1. Processing flow and visualization of system components.
Figure 2. Display of parallel coordinates with resulting F1 score of the run.
Figure 3. Detection page with selected MITIE NER model.
Figure 4. Detection page displaying (a) the statistics and (b) results of NER of the provided text, (c) the results of the anonymization process models, and (d) summarized text with received previously anonymized personal information.
Figure 5. Confusion matrix of the SVM MITIE model.
Figure 6. Confusion matrix of the CNN spaCy model.
Table 1. Comparison of methods for sensitive data detection and processing.
| No. | Paper | Problem | Approach |
|-----|-------|---------|----------|
| 1 | [12] | PII detection and anonymization in text | CNN (spaCy), rule-based layer, and LLM |
| 2 | [21] | PII detection in text | Bidirectional LSTM and rule-based layer |
| 3 | [22] | PII detection in text | KB-BERT |
| 4 | [23] | PII detection in structured data | BERT and rule-based layer |
| 5 | [27] | PII detection and removal in text | CNN (spaCy) and rule-based layer |
| 6 | [28] | PII detection and anonymization in text | LLM (detection) and adversarial generative desensitization |
| 7 | This article | PII detection and (de-)anonymization in text | Two-Step CCA (MITIE), SVM, rule-based layer, and LLM (desensitization) |
Table 2. Entity counts across final dataset splits.
| Split | ORGANIZATION | PERSON | LOCATION | MONEY | ORIGIN |
|-------|--------------|--------|----------|-------|--------|
| Test | 1685 | 1168 | 1517 | 120 | 53 |
| Validation | 1853 | 1054 | 1342 | 122 | 66 |
| Training | 11,231 | 7206 | 8902 | 683 | 424 |
Table 3. Performance metrics by category for SVM and CNN models.
| Metric | MITIE | spaCy |
|--------|-------|-------|
| Accuracy | 0.9476 | 0.8040 |
| Recall | 0.8631 | 0.4936 |
| Precision | 0.8861 | 0.5248 |
| F1 score B-ORGANIZATION | 0.8791 | 0.7859 |
| F1 score I-ORGANIZATION | 0.8928 | 0.7747 |
| F1 score B-PERSON | 0.9428 | 0.8431 |
| F1 score I-PERSON | 0.9501 | 0.8965 |
| F1 score B-LOCATION | 0.9393 | 0.8581 |
| F1 score I-LOCATION | 0.8579 | 0.6916 |
| F1 score B-ORIGIN | 0.9184 | 0.7547 |
| F1 score I-ORIGIN | 1.0 | 0.0 |
| F1 score B-LINK | 0.7647 | 0.7647 |
| F1 score B-EMAIL | 1.0 | 1.0 |
Bold values indicate the best performance for each metric.
Table 4. Statistics of anonymization processing by the LLM.
| Processing Result | Occurrences | Description |
|-------------------|-------------|-------------|
| Correctly resolved responses | 9458 | Responses in the correct format, including the processed text and an entity hash map containing the unique entries from the initially anonymized paragraph. |
| Incorrect responses of valid format with incomplete hash map | 537 | Responses that could be parsed successfully but in which some anonymized entities were absent from the entity map. |
| Incorrect responses with invalid format | 5 | Responses that could not be parsed, either due to a generation refusal or an incorrect JSON object schema. |
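The response categories in Table 4 can be illustrated with a hypothetical parsed response; the field names and placeholder format below are assumptions, as the exact schema is not published. A response counts as correct only when it parses as JSON and every placeholder in the text has an entry in the entity map.

```python
import json

# Hypothetical example of a valid anonymization response from the LLM.
raw = """
{
  "text": "[PERSON_0] met [PERSON_1] in [LOCATION_0].",
  "entities": {
    "[PERSON_0]": "Anna Nowak",
    "[PERSON_1]": "Jan Kowalski",
    "[LOCATION_0]": "Warszawa"
  }
}
"""
response = json.loads(raw)

# Completeness check: every mapped placeholder must occur in the text
# (responses failing this correspond to the 537 incomplete hash maps).
complete = all(p in response["text"] for p in response["entities"])
```

A `json.JSONDecodeError` raised by `json.loads` would correspond to the invalid-format category, while `complete == False` flags an incomplete hash map.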
Table 5. Statistics of the content transformation processing by the LLM.
| Processing Result | Occurrences | Description |
|-------------------|-------------|-------------|
| Correctly paraphrased responses | 9457 | News headlines that were paraphrased and whose hash differs from that of the base version. |
| Incorrect responses with unchanged content | 1 | News headlines whose hashes were identical after processing. |
Table 6. Statistics of the de-anonymization processing by the LLM.
| Processing Result | Occurrences | Description |
|-------------------|-------------|-------------|
| Correctly de-anonymized responses | 9323 | News headlines containing all entities from the entity map after de-anonymization. |
| Incorrect responses with missing or reduced entities | 127 | News headlines that, after de-anonymization, were missing entities stored in the entity map or removed in the paraphrasing stage. |
| Incorrect responses with invalid format | 7 | Responses that could not be parsed as a string due to a generation refusal or a malformed JSON object. |
Table 7. Success results of the overall performance and individual steps.
| Step | Success Rate | Note |
|------|--------------|------|
| Input anonymization | 94.58% | With respect to 10,000 articles. |
| Paraphrasing | 99.98% | With respect to correctly anonymized inputs. |
| Input de-anonymization | 98.58% | With respect to correctly paraphrased news headlines. |
| Overall | 93.23% | With respect to 10,000 articles. |

