1. Introduction
The exponential growth of online activity has intensified concerns over user data security and privacy. The vast volume of data generated and collected, combined with rapid technological advancement, introduces new attack vectors and technical oversights. The emergence of the latest artificial intelligence (AI) tools has increased the temptation to process data and solve problems quickly, and this urgency often leads users to overlook the risk of leaking sensitive personal data in exchange for time savings. To address these risks, measures such as limited data retention have been introduced to prevent AI systems from unintentionally learning sensitive information. This remains a concern even under OpenAI's Zero Data Retention policy: while OpenAI, in some cases, does not use these data to train models, it may retain them briefly to ensure that user interactions align with the intended use and are not harmful [1]. In 2023, 64.4% of the global population (5.16 billion people) were active Internet users, a 1.9% increase over 2022 [2]. An estimated 149 zettabytes of data were generated [3], underscoring the immense scale of user-generated content (UGC) [4], whose main sources were the following:
Email services: 333.2B emails sent daily;
Instant messaging: 41.7M WhatsApp messages and 23.04B SMS texts sent daily;
Online social networks (OSN): X (650M daily tweets), Instagram (67.3M daily captions), and Facebook (734.4M daily comments and 421.9M daily status updates);
AI tools and services [5]: OpenAI's ChatGPT (1.5B monthly visits) and CharacterAI (318.8M unique monthly visitors).
The textual data sources mentioned above, among others, are central to current AI and digitization trends and showcase versatile applications, ranging from commercial companies leveraging email content in prompts for large language model (LLM) and Retrieval Augmented Generation (RAG) frameworks to private users transforming or extracting insights from their OSN messages or posts. However, as the reliance on textual data grows, so do concerns about personal data privacy. This has led to the development of various anonymization tools and techniques to protect sensitive information in textual data. Existing solutions are based on frameworks like Stanford's CoreNLP [6] or spaCy [7], which primarily rely on machine learning (ML) named entity recognition (NER) approaches and traditional natural language processing (NLP) techniques to detect personally identifiable information (PII). Beyond NER, alternative methods such as differential privacy [8] and homomorphic encryption [9] have gained traction in recent research. Techniques such as data masking, k-anonymity [10], l-diversity [11], and fictional LLM data generation [12] continue to evolve, offering additional privacy protection by altering or generalizing data to prevent re-identification.
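As a brief illustration of the ML-based NER detection that such frameworks provide, the following minimal sketch flags PII candidates with an off-the-shelf spaCy pipeline; the model name, the sensitive label set, and the example sentence are illustrative assumptions rather than the configuration evaluated in this work.

```python
# Minimal sketch of ML-based NER for PII detection with spaCy.
# Assumes `pip install spacy` and `python -m spacy download pl_core_news_sm`.
# The model name and the sensitive label set are illustrative assumptions.
import spacy

nlp = spacy.load("pl_core_news_sm")  # small pre-trained Polish pipeline

# Labels treated as PII in this sketch (Polish spaCy models use NKJP-style labels).
SENSITIVE_LABELS = {"persName", "orgName", "placeName", "geogName"}

def detect_pii(text: str):
    """Return (surface form, label, start, end) for entities treated as PII."""
    doc = nlp(text)
    return [(ent.text, ent.label_, ent.start_char, ent.end_char)
            for ent in doc.ents if ent.label_ in SENSITIVE_LABELS]

if __name__ == "__main__":
    sample = "Jan Kowalski pracuje w firmie Orlen w Warszawie."
    for surface, label, start, end in detect_pii(sample):
        print(f"{label:<10} {surface!r} [{start}:{end}]")
```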
In this paper, we rely on the definitions of personal and sensitive data established by the European Parliament and the Council of the European Union in the General Data Protection Regulation (GDPR) [13]. In essence, any set of information that, when combined, can identify an individual is considered personal or sensitive data. It is important to note that even if personal information has been de-identified, encrypted, or pseudonymized, it remains protected under the GDPR if it can be used to re-identify a living person.
This research addresses these challenges by developing an efficient and adaptable conceptual solution that achieves higher detection accuracy while preserving the semantic integrity of the data. Unlike traditional solutions, CleanText leverages bespoke algorithms and uses LLMs to handle unstructured data effectively, ensuring that the anonymized text remains valuable for downstream applications. We envision the ideal application of our work as either a solution tailored for the end user or a highly customizable, independent platform in which the algorithms operate in the back end as a proxy for data processing. This approach is advantageous when using online services that lack on-premise deployment options, have terms of service that do not guarantee data security, or are prohibitively expensive for enterprise use. By incorporating a comprehensive suite of privacy-preserving techniques, CleanText addresses these limitations and offers a solution for data (de-)anonymization. The main contributions of this paper can be summarized as follows:
Training and tuning of models from the chosen toolkit libraries (alongside publishing of the Polish MITIE embedding model);
Evaluation of created models;
Design and implementation of algorithms for data (de-)anonymization;
Testing of the effectiveness of the complete system with a selected ML model in a possible real-life use case.
4. Discussion
With the advancement of AI and digitalization, users’ data are increasingly exposed to significant risks, prompting the development of enhanced privacy protection solutions, such as the DuckDuckGo browser. Evidence from the adoption of existing tools demonstrates a growing demand for such measures, as users actively seek technologies that provide a greater sense of security.
This study proposes a comprehensive end-to-end solution for adding a layer of privacy to users' textual data. CleanText analyzes text with a hybrid approach that combines ML-based and rule-based NER, removes sensitive data through anonymization, and finally restores the previously masked entities completely, functioning as a proxy. The solution emphasizes essential features that create a flexible and generic framework.
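As a rough illustration of this proxy pattern (not the exact CleanText implementation), the sketch below uses a generic detect function standing in for the hybrid NER step and a call_llm stub standing in for the external service; the placeholder format is a hypothetical choice.

```python
# Rough sketch of the anonymize -> process -> restore proxy pattern.
# `detect` stands in for the hybrid ML/rule-based NER step and `call_llm`
# for the external service; both are hypothetical placeholders.
from typing import Callable, Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)

def anonymize(text: str, entities: List[Entity]) -> Tuple[str, Dict[str, str]]:
    """Replace detected entities with indexed placeholders; keep a mapping for restoration."""
    mapping: Dict[str, str] = {}
    parts, cursor = [], 0
    for i, (start, end, label) in enumerate(sorted(entities)):
        placeholder = f"[{label}_{i}]"
        mapping[placeholder] = text[start:end]
        parts.append(text[cursor:start])
        parts.append(placeholder)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), mapping

def deanonymize(text: str, mapping: Dict[str, str]) -> str:
    """Restore the original entities in the (possibly transformed) text."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

def proxy(text: str,
          detect: Callable[[str], List[Entity]],
          call_llm: Callable[[str], str]) -> str:
    masked, mapping = anonymize(text, detect(text))
    processed = call_llm(masked)  # only the anonymized text leaves the proxy
    return deanonymize(processed, mapping)

# Example with trivial stand-ins for the NER model and the LLM call:
if __name__ == "__main__":
    detect = lambda t: [(0, 12, "PERSON")]   # pretend NER found "Jan Kowalski"
    call_llm = lambda t: t.upper()           # pretend LLM transformation
    print(proxy("Jan Kowalski mieszka w Gdańsku.", detect, call_llm))
```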
Figure 5 and Figure 6 present the confusion matrices of both NER models. The analysis of the figures shows high precision in classifying rule-based entities, which was expected given the predictable nature of these classes. Regarding the ML predictions, the most consistent results across the two models were obtained for the ORIGIN category, likely due to its low correlation with other categories. However, some contextual overlap created minor challenges in distinguishing ORIGIN from LOCATION. Additionally, the ORGANIZATION and PERSON classes clash when company names include personal names. Similarly, ORGANIZATION is misclassified as LOCATION or PERSON when organizations are named after cities or their founders, such as "Sejm Warszawski". To address these overlapping classes, we propose fine-tuning the NER model on an enriched training dataset that includes more examples of ambiguous entities in each class; this ensures that the model is well equipped to handle a wide range of scenarios. Alternatively, post-processing rules can be applied to correct the labels of annotated entities. Combining this information with the evaluation results in Table 3, which include the F1-score and accuracy metrics, together with a complete reading of the confusion matrices, we concluded that MITIE is the best candidate for the default system model and the subsequent tests. MITIE's advantages in the F1-score and overall accuracy (Table 3), as well as its less ambiguous and more consistent behavior (Figure 5), make it a compelling choice. In addition to these metrics, MITIE and the wordrep tool enable the expansion of multilingual support to morphologically rich languages, requiring only large amounts of unstructured text in the target language.
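The post-processing alternative mentioned above could take the following form, where a small keyword list relabels entities such as "Sejm Warszawski"; the keywords, labels, and rule are illustrative assumptions, not the rules used in our system.

```python
# Hypothetical post-processing rule for overlapping classes: relabel entities whose
# surface form contains organization keywords, reducing ORGANIZATION/PERSON/LOCATION
# clashes. The keyword list and labels are illustrative only.
ORG_KEYWORDS = {"sejm", "ministerstwo", "urząd", "uniwersytet", "sp. z o.o."}

def postprocess(entities):
    """entities: list of dicts with 'text' and 'label'; returns relabelled copies."""
    fixed = []
    for ent in entities:
        label = ent["label"]
        lowered = ent["text"].lower()
        if label in {"PERSON", "LOCATION"} and any(k in lowered for k in ORG_KEYWORDS):
            label = "ORGANIZATION"  # e.g. "Sejm Warszawski" initially tagged LOCATION
        fixed.append({**ent, "label": label})
    return fixed

print(postprocess([{"text": "Sejm Warszawski", "label": "LOCATION"}]))
# [{'text': 'Sejm Warszawski', 'label': 'ORGANIZATION'}]
```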
The results of the CleanText pipeline performance tests are collected in Table 7. The analyzed texts differed in complexity and length, with a minimum requirement of 35 words and at least five entities present. The designed algorithms performed almost flawlessly, reaching 93.23% on the chosen dataset and showing great potential in anonymization, with a correctness score of 94.58% and a de-anonymization level of 98.58% across various scenarios. On the other hand, it is worth noting that more complex examples and the use of a smaller LLM could affect the results of the anonymization step, which may explain the gap of roughly five percentage points in the success rate. After manually reviewing the failed de-anonymization examples, we concluded that samples containing long or adjacent entities of the same type were often incorrectly retrieved, primarily because the changed context later influenced the LLM's text summarization. According to the system results mentioned above, the main reason for these failures is the lack of clear boundaries between instances tagged with the inner and outer tags when the NER model detects entities. Hence, we again note that it is essential to optimize the NER model on larger and more diverse IOB-labeled datasets to achieve more precise detection and better instruct the LLM on replacing sensitive information.
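To illustrate the boundary problem, the sketch below decodes IOB tags into entity spans and shows how a missing B- tag merges two adjacent entities of the same type into a single span, which later breaks placeholder restoration; the tokens and tag sequences are invented for demonstration.

```python
# Sketch: decoding IOB tags into spans. If a model emits "I-PERSON" where a new
# entity should start with "B-PERSON", two adjacent people collapse into one span.
def iob_to_spans(tokens, tags):
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], [tok]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

tokens = ["Jan", "Kowalski", "Anna", "Nowak"]
print(iob_to_spans(tokens, ["B-PERSON", "I-PERSON", "B-PERSON", "I-PERSON"]))
# [('PERSON', 'Jan Kowalski'), ('PERSON', 'Anna Nowak')]
print(iob_to_spans(tokens, ["B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON"]))
# [('PERSON', 'Jan Kowalski Anna Nowak')]  <- boundary lost, one merged entity
```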
Nevertheless, this study's primary objective was to introduce ways to increase privacy while preserving the context of the text, and this objective was fulfilled. Although the approach may not be as secure as an on-premise solution provided by some service providers, it is an excellent alternative in situations where the service or tool provider does not offer such an option.
Author Contributions
Conceptualization, A.J. and T.L.Ż.; methodology, T.L.Ż.; software, T.L.Ż.; validation, T.L.Ż.; formal analysis, T.L.Ż.; investigation, T.L.Ż.; resources, T.L.Ż.; data curation, T.L.Ż.; writing—original draft preparation, T.L.Ż.; writing—review and editing, A.J. and T.L.Ż.; visualization, T.L.Ż.; supervision, A.J. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Data Availability Statement
The dataset used in this study combines the CEN [32] and KPWR [34] datasets, which were selected for their comprehensive and balanced coverage of the required categories. Both datasets were remapped to a unified label set, and label density was evaluated to ensure a well-distributed representation across all entity classes. A Jupyter notebook with a simplified implementation of the algorithms used in this study is available upon request. The trained model used during our research experiments has been published and is available at https://huggingface.co/tymzar/mitie-polish (accessed on 1 October 2024).
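For readers reproducing the dataset preparation, the following sketch outlines the general idea of remapping source-specific labels to a unified set and measuring per-class label density; the mapping shown is a hypothetical example rather than the exact mapping applied to CEN and KPWR.

```python
# Illustrative sketch of the dataset preparation described above: remap source-specific
# IOB labels to a unified set and compute per-class label density.
# The mapping below is a hypothetical example, not the exact mapping used in this study.
from collections import Counter

LABEL_MAP = {
    "nam_liv_person": "PERSON",
    "nam_org_company": "ORGANIZATION",
    "nam_loc_gpe_city": "LOCATION",
}

def remap(tags):
    """Remap IOB tags like 'B-nam_liv_person' to the unified label set."""
    out = []
    for tag in tags:
        if tag == "O":
            out.append(tag)
        else:
            prefix, label = tag.split("-", 1)
            out.append(f"{prefix}-{LABEL_MAP.get(label, label)}")
    return out

def label_density(all_tags):
    """Fraction of tokens carrying each unified label."""
    counts = Counter(t.split("-", 1)[1] for tags in all_tags for t in tags if t != "O")
    total = sum(len(tags) for tags in all_tags)
    return {label: n / total for label, n in counts.items()}

sentences = [remap(["B-nam_liv_person", "I-nam_liv_person", "O", "B-nam_org_company"])]
print(label_density(sentences))  # {'PERSON': 0.5, 'ORGANIZATION': 0.25}
```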
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
AI | Artificial Intelligence |
API | Application Programming Interface |
CNN | Convolutional Neural Network |
GDPR | General Data Protection Regulation |
IOB | Inside–outside–beginning |
JSON | JavaScript Object Notation |
LLM | Large Language Model |
NER | Named Entity Recognition |
NLP | Natural Language Processing |
OSN | Online Social Network |
PII | Personally Identifiable Information |
RAG | Retrieval Augmented Generation |
SVM | Support Vector Machine |
UGC | User-generated Content |
wandb | Weights and Biases |
References
- Thakur, K. OpenAI Data Policies: Data Usage and Retention Period. DreamInForce. 2024. Available online: https://www.dreaminforce.com/openai-data-policies-data-usage-retention/ (accessed on 6 December 2024).
- We Are Social; Meltwater. Digital 2023 Global Overview Report. Available online: https://datareportal.com/reports/digital-2023-global-overview-report (accessed on 15 May 2023).
- Berisha, B.; Mëziu, E.; Shabani, I. Big data analytics in Cloud computing: An overview. J. Cloud Comp. 2022, 11, 24. [Google Scholar] [CrossRef]
- Rayaprolu, A. How Much Data Is Created Every Day in 2023. Available online: https://techjury.net/blog/how-much-data-is-created-every-day (accessed on 16 November 2023).
- Sarkar, S. AI Industry Analysis: 50 Most Visited AI Tools and Their 24B+ Traffic Behavior. Available online: https://writerbuddy.ai/blog/ai-industry-analysis (accessed on 9 February 2024).
- Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014; pp. 55–60. [Google Scholar]
- Honnibal, M.; Montani, I.; Van Landeghem, S.; Boyd, A. spaCy: Industrial-Strength Natural Language Processing in Python. 2020. Available online: https://zenodo.org/records/10009823 (accessed on 2 October 2024).
- Dwork, C.; Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar] [CrossRef]
- Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Advances in Cryptology—ASIACRYPT 2017; Springer: Cham, Switzerland, 2017; pp. 409–437. [Google Scholar] [CrossRef]
- El Emam, K.; Dankar, F.K. Protecting privacy using k-anonymity. J. Am. Med. Inform. Assoc. 2008, 15, 627–637. [Google Scholar] [CrossRef]
- Li, T.; Li, N.; Cao, J.; Yang, W. Slicing: A New Approach for Privacy Preserving Data Publishing. IEEE Trans. Knowl. Data Eng. 2018, 24, 561–574. [Google Scholar] [CrossRef]
- Böhlin, F. Detection & Anonymization of Sensitive Information in Text: AI-Driven Solution for Anonymization. Bachelor’s Thesis, Linnaeus University, Sweden, 2024. Available online: https://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-131672 (accessed on 14 November 2024).
- European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council. Off. J. Eur. Union 2016, 119, 1–88. Available online: https://data.europa.eu/eli/reg/2016/679/oj (accessed on 6 December 2024).
- Madan, A.; George, A.M.; Singh, A.; Bhatia, M.P.S. Redaction of Protected Health Information in EHRs using CRFs and Bi-directional LSTMs. In Proceedings of the 7th International Conference on Reliability, Infocom Technologies and Optimization, Noida, India, 29–31 August 2018. [Google Scholar] [CrossRef]
- Neamatullah, I.; Douglass, M.M.; Lehman, L.W.; Reisner, A.; Villarroel, M.; Long, W.J.; Szolovits, P.; Moody, G.B.; Mark, R.G.; Clifford, G.D. Automated de-identification of free-text medical records. BMC Med. Inform. Decis. Mak. 2008, 8, 32. [Google Scholar] [CrossRef] [PubMed]
- Chen, T.; Cullen, R.M.; Godwin, M. Hidden Markov model using Dirichlet process for de-identification. J. Biomed. Inform. 2015, 58, S60–S66. [Google Scholar] [CrossRef]
- Shweta; Kumar, A.; Ekbal, A.; Saha, S.; Bhattacharyya, P. A Recurrent Neural Network Architecture for De-identifying Clinical Records. In Proceedings of the 13th International Conference on Natural Language Processing, Varanasi, India, 17–20 December 2016; Available online: https://aclanthology.org/W16-6325 (accessed on 30 October 2024).
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Holla, A.; Gaind, B.; Katta, V.R.; Kundu, A.; Kamalesh, S. Hybrid NER System for Multi-Source Offer Feeds. arXiv 2019, arXiv:1901.08406. [Google Scholar] [CrossRef]
- Meselhi, M.A.; Abo Bakr, H.M.; Ziedan, I.; Shaalan, K. Hybrid Named Entity Recognition—Application to Arabic Language. In Proceedings of the 9th International Conference on Computer Engineering & Systems (ICCES), Cairo, Egypt, 22–23 December 2014; pp. 80–85. [Google Scholar] [CrossRef]
- Dias, M.; Boné, J.; Ferreira, J.C.; Ribeiro, R.; Maia, R. Named Entity Recognition for Sensitive Data Discovery in Portuguese. Appl. Sci. 2020, 10, 2303. [Google Scholar] [CrossRef]
- Szawerna, M.; Dobnik, S.; Sánchez, R.; Tiedemann, T.; Volodina, E. Detecting Personal Identifiable Information in Swedish Learner Essays. In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization, St. Julian’s, Malta, 17–22 March 2024; pp. 54–63. Available online: https://aclanthology.org/2024.caldpseudo-1.7 (accessed on 6 December 2024).
- Kužina, V.; Petric, A.; Barišić, M.; Jović, A. CASSED: Context-based Approach for Structured Sensitive Data Detection. Expert Syst. Appl. 2023, 223, 119924. [Google Scholar] [CrossRef]
- Akbik, A.; Bergmann, T.; Blythe, D.; Rasul, K.; Schweter, S.; Vollgraf, R. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 54–59. [Google Scholar]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 8–14 December 2019; pp. 8026–8037. [Google Scholar]
- Kužina, V.; Petric, A.; Barišić, M.; Jović, A. DeSSI Dataset for Structured Sensitive Information. Kaggle. Available online: https://www.kaggle.com/datasets/sensitivedetection/dessi-dataset-for-structured-sensitive-information (accessed on 12 December 2023).
- Kutbi, M. Named Entity Recognition Utilized to Enhance Text Classification While Preserving Privacy. IEEE Access 2023, 11, 117576–117581. [Google Scholar] [CrossRef]
- Sun, X.; Liu, G.; He, Z.; Li, H.; Li, X. DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts. arXiv 2024, arXiv:2408.08930. Available online: https://arxiv.org/abs/2408.08930 (accessed on 17 November 2024).
- Tedeschi, S.; Navigli, R. MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation). In Proceedings of the Findings of the Association for Computational Linguistics: NAACL, Seattle, WA, USA, 10–15 July 2022; pp. 801–812. [Google Scholar] [CrossRef]
- Tedeschi, S.; Maiorca, V.; Campolungo, N.; Cecconi, F.; Navigli, R. WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 2521–2533. [Google Scholar] [CrossRef]
- Ikhwantri, F. Cross-Lingual Transfer for Distantly Supervised and Low-resources Indonesian NER. arXiv 2019, arXiv:1907.11158. [Google Scholar] [CrossRef]
- Marcińczuk, M. CLARIN-PL Digital Repository—CEN. 2007. Available online: http://hdl.handle.net/11321/6 (accessed on 13 June 2024).
- Degórski, Ł.; Przepiórkowski, A. Ręcznie znakowany milionowy podkorpus NKJP. In Narodowy Korpus Języka Polskiego; Cierkonska, J., Ed.; Naukowe PWN: Warsaw, Poland, 2012; pp. 51–57. [Google Scholar]
- Broda, B.; Marcińczuk, M.; Maziarz, M.; Radziszewski, A.; Wardyński, A. KPWr: Towards a Free Corpus of Polish. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, Istanbul, Turkey, 23–25 May 2012; pp. 3218–3222. [Google Scholar]
- King, D.E. Dlib-ml: A Machine Learning Toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758. [Google Scholar]
- Dhillon, P.; Rodu, J. Two Step CCA: A New Spectral Method for Estimating Vector Models of Words. 2012. Available online: https://arxiv.org/abs/1206.6403 (accessed on 3 September 2024).
- Żarski, T. Trained MITIE Model for the Polish Language Published on Hugging Face. Available online: https://huggingface.co/tymzar/mitie-polish (accessed on 1 October 2024).
- Nivre, J.; Zeman, D.; Ginter, F.; Tyers, F.M. Universal Dependencies. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts, Valencia, Spain, 3–7 April 2017. [Google Scholar]
- Chat Completion API Reference. Available online: https://platform.openai.com/docs/api-reference/chat/create (accessed on 5 May 2024).
- spaCy API Documentation; Explosion AI. Parser Architectures. Available online: https://spacy.io/api/architectures#parser (accessed on 8 December 2024).
- Biewald, L. Experiment Tracking with Weights and Biases. 2020. Available online: https://www.wandb.com/ (accessed on 12 September 2024).
- Parameter Importance Documentation. Available online: https://docs.wandb.ai/guides/app/features/panels/parameter-importance (accessed on 15 November 2024).
- Hugging Face—WiktorS/Polish-News Dataset. Available online: https://huggingface.co/datasets/WiktorS/polish-news (accessed on 5 September 2024).
- National Institute of Standards and Technology. Secure Hash Standard (SHS); Federal Information Processing Standards Publication 180-4; U.S. Department of Commerce: Gaithersburg, MD, USA, 2015. Available online: https://csrc.nist.gov/pubs/fips/180-4/upd1/final (accessed on 22 June 2024).
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).