1. Introduction
The data revolution has profoundly transformed the global environment, a phenomenon particularly underscored by the introduction of electronic chips in the late 1960s. This technological milestone caused a paradigm shift in data management techniques and laid the foundations for the eventual transition in several industries, including healthcare, from analog documentation to digital records [1]. Electronic health records (EHRs) have emerged as a vast compendium of health-related data, compiled from many sources, including clinics, hospitals, labs, and other healthcare facilities. This development improved clinical workflows and patient care delivery by radically changing how medical information is stored and retrieved [2].
Advancements in the Internet and information technology have allowed health records to be integrated and made more easily accessible to authorized users, beyond traditional boundaries [
3]. This fluid sharing of information facilitates a comprehensive understanding of a patient’s medical history, treatment plans, and results, encouraging cooperation amongst medical providers and improving the standard of patient care. In addition, the integration of health records into healthcare systems promotes evidence-based decision-making and improves administrative effectiveness, which in turn raises the standard and effectiveness of medical care [
4]. Additionally, this integrated approach to health data has the potential to advance medical research, make customized treatment options possible, and stimulate innovations that could have a significant impact on the efficacy and outcomes of healthcare.
The adoption of EHRs in place of traditional paper-based health records has completely changed the way data are accessed, integrated, and shared among healthcare providers, offering many advantages, such as better clinical decision-making, increased care coordination, and reduced administrative responsibilities [
5]. These electronic databases include patient demographics, medical histories, the results of diagnostic tests, treatment plans, and progress notes, providing an in-depth understanding of a patient’s status [
6]. This extensive and interconnected data ecosystem empowers healthcare providers to make informed decisions and deliver tailored, effective care. The ongoing progress of EHR technology and data interoperability provides substantial opportunities for improving research, public health initiatives, and healthcare delivery and enhancement of global healthcare [
7].
Health-related data are sensitive and critical; therefore, protecting them is essential. The legal framework, represented by the Health Insurance Portability and Accountability Act (HIPAA), requires the safeguarding of patient privacy to avoid harm and uphold patients’ rights [
8]. Unauthorized access to or disclosure of health information may have serious repercussions, such as discrimination, identity theft, and reduced access to healthcare services. Securing patient data and maintaining patient–provider confidence requires strong security measures such as encryption, access controls, and regular audits [
9]. Patient notes stored in electronic health records may contain vital information for medical investigations. Most medical investigators can only access de-identified notes to protect patients’ confidentiality. The Health Insurance Portability and Accountability Act of 1996 in the United States defines 18 types of Protected Health Information that must be removed to de-identify patient notes [
10]. De-identification can be performed using automated or manual procedures. Experts must identify and label PHI to perform manual de-identification; however, this approach is restricted by the number of people who are allowed access to identified patient notes and the possibility of human error. Automated de-identification systems, on the other hand, make use of rule-based or machine-learning techniques. Typically, rule-based systems depend on human-defined patterns expressed as lexicons and regular expressions [
11].
Our research highlights the importance of sophisticated natural language processing (NLP) techniques in safeguarding health information and offers an effective approach for de-identifying EHRs in compliance with privacy laws.
The primary aim of this research paper is to provide a strong framework for de-identifying sensitive data in EHRs by utilizing deep learning and machine learning techniques. This research employs various algorithms to guarantee the secure extraction of personal patient data while concurrently upholding the integrity of essential clinical information. The dataset utilized for this research is sourced from Harvard University, specifically designated for academic users, under the Data Use and Confidentiality Agreement established with Partners HealthCare System, Inc. This agreement confers access to de-identified patient discharge summaries for research purposes, with a particular focus on the enhancement of natural language processing (NLP) techniques within the healthcare domain. This research investigates the following pivotal inquiries:
How much have machine learning and deep learning algorithms contributed to the security of EHRs?
Is the proposed algorithm capable of effectively removing sensitive patient data from a large dataset?
What is the significance of the Conditional Random Field (CRF) model in safeguarding patient information, particularly concerning the 23 critical categories of Protected Health Information (PHI) present in EHRs?
How effective is the Bidirectional Long Short-Term Memory (Bi-LSTM) deep learning algorithm at improving de-identification in medical records?
The remainder of this paper is organized as follows:
Section 2 encompasses a comprehensive review of related works in E-health records privacy. It critically discusses and analyzes various approaches and experimental findings from these works, while
Section 3 outlines the research methodology,
Section 4 presents the results and discussion, and
Section 5 contains the conclusion.
2. Related Work
This section presents relevant literature on the protection of health records using machine learning and deep learning techniques.
A survey titled “Use of Electronic Health Records in Hospitals in the United States” presented comprehensive research covering American hospitals to quantify EHR adoption, achieving a response rate of 63.1 percent [12]. Only 1.5% of hospitals in the United States have a comprehensive electronic records system (i.e., one that is present in all clinical units), whereas 7.6% have a basic system (i.e., one that is present in at least one clinical unit). Only 17% of hospitals have automated provider–order entry for drugs in place. Larger hospitals are more likely to have EHRs [13]. Leevy et al. [14] employed CRF and Bi-LSTM layers, modeling the text with recurrent neural networks, the Conditional Random Field, and RNN-based techniques such as the LSTM algorithm. The study contains vital information that helps in the safe and secure preservation of patient privacy. Qin and Zeng [15] researched clinical named entity recognition using Bi-LSTM-CRF. The goal of their study was to use neural networks to extract clinical concepts: a Bi-LSTM model was used in conjunction with the CRF model to detect three named medical entities, with CBOW word embeddings trained on both in-domain and out-of-domain text. They used the i2b2 NER dataset and compared their results with other research. They achieved an F1-score of 0.85307 and highlighted specific concerns with clinical concepts that warrant deeper study.
Tang et al. [
16] de-identified clinical text using Bi-LSTM-CRF and neural language models. They framed de-identification as named entity recognition (NER), applying recent Bi-LSTM-CRF deep learning models to perform de-identification. They used two datasets: i2b2-2014 and CEGS N-GRID 2016. The goal of [17]’s work was to relieve the necessity of manual labeling or feature preparation. Instead, they used an automated procedure based on desensitization and sequence labeling. The data were then fed into a Bi-LSTM model, which applied word embeddings to each word to capture contextual information from both the forward and backward directions. An attention layer weighted semantically informative words from the surrounding text so that the model could pay more attention to them. A Conditional Random Field (CRF) model was then used to extract the entities. The results show the high performance of Bi-LSTM-CRF for entity extraction on Chinese EMRs. Hong et al. [18] proposed a tool called BBC-Radical that combines three models: BERT (Bidirectional Encoder Representations from Transformers), Bi-LSTM, and CRF. The tool incorporates features based on Chinese characters and their radical roots, and the authors showed how this approach improves the model’s representation of shared meanings and its overall performance.
A large amount of Chinese EMR data was processed in [19]; the study proposed a BERT-IDCNN-CRF model. In the methodology, the EMRs were first fed into BERT, the model was trained, and the relevant parameters were then fine-tuned on a manually annotated corpus. The experiments show a very high accuracy of 94.1%, recall of 93.8%, and F1-score of 94.5% using the BIOES (Begin, Inside, Outside, End, Single) tagging scheme in a supervised learning setting. Ming et al. [20] addressed the drawbacks of the LSTM model, which finds it difficult to take full advantage of GPU parallelism while processing substantial amounts of medical data. Furthermore, it was observed that word order and semantic information were ignored by ID-CNNs. The authors proposed an ID-CNNs-CRF model with attention mechanisms to address these problems. With F1-scores of 0.9455 and 0.9117, this model was able to effectively capture word order, attention, and local contextual features.
Using a Bi-LSTM-CRF model, Liu et al. [
21] concentrated on detecting named entities in biomedical data, using one CRF layer and two LSTM layers. The model achieved an F1-score of 0.872. This study used the TASS-2018 dataset and achieved first place in its sub-task. Liu et al. [
22] focused on identifying clinical quantitative data and connecting them to relevant entities. For this study, 1359 documents from a nearby general hospital’s petroleum department were examined. The results indicated the efficacy of the strategy with an F1-score of 0.9477 for recognizing quantitative information, and an accuracy of 0.9460 for associating entities with quantities. Shunli et al. [
23] sought to tackle the issue that existing systems struggle to effectively capture low-frequency terms through manually engineered features. The suggested approach involved applying a Bi-LSTM-CRF neural network to electronic medical records. The procedure started with CNNs (Convolutional Neural Networks) for encoding, then proceeded to the Bi-LSTM layer, and concluded with the CRF layer for entity extraction. The research utilized two openly accessible clinical datasets: the THYME corpus and the 2012 i2b2 dataset.
The key distinction between our research and previous studies lies in our dedicated approach to the design, optimization, and hyperparameter tuning of both the Bi-LSTM and CRF models. Unlike prior work, which typically combines Bi-LSTM and CRF into a single unified model, our research takes a more granular approach by treating each model individually. This allows for the more precise optimization of each model at every stage of development. We carefully designed the architecture of both deep learning models, placing particular emphasis on fine-tuning their hyperparameters to achieve optimal accuracy. This focused process of hyperparameter optimization, a critical yet under-explored aspect in the literature, plays a pivotal role in enhancing model performance.
Additionally, by conducting comprehensive evaluations across 23 distinct categories, we provide a detailed and independent analysis of each model’s performance. This thorough approach not only offers a more nuanced understanding of the models’ behaviors but also ensures that the achieved accuracy surpasses that of existing studies. Our rigorous and systematic optimization process, coupled with independent testing across diverse categories, addresses a significant gap in prior research where such meticulous architectural and performance comparisons are frequently overlooked.
3. Methodology
Electronic health records have completely revolutionized how medical data are recorded, accessed, and shared in the modern digital health environment. They enable more efficient data management, which enhances the information that healthcare professionals have access to and supports the integration of patient care in different healthcare settings. However, there are significant concerns regarding patient data security and privacy as EHRs expand. Deep learning, a field of artificial intelligence, has emerged as an effective technique for strengthening the security of sensitive EHRs. Bi-LSTM is a deep learning paradigm that builds upon the traditional LSTM architecture, making it highly effective in sequence modeling tasks. Bi-LSTM analyzes sequences bidirectionally, which allows it to identify contextual dependencies from the past and future more efficiently than a regular LSTM, which processes information in a single direction [
24]. This bidirectional feature leverages the complete context that is included in the data to enable the creation of more accurate predictions. The structure consists of interconnected layers of neurons, or cells, with the ability to store, change, and reset internal states [
25]. The network learns sequence patterns and correlations as data flow through these cells, making Bi-LSTM especially appropriate for time-series data, such as EHRs. The advantages of this technology include enhanced contextual sensitivity, real-time monitoring, and adaptability to changing patterns, among other attributes that raise its significance in enhancing security standards in healthcare systems [
26]. Healthcare institutions can effectively protect patient data from new threats by incorporating Bi-LSTM into security frameworks. This ensures patient privacy and preserves confidentiality in digital healthcare infrastructures.
Our proposed methodology integrates CRF and Bi-LSTM, as illustrated in
Figure 1, to accurately identify categories of Protected Health Information (PHI). CRF, a statistical model renowned for its ability to predict label sequences, is leveraged to enhance the performance of PHI identification. Central to our contribution is the careful design, optimization, and hyperparameter tuning of both models, which forms the backbone of our deep learning approach. Unlike conventional methods, we optimize each model individually, fine-tuning the hyperparameters to achieve the best possible performance. By combining these models and emphasizing rigorous model design and hyperparameter optimization, our approach aims to surpass traditional methods in accuracy, recall, precision, and overall effectiveness, offering a more robust solution for PHI categorization.
3.1. Data Extraction
In this research, we utilize the i2b2 (Informatics for Integrating Biology and the Bedside) dataset, obtained from the Department of Biomedical Informatics at Harvard Medical School [27,28,29]. The dataset is publicly accessible at https://n2c2.dbmi.hms.harvard.edu/data-sets (accessed on 10 December 2024), subject to a Data Use Agreement (DUA). The i2b2 dataset is a valuable resource for healthcare researchers. It is a collection of de-identified clinical records that include a variety of medical data, such as clinical notes, discharge summaries, and radiology results. The dataset is well known for its diversity and breadth, making it an excellent candidate for training NLP models, particularly in the context of clinical text processing and information extraction. The i2b2 dataset includes PHI categories, enabling researchers to address the difficulty of PHI identification while respecting patient privacy through data anonymization. Accessing and retrieving the dataset requires signing the DUA and registering. The i2b2 dataset consists of 21 compressed files, and all medical records are stored in XML file format.
3.2. Data Preprocessing
The data preprocessing stage includes three key steps to prepare the dataset for feature extraction.
3.2.1. Data Parsing
The first step in data preprocessing focuses on transforming the unreadable, unstructured text into a more coherent and interpretable format. This critical step facilitates the extraction of relevant information without the time-consuming manual preparation that is typically required. Advanced NLP techniques are used in the process, effectively transforming raw data into more understandable text and allowing for efficient information retrieval and analysis.
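As an illustration, this parsing step can be sketched with Python’s standard XML tooling. The element names TEXT and TAGS follow the published i2b2 2014 de-identification release, but the exact schema and attribute names are assumptions that should be checked against the downloaded records.

import xml.etree.ElementTree as ET

def parse_record(path):
    """Return the free-text note and its PHI annotations from one i2b2-style XML file."""
    root = ET.parse(path).getroot()
    text = root.findtext("TEXT")  # the discharge summary as raw text
    tags = [(t.tag, t.get("start"), t.get("end"), t.get("TYPE"), t.get("text"))
            for t in root.find("TAGS")]  # PHI spans with character offsets
    return text, tags

# Hypothetical usage following the XXX-YY.xml naming convention described later:
# note_text, phi_tags = parse_record("220-01.xml")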
3.2.2. Bag-of-Words (BOW)
The data are tokenized: all the words within each record are segmented using a comma (,) delimiter. This tokenization converts the text into a “bag of words” (BOW) representation, defined by fixed-length vectors. Each vector represents the frequency with which individual words appear in the text [
30]. In the context of BOW models, this method is commonly referred to as a “directed” approach. The BOW model is based primarily on the identification of a vocabulary, which includes a set of words found in the record. Individual words are isolated and distinguished using the delimiter (,) to form the vocabulary. The number of times each word appears in the record is then counted and recorded. This process produces a numerical representation of the text’s content, effectively capturing word frequencies for further analysis and modeling.
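A minimal sketch of this counting step is shown below; the function and variable names are illustrative rather than taken from the original implementation, and the comma delimiter follows the description above.

from collections import Counter

def bag_of_words(record, vocabulary):
    """Count how often each vocabulary word occurs in a comma-delimited record."""
    tokens = [t.strip() for t in record.split(",") if t.strip()]
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]  # fixed-length frequency vector

record = "patient, admitted, 2014, patient, discharged"
vocabulary = sorted(set(t.strip() for t in record.split(",")))
print(bag_of_words(record, vocabulary))  # [1, 1, 1, 2] for ['2014', 'admitted', 'discharged', 'patient']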
3.2.3. Data Tagging and POS Analysis
In this phase, the text is segmented into eight main parts of speech, collectively known as Part of Speech (POS) categories. These categories comprise nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, and interjections. POS tagging facilitates data annotation for subsequent tagging tasks. Tagging is carried out using a three-letter notation consisting of “B”, “I”, and “O”. Each letter corresponds to a specific meaning; “B” indicates the “Beginning” of a tagged entity, “I” refers to an “Intermediate” token of an entity, while “O” refers to “Other”, indicating that the word does not belong to any tagged entity [
31]. Within the electronic medical records, a total of 23 basic categories are identified for data tagging, including Date, Patient, Age, Profession, Medical Record, Hospital, Doctor, Street, City, State, ZIP, Phone, Id_Num, Username, Organization, FAX, Country, BIOID, Location, Device, Email, Health Plan, and URL. These categories help in organizing specific information within medical records, laying the groundwork for subsequent analysis and model development. During this process, each tag in the record is associated with the corresponding text above it. To denote their positions and roles, the existing tags are modified with additional letters: the letter “B” is prefixed to the tag of the first word in a category, while the letter “I” is prefixed to the tags of subsequent words in the same category. The letter “O”, on the other hand, denotes words that do not fall into any of the required 23 categories specified in the tags. The tags are associated with several entity types, including Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). For example, the sentence “EU rejects German call to boycott British lamb” is tagged as (B-ORG O B-MISC O O O B-MISC O O), where we have the following (a short code sketch after the list illustrates the scheme):
B-ORG: “EU” is identified as the beginning of an organization.
O: “rejects” does not fit within any recognized category.
B-MISC: “German” is tagged as the beginning of a miscellaneous entity.
The sequence continues with more O tags that are not part of any recognized category.
B-MISC: “British” is tagged as the beginning of a miscellaneous entity.
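The following small sketch (with hypothetical helper names) reproduces this tagging scheme in code, converting token-level entity spans into B-/I-/O labels for the example sentence above.

def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans into BIO tags; end is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"            # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"            # any continuation tokens
    return tags

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
print(spans_to_bio(tokens, [(0, 1, "ORG"), (2, 3, "MISC"), (6, 7, "MISC")]))
# ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']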
The number of occurrences for each of the 23 main PHI categories is summarized in
Table 1. The number of occurrences for the “Other” (O) category after data preprocessing is 775,812.
3.3. Bi-LSTM-CRF Model
The idea behind the sequence labeling challenge is to convert an input sequence into a corresponding, identically sized output sequence. As a result, constructing a Recurrent Neural Network with a sufficient number of hidden layers has produced outstanding outcomes for sequence-to-sequence conversions. Here, recurrent connections serve as a kind of memory that allows the network’s internal state to hold on to insights from the previous inputs. The label applied to the input data is represented by the final output, which is established by taking previous inputs into account as well. There have been several versions of Recurrent Neural Networks developed, but the LSTM architecture stands out since it solves the problem of vanishing gradients effectively. Therefore, the LSTM network is frequently preferred for labeling words inside possibly long sentences. However, since the whole phrase is known ahead of time, the label assignment of a given word can be carried out more precisely if it takes into account the words that come before and after it. It is more efficient to represent these dependencies by using the Bidirectional LSTM [
32]. To analyze data from both directions of the input sequence, the model’s network design (as illustrated in
Figure 2) includes forward and backward LSTM units. Due to this bidirectional arrangement, the model is more capable of capturing word dependencies, regardless of where they are in the sentence. The model is able to better comprehend each word’s meaning and grammatical structure as a result of the concatenation of the PoS vectors and the embedding layer output. Because each word’s label is predicted individually by the Bi-LSTM, the relationships between labels in the output sequence cannot be captured by the model alone. For instance, a word labeled as the beginning of an entity (B) in BIO tagging increases the likelihood that the following word will be marked as the inside (I) of the same entity. This is where the CRF comes in. To guarantee that the predicted labels adhere to correct sequential patterns, the CRF layer is stacked on top of the Bi-LSTM. The likelihood of label transitions (e.g., from B to I, or from I to O) is taken into consideration to effectively model the relations between the predicted labels.
3.4. Bi-LSTM-CRF Model Architecture, Hyperparameter Tuning, and Optimization
The Bi-LSTM-CRF model developed for this research integrates a Bi-LSTM network with a Conditional Random Field (CRF) layer to enhance sequential tagging tasks by capturing both contextual dependencies and transition constraints between output tags. The model’s key components are as follows:
Embedding Layer: The input words are transformed into dense vector representations using an embedding layer with a dimensionality range of 50 to 300.
Bi-LSTM Layer: A multi-layer bidirectional LSTM network is employed with a hidden dimension ranging from 50 to 512, to capture both forward and backward contexts for each token.
CRF Layer: A transition matrix, initialized randomly, scores transitions between tags, including constraints to prevent invalid transitions to and from designated start and stop tags.
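A minimal PyTorch sketch of this architecture is given below. It assumes the pytorch-crf package for the CRF layer; the dimensions shown are single illustrative values drawn from the ranges above, and the actual implementation used in this work may differ.

import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=60, hidden_dim=80):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM; hidden_dim is split across the two directions
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)  # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)         # learned transition matrix

    def forward(self, token_ids, tags=None, mask=None):
        emissions = self.hidden2tag(self.lstm(self.embedding(token_ids))[0])
        if tags is not None:   # training: negative log-likelihood under the CRF
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequences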
Key hyperparameters were fine-tuned to balance performance and computational efficiency:
Learning Rate: Ranges from 0.001 to 0.05, optimizing model convergence while maintaining stability.
Weight Decay: Regularization is applied with a weight decay parameter ranging from 1 × 10−4 to 1 × 10−2, effectively controlling overfitting.
Batch Size: The model processes samples in batches, with a batch size ranging from 8 to 64, enhancing computational efficiency and convergence.
Epochs: Training is performed over 300 epochs to ensure sufficient learning while monitoring for convergence.
The model training utilizes Stochastic Gradient Descent (SGD), selected for its straightforward implementation and effective balance of convergence speed and stability. The design allows for the potential inclusion of gradient clipping to manage exploding gradients, a common consideration in Bi-LSTM models.
The Stochastic Gradient Descent (SGD) optimizer in this model is configured with the following key parameters:
Learning Rate: This is set within the range of 0.001 to 0.05 to balance the learning speed with convergence stability. This range is selected based on empirical observations, providing consistent convergence without overshooting.
Weight Decay: This is set within the range of 1 × 10−5 to 1 × 10−2 as a regularization term, which helps control overfitting by penalizing large weights during training.
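The optimizer configuration and training loop could then look like the sketch below, reusing the BiLSTMCRF class sketched above. The dummy tensors only stand in for tokenized, BIO-tagged records, and the specific values (learning rate 0.01, weight decay 1e-4, batch size 32, clipping norm 5.0) are illustrative points inside the reported ranges.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for tokenized records: 64 sequences of 30 tokens each,
# tagged with one of 47 labels (B-/I- tags for 23 PHI categories plus "O").
token_ids = torch.randint(0, 1000, (64, 30))
tag_ids = torch.randint(0, 47, (64, 30))
mask = torch.ones(64, 30, dtype=torch.bool)
loader = DataLoader(TensorDataset(token_ids, tag_ids, mask), batch_size=32)

model = BiLSTMCRF(vocab_size=1000, num_tags=47)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

for epoch in range(3):  # the paper trains for many more epochs with early stopping
    for batch_tokens, batch_tags, batch_mask in loader:
        optimizer.zero_grad()
        loss = model(batch_tokens, tags=batch_tags, mask=batch_mask)
        loss.backward()
        # Optional gradient clipping to guard against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()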
This Bi-LSTM-CRF architecture thus provides a robust framework for sequence labeling, leveraging the ability of Bi-LSTM to model contextual dependencies and the structured output layer of the CRF to enforce tag transition rules, as summarized in Table 2.
Model Training and Evaluation Metrics
The dataset consists of 1304 records, which are split into 790 training records and 514 testing records, equivalent to a 60% to 40% split, as presented in
Table 3. Each record represents specific medical information about individual patients. In total, there are 269 unique patients, and each patient’s data are represented by 3 to 5 files of XML type. The file naming convention follows a pattern, denoted as (XXX-YY.xml), where XXX indicates the patient number and YY represents the record number within that patient’s dataset. The dataset structure is designed to efficiently manage and differentiate the medical records associated with each patient, coherently facilitating comprehensive analysis and research.
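For illustration, a record-level 60/40 split over the XML files could be performed as sketched below; the directory path is hypothetical, and the filename parsing simply follows the XXX-YY.xml convention described above.

import glob
import os
import random

files = sorted(glob.glob("i2b2_records/*.xml"))   # hypothetical directory of 1304 records
random.seed(42)
random.shuffle(files)

n_train = int(0.6 * len(files))                   # roughly 790 training records
train_files, test_files = files[:n_train], files[n_train:]

def patient_id(path):
    """Extract the patient number XXX from a file named XXX-YY.xml."""
    return os.path.basename(path).split("-")[0]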
The Bi-LSTM-CRF model is trained using the i2b2 dataset over iterative cycles, referred to as epochs. Each epoch represents a full pass through the training dataset, during which the model’s parameters, including weights, are updated based on the input data. After one epoch, the model processes each sample in the training set once, allowing for gradual improvement in its performance through repeated evaluations. This iterative training process enables the model to enhance its data processing capabilities with each cycle, progressively refining its ability to handle complex inputs. To ensure optimal performance, key hyperparameters, such as the learning rate, batch size, and number of epochs, were tuned based on empirical testing. A learning rate of 0.001 is selected to ensure stable convergence, while a batch size of 32 allowed for efficient processing without overloading the system’s memory. The number of epochs is set to 50, providing sufficient training time for the model to learn complex patterns in the data without overfitting. By executing the model over multiple epochs with carefully chosen hyperparameters, sufficient computational resources are allocated to improve its predictive accuracy, thereby enhancing its capacity to identify patterns and generate more reliable predictions.
To better understand the dataset, the model improves its “knowledge base” after each cycle. This iterative approach helps the model learn the subtleties of the dataset and make more accurate predictions based on the knowledge it has gained. Model performance is evaluated through three critical metrics: accuracy, recall, and the F1-score. However, relying only on accuracy is insufficient when evaluating an imbalanced dataset (where certain kinds of data may be disproportionately represented). As a result, we also include recall, which assesses the model’s ability to recognize important data, and the F1-score, which combines recall and precision to produce a more nuanced evaluation of the model’s performance.
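The per-tag metrics reported in the following section can be computed with a standard classification report; the snippet below uses scikit-learn as one assumed option, applied to flattened token-level labels (the toy tags here are purely illustrative).

from sklearn.metrics import classification_report

# Flattened gold and predicted token-level tags for a toy example
y_true = ["B-DATE", "I-DATE", "O", "B-DOCTOR", "O", "B-AGE"]
y_pred = ["B-DATE", "I-DATE", "O", "O",        "O", "B-AGE"]

# Per-category precision, recall, and F1-score, plus overall accuracy
print(classification_report(y_true, y_pred, zero_division=0))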
4. Results and Discussion
In this research, we use the Bi-LSTM-CRF and CRF models. The CRF model has an easier-to-understand architecture and helps reduce overfitting, especially on smaller datasets, whereas the Bi-LSTM-CRF model makes use of deep learning’s advantages for sequential data processing by combining a Bi-LSTM network with a CRF layer. The Bi-LSTM component is well suited for detecting hidden trends and contextual relations since it can capture intricate correlations within sequences. Bi-LSTM and other deep learning models demonstrate high accuracy on larger and more complex datasets, but their performance is highly sensitive to dataset size and requires careful hyperparameter tuning. These models are particularly advantageous for identifying intricate patterns that may be overlooked by simpler models, making them well suited for complex data analysis tasks.
4.1. Optimized Parameter Settings and Model Configuration for Bi-LSTM-CRF
As outlined in
Section 3.4, this section presents the optimized parameter settings for the Bi-LSTM-CRF model, tailored for effective PHI tag detections. The optimization of the Bi-LSTM-CRF model is crucial, as it enhances its ability to accurately capture sequential dependencies and improve tagging performance, especially in complex datasets.
Table 4 provides a detailed summary of the optimal configuration achieved for the model’s architecture, hyperparameter tuning, and optimization strategies. As the experimental results show, the architecture leverages an embedding layer with a dimensionality of 60 to convert input words into dense vector representations, which enhances the model’s ability to learn meaningful relationships between words. A single-layer Bi-LSTM network with a hidden dimension of 80 is employed to capture the bidirectional context for each token, ensuring that both past and future contexts contribute to the model’s understanding of sequential dependencies. A CRF layer, initialized with a transition matrix, scores transitions between output tags and enforces constraints to prevent invalid tag transitions, contributing to improved tagging accuracy. Hyperparameter tuning is performed to achieve an optimal balance between model performance and computational efficiency. The learning rate is set to 0.01, which provides stable convergence. A weight decay parameter of 1 × 10−4 is used to regularize the model, helping control overfitting by penalizing large weights. Additionally, a batch size of 32 is selected, which allows the model to process data efficiently while benefiting from gradient updates. The model is trained over 300 epochs, with the best performance observed at epoch 31, indicating the need for early stopping to prevent overfitting. Finally, the optimization is conducted using the SGD algorithm, chosen for its straightforward implementation and effectiveness in balancing learning speed and stability. The table summarizes these configurations, highlighting the model’s robust framework for sequence labeling tasks.
4.2. CRF Model
The CRF model works with conditional distributions instead of joint distributions, where the conditional distribution does not enforce independent labels, making it easier to interpret natural language. Moreover, it defines the phrase’s order by integrating the probabilistic evaluation of the unique symbols contained in a sentence [
33]. The CRF model is independently applied as part of the initial evaluation phase.
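As a sketch of how such a standalone CRF can be trained, the snippet below uses the sklearn-crfsuite library with simple word- and POS-level features; the library choice, feature set, and toy sentence are assumptions made for illustration, not the exact configuration used in this work.

import sklearn_crfsuite

def token_features(sent, i):
    """Simple lexical and POS features for the i-th (word, POS) pair in a sentence."""
    word, pos = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "pos": pos,
        "prev_word": sent[i - 1][0].lower() if i > 0 else "<BOS>",
        "next_word": sent[i + 1][0].lower() if i < len(sent) - 1 else "<EOS>",
    }

# One toy training sentence: (token, POS) pairs with their BIO labels
sent = [("Dr.", "NNP"), ("Smith", "NNP"), ("examined", "VBD"),
        ("the", "DT"), ("patient", "NN")]
X_train = [[token_features(sent, i) for i in range(len(sent))]]
y_train = [["O", "B-DOCTOR", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))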
Table 5 presents the outcomes of these evaluations in an organized manner. This table provides a comprehensive overview of the model’s effectiveness, covering a variety of assessment criteria such as accuracy, precision, recall, and F1-score.
The CRF performs exceptionally well in terms of accuracy metrics, with an accuracy rate of 99%, an F1-score of 98%, and a recall of 99% as shown in
Table 5. These analyses provide additional knowledge about the CRF model’s performance on the i2b2 dataset. The enhanced performance of the HIPAA-level sets has the potential to significantly improve data security and privacy in EHR, which might have a significant positive impact on real-world healthcare applications.
The results of the comprehensive assessment for each of the 23 basic categories of PHI as defined by HIPAA are shown in
Table 6. B-DATE performs best out of all of these categories, achieving the highest precision and efficiency. Within this evaluation framework, the B-DATE category achieves an exceptional precision score of 99%, meaning that nearly all instances the model labels as dates are correct. Moreover, the 98% recall score demonstrates the model’s ability to retrieve nearly all actual date instances. The F1-score, a crucial measure that balances precision and recall, is 98%, indicating that the model achieves an equitable balance between these two performance aspects for this category.
According to Table 6 and Figure 3, we can note that the OTHER “O” category represents 95% of all records and achieves the best classifier results, with precision and an F1-score of 99%. The other categories with a high number of support records, such as B-DATE with 4980 records, B-DOCTOR with 1912, and I-DOCTOR with 1350, also achieve good metric results, as presented in Figure 4.
When analyzing the results according to B-tags and I-tags, as presented in Table 7 and Figure 5, the B-tags indicate the beginning of entities; hence, it is critical to identify these boundaries precisely, as errors in determining the beginning of an entity may cascade into a series of I-tag mistakes and reduce overall effectiveness. The I-tags often correspond to fragments of longer or more complex entities. Misclassifications are more common when an entity is extended (as indicated by I-tags), especially when the model has difficulty recognizing contextual details. Moreover, several I-tag categories occur infrequently (that is, have few instances in the dataset). As an illustration, categories like I-AGE and I-COUNTRY show very low support (with only 1 and 13 instances, respectively), which reduces the model’s performance. However, identifying the start of an entity with B-tags typically performs better than extending an entity with I-tags, since it is usually less complicated and relies on clearer contextual clues. Furthermore, the dataset’s restricted frequency of I-tag categories might exacerbate the I-tags’ poorer results compared to B-tags.
The model’s performance, however, shows a strong bias towards high-frequency labels, particularly the “Other” (O) label, which constitutes 95% of the dataset records as shown in
Table 5 and
Figure 3. This label achieves high precision and an F1-score of 99%, indicating that the model is highly accurate for the “Other” class. Similarly, categories with higher support counts, such as B-DATE, B-DOCTOR, and I-DOCTOR, also display strong metrics as presented in
Figure 4. However, the model struggles with lower-frequency categories, particularly those labeled with I-tags, which represent internal segments of entities and often require more nuanced contextual understanding. The infrequent occurrence of certain I-tags, such as I-AGE and I-COUNTRY with only 1 and 13 instances, respectively, hinders the model’s ability to generalize effectively for these classes. As a result, the model’s overall performance across PHI categories is uneven, showing marked accuracy for frequent labels but decreased precision and recall for less common, more complex labels. This highlights a limitation in the model’s generalizability, emphasizing the need for more balanced dataset representation to improve classification across all PHI categories.
4.3. Bi-LSTM-CRF Model
In the Bi-LSTM-CRF design, the CRF component makes post-prediction processing easier and produces sensible results. The CRF algorithm uses the relationships between adjacent tokens to correct for any important data missed in the first prediction, improving the accuracy of the result. By implementing the Bi-LSTM-CRF model, we can guarantee enhanced workflow efficiency for medical professionals and improve data protection accuracy. Healthcare organizations can enhance patient trust, improve data-breach defenses, and preserve the ethical responsibility to secure sensitive health records. The results of the Bi-LSTM-CRF model are presented in
Table 8. Regarding the accuracy metrics, the Bi-LSTM-CRF model performs admirably, with an accuracy rate of 98%, a recall score of 99%, and an F1-score of 99%.
The evaluation results for all 23 core categories of PHI, as defined by HIPAA, using the Bi-LSTM-CRF model are presented in Table 9. When analyzing the results in the table and examining the support values, we can see the difference in support between categories, as presented in Figure 6, where the “O” tag represents 98% of the instances, which affects the evaluation results as shown in Figure 7.
According to Figure 7, the evaluation results for tags with support greater than 400 are reasonably good compared to those with support less than 400, as illustrated in Figure 8. We can therefore conclude that the amount of support affects the evaluation metrics: tags with less support might have lower precision or recall because the model struggles to learn from fewer examples, while tags with more support provide more data, which usually leads to more reliable evaluation metrics.
On the other hand, when analyzing the results according to the B tag and I tag, as presented in
Table 10, we can conclude that the B-tags exhibit superior performance across various categories. The model demonstrates a heightened level of confidence in recognizing the initiation of entities compared with their internal components. I-tag performance is low in categories like “AGE”, “CITY”, and “COUNTRY”, indicating that the model struggles to maintain entity recognition beyond the first token. This could be the result of the model leaning towards simpler, single-token predictions or of insufficient training data for multi-token entities. I-tags perform well in categories like “DATE”, “HOSPITAL”, and “STREET”. These entities often follow more structured patterns and contain more tokens, which makes it easier for the model to capture internal tokens efficiently.
The CRF model outperforms the Bi-LSTM-CRF model in most of the 23 evaluated categories, although the Bi-LSTM-CRF still shows promising performance. However, its effectiveness is influenced by the need for a larger dataset to fully leverage its complex architecture and hyperparameter tuning. While the CRF model has a simpler architecture and is generally less prone to overfitting, the Bi-LSTM-CRF model requires a larger volume of data to maximize its potential. Although the Bi-LSTM-CRF has been carefully optimized and tuned, expanding the dataset would allow for even more effective performance improvements and better utilization of the model’s capabilities.
4.4. Comparison Between Our Findings with Other Studies
Clinical data processing has been improved using Bi-LSTM-CRF techniques for PHI de-identification in EHRs. This method, which combines the advantages of sequence labeling and deep learning, improves the accuracy and efficiency of identifying sensitive information.
In [
34], the authors used the Rennes University Hospital’s EHRs to create a French de-identification dataset. The distribution of entities in the training and test sets was consistent, and the dataset included extensive personal information. A Bi-LSTM + CRF model was assessed using manually annotated clinical reports in conjunction with Flair and FastText word embeddings. The model outperformed other tested models and demonstrated the efficacy of this strategy for de-identifying sensitive information with a high F1-score of 96.96%. To identify text that should be classified as sensitive due to privacy concerns, the authors in [
35] used three different models: Bi-LSTM, Bi-LSTM with attention, and BERT base models. Even text that does not at first appear privacy-sensitive can disclose important details about a patient’s character and personal life. Clinical data include hidden features that may cause privacy problems in addition to demographic data. Approximately 92% accuracy was achieved by enhancing the Bi-LSTM model with an attention layer that highlights crucial words important for categorization. The dataset comprised 206,926 phrases, of which 80% was used for training and the remaining 20% for testing. Using the Bi-LSTM model alone, the dataset yielded an accuracy of about 90%. In [
36], the authors presented a novel multi-medical entity recognition approach combining BART, Bi-LSTM, and CRF using a fusion strategy. EHRs were first cleaned, encoded, and segmented. Then, semantic representations were dynamically merged using the BART model. Subsequently, sequential data were captured by the Bi-LSTM network, and CRF was utilized to decode and produce multi-task entity recognition outcomes. The approach produced an average precision, recall, and F1-score of 0.880, 0.887, and 0.883, respectively. Based on the BiLSTM-CRF deep learning model, the authors in [
37] presented the Bi-RNN-LSTM-RNN-CRF electronic medical record named entity recognition model. A bidirectional RNN-LSTM-RNN layer was trained by first gathering an electronic medical record dataset and then utilizing a word vector tool to turn the characters into vectors. After that, a CRF layer received the training data to compute the loss function and produce predictions. To compare the two models, identical procedures were carried out using the conventional BiLSTM-CRF model. According to the experimental results, the Bi-RNN-LSTM-RNN-CRF model outperformed the BiLSTM-CRF model in recognition, with an F1-score of 97.80%. Xiaocheng et al. [
19] used a manually annotated corpus to refine the bidirectional transformer pre-training model BERT according to the BIOES (Begin, Inside, Outside, End, Single) tagging standard. Word vectors, which capture the context in electronic medical records, represent the semantic content of words in the text and are learned in an unsupervised manner. Character sequences are fed into the BERT model, which learns their state properties. The CRF layer uses this information to optimize sequence transition constraints. The BERT-IDCNN-CRF model obtained an average accuracy of 94.5%, recall of 93.8%, and F1-score of 94.1%.
Our research demonstrates that CRF performs better than Bi-LSTM-CRF in the majority of the 23 assessed categories. This is probably because of the size of the dataset and the significant chance of overfitting in deep learning models on smaller datasets. Our findings show that while Bi-LSTM-CRF shows potential for larger and more diverse datasets, CRF offers a solid baseline with robust generalization. The dual-model strategy used in our work demonstrates the significance of choosing a model based on the properties of the data and the necessity of giving similar importance to deep learning and simpler architectures to attain the best results across a variety of datasets. By doing this, we offer a thorough analysis that clarifies how our methodology successfully manages the trade-offs between model complexity and performance, resulting in a more sophisticated knowledge of PHI detection challenges.
Table 11 compares the methods for de-identification used in other recent research.
5. Conclusions and Potential Future Works
Privacy protection in healthcare has become critical due to the rapid global adoption of EHRs, which contain vast amounts of unstructured textual data. The sensitive nature of Protected Health Information (PHI) within these records presents significant privacy challenges, necessitating robust de-identification techniques. This paper introduces a novel Bi-LSTM-CRF model approach, utilizing the i2b2-2014 dataset from Harvard University to achieve accurate and reliable PHI de-identification. Unlike prior studies that often unify Bi-LSTM and CRF layers, our approach emphasizes the individual design, optimization, and hyperparameter tuning of both components, resulting in precise model performance improvements. This rigorous approach to architecture and tuning, typically underexplored in the existing literature, enhances the model’s capacity to effectively detect PHI tags while retaining the essential clinical context.
Comprehensive evaluations were conducted across the 23 PHI categories defined by HIPAA, ensuring robust security across vital domains. The optimized model demonstrates exceptional performance, achieving a precision of 99%, recall of 98%, and an F1-score of 98%, which underscores its effectiveness in balancing recall and precision. By enabling the de-identification of medical records, this research strengthens patient confidentiality, promotes compliance with privacy regulations, and facilitates safe data sharing for research and analysis. Interestingly, certain PHI categories displayed outstanding accuracy, confirming the model’s efficacy in identifying and safeguarding PHI within medical records.
The performance differences observed between the CRF and Bi-LSTM-CRF models suggest a range of influences, including dataset characteristics, specific tasks, and model architecture.