1. Introduction
Deep learning has significantly enhanced computational efficiency in recent years as a result of continuous scientific research and advances in hardware, and it has emerged as the most vigorous technology in the field of artificial intelligence. Many applications are created using deep learning algorithms with large amounts of data, and some medical applications are invaluable to the healthcare system. Our study aims to produce an ICD-9 prediction model by utilizing only the subjective component. The study implements two distinct deep learning approaches to produce the prediction model. The prediction model behaves as an assistance tool that helps a patient learn the likely ICD-9 codes before approaching the hospital. It can also act as a clinical support tool to aid medical professionals. Most existing medicine recommendation systems help doctors make reliable clinical decisions [
1]. Early prediction of a disease can help safeguard lives. A recent study develops a computational model to predict sepsis at an early stage [
2]. Various techniques are involved in predicting the recurrence of breast cancer to assist the decision-making process [
3,
4,
5,
6,
7]. EHRs contain massive amounts of data that are invaluable to the healthcare system [
8]. Utilizing this massive amount of data, decision-making systems contribute to the contemporary healthcare system.
The World Health Organization (WHO) maintains the International Classification of Diseases (ICD) for categorizing and tracking diseases [
9]. ICD-9 is widely used across all areas of healthcare to report diagnoses. The ICD-9 system assigns a specific code to each illness. Physicians assign the codes based on Electronic Health Records (EHRs) to determine severity and track diseases. Some countries use ICD-9 codes for computing billing and estimating severity [
9]. Medical professionals assign ICD-9 codes to support a doctor’s clinical diagnosis. In this technological age, it is essential to be able to predict ICD codes [
10,
11]. Thousands of ICD codes are used by the international medical community to classify illness, accident, and cause of death [
12].
Many countries rely on the ICD to track deaths and mortality statistics. The ICD system has more than one version, depending on each country’s situation and clinical conditions [
13]. Various countries have adopted ICD-10, including the United States. The United States follows ICD-10-CM and Canada follows ICD-10-CA in their healthcare systems. In Taiwan, the ICD-9-CM, in use since 1994, has now changed to ICD-10-CM. There are some modifications from ICD-9 to ICD-10, such as the addition of laterality and recategorization [
14]. Since our medical record data still follows ICD-9, this study focuses on ICD-9 code prediction; in addition, our proposed method is generic to both ICD-9 and ICD-10. The ICD-9 codes contain various levels: chapter, block, three-digit code, and full code. There are more than 13,000 distinct codes in the ICD-9 system, which are divided into 17 chapters of disease codes and 2 chapters of trauma and supplemental classifications. The 17 chapters are further divided into 135 blocks, which contain the disease codes.
Figure 1 demonstrates the different levels of the ICD code. The hierarchical table for code 765.02 is shown for better understanding in
Table 1. The ICD-9 format is a five-digit code: the first three digits are called the category, and the last two digits indicate the etiology (cause), anatomic site, and manifestation (signs and symptoms);
Figure 2 clearly demonstrates the format using the same code.
In this study, we collaborate with a medical center to develop our prediction model. Outpatient records, emergency medical records, and inpatient medical records are the three types of medical records that are commonly used. Clinical medical records include information such as hospitalization records, medication descriptions, test reports, analysis records, treatment plans, and derived ICD codes, whereas a hospitalization medical record contains admission records, hospitalization course records, appointment records, invasive diagnosis or procedure records, surgical records, anesthesia records, consent documents, prescription records, antibiotic use, progress notes, and a discharge summary. The Problem-Oriented Medical Record (POMR) is a recording approach for medical data; it consists of four parts: database, problem list, initial plan, and progress note. The POMR follows the SOAP format, which consists of the subjective component, objective component, assessment, and plan [
13,
15]. In the SOAP format, the subjective component serves as our training data; it includes the feelings, opinions, and complaints of patients.
This study makes two contributions. The major contribution is the construction of deep learning-based approaches that forecast ICD-9 codes by applying only the subjective component. The second is that the study predicts 1871 ICD codes and, notably, achieves a recall of 57.91% at the chapter level.
2. Related Work
A multi-label classification approach uses raw nursing notes to predict ICD-9 chapters [
16]. The majority of studies utilize nursing notes for the ICD-9 code prediction [
17,
18]. In our study, a subjective component is used as training data to predict disease codes. This model aids doctors in selecting the relevant ICD-9 code from the vast number of available ICD codes, helping medical personnel choose a reliable code. Additionally, the concept can be used in a variety of applications, including medical chat-bots. Meanwhile, deep learning-based models for ICD-10 are emerging. A recommendation system for ICD-10 is implemented using a GRU-based model [
19]. Previous research creates a deep learning model to minimize human effort and automatically detect ICD-10 codes; this model forecasts the ICD codes based on diagnosis records [
20]. A recent study performs automatic ICD-10 coding for the primary diagnosis in a Chinese context. This method works with discharge procedure and diagnosis texts and performs well on cardiovascular diseases [
21].
A recent study developed an app that aids medical personnel in predicting incisional hernia occurrence [
22]. ICD-9 codes are utilized to cross-verify the eligible patients. Their study is a good demonstration of turning an institutional dataset into an application, and our prediction model can likewise assist in developing remote-assistance applications. Prior research proposes an ICD-9 prediction method that uses a transfer learning approach to transform MeSH index information into ICD-9 codes [
23]. A previous study proposes an approach for assigning ICD-9 codes to track the disease history of patients, which helps in the billing system [
18]. In this process, the ICD-9 code prediction is made based on clinical notes, which are used to classify the top 10 ICD-9 codes and blocks. The Enhanced Hierarchical Attention Network (EnHAN) [
24] utilizes discharge summaries to solve ICD-9 prediction problems; to deal with multi-label problems, the method uses topical word embeddings. Moons et al. [
25] solve the ICD-9 prediction using discharge summaries. Multiple approaches are used in the study to identify ICD-9 codes. On the discharge summary, Word2Vec is used in the multi-label classification [
26]. A recent study uses TF-IDF in the process of generating embedding vectors [
27]. The CNN is used to identify ICD-9 codes in their process. In our study, we apply Word2Vec to the subjective components and use LSTM and GRU networks to predict ICD-9 codes.
In our previous study [
13], we employed a CNN-based network to predict ICD-9 codes, which achieved a recall of 58% at the chapter level. In this work, we apply LSTM and GRU networks to predict ICD codes, removing stop words and applying TF-IDF. This approach achieves a comparable recall of 57.54%, and the top-10 prediction model achieves a recall of 67.37%.
A prior study takes its base knowledge from ontologies to understand clinically related features, developing a robust deep neural framework for disease diagnosis [
28]. An SVM classifier specifically predicts the ICD-9 codes of mechanical ventilation, phototherapy, jaundice, and intracranial hemorrhage using ICU notes; n-gram feature extraction methods are utilized in that approach [
29]. Previous studies evaluate supervised learning approaches to predict 1231 ICD-9 codes using EMR data. In this method, the EMRs are gathered from three datasets that differ in the number of codes and the size of the records. The model is implemented with the discharge summary in the prediction process [
30]. Prior research proposes a hierarchical model with an attention mechanism to assign ICD codes using diagnosis descriptions. That study assigns only 50 ICD-9 codes and achieves an F1 score of 0.53 [
31]. Our study handles 1871 codes in the ICD prediction process. Another study deals with diagnosis descriptions to forecast ICD-9 codes, creating a neural structure for automatic ICD coding that achieves a sensitivity of 0.29 while handling 2833 ICD codes [
32]. The lower sensitivity is due to the large number of ICD-9 codes involved in the prediction. FarSight is a method with a long-term integration technique that detects the onset of disease from early symptoms; it consumes unstructured nursing notes to predict the 19 ICD-9 chapter codes [
33].
With the advancement of the medical field, a vast number of electronic health records (EHRs) are exchanged with healthcare providers in order to improve medical services. H. Li takes advantage of EHR data to develop a reliable bone disease prediction model [
34]; analyzing and interpreting the data in EHRs helps in developing medical systems. A previous study was designed to predict heart disease based on such data [
35]. To train our prediction model, we have used subjective components that reflect a patient’s feelings about their illness. The prediction model can work as a self-assistance tool that helps identify ICD-9 codes.
Table 2 shows the comparison of our study with other approaches. In this study, we predict ICD-9 codes with the help of subjective components. This comparison shows the uniqueness of our study.
3. Materials and Methods
3.1. The Data
The entire dataset holds a total of 146,343 medical records. The data include 11 attribute fields: hospitalization number, date, time, medical record number, author, subject, ICD code, subjective component, objective component, assessment, and plan. The subjective components mainly record the patient’s emotions and opinions regarding the illness. In this study, we use the subjective component as the training data. The records range from 2012 to 2017. Across around 140,000 medical records there are 1871 different disease codes, distributed over the 17 chapters of ICD-9 and the supplementary category. In our data, 24 disease codes have more than 1000 medical records each, accounting for approximately 40% of the total data, and 234 disease codes have more than 100 medical records each, accounting for approximately 80% of the total data volume. Level 1 (chapter) consists of 17 chapters plus a supplementary category, level 2 (block) holds 128 codes, level 3 (three-digit code) holds 624 codes, and level 4 (full code) consists of 1871 codes.
In our dataset, the majority of records are from respiratory diseases (chapter 8): more than 25,000 (17%) medical records are related to respiratory diseases. Tumors (chapter 2) have the second-highest volume, with a proportion of 15%. Subsequently, circulatory system diseases (chapter 7) account for 13% of the data, and digestive system diseases (chapter 9) hold 12% of the total dataset. Further, chapter 18 has accumulated more than 6000 medical records.
3.2. Word2Vec
Word2vec is a model that aids in the translation of vocabularies into vector representations [
36,
37]. The core concept of the code suite is derived from the concept of word vectorization [
38]. The Word2Vec model consists of the CBOW (Continuous Bag-of-Words) and skip-gram models. Based on the previous and next words, CBOW predicts the current target word: the input layer takes the surrounding words to produce the (current) target word. On the other hand, the skip-gram model predicts the previous and next words using the current target word as input: the input layer takes the target word to produce the surrounding words as output. According to semantic sense, the Word2vec model transforms each vocabulary item into a word vector; as a result, similar words are grouped together in a high-dimensional space;
Figure 3 clearly illustrates the process of Word2vec.
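To make the two training objectives concrete, the following minimal sketch generates the (input, output) pairs each model learns from. This is illustrative only, not the study's actual training code; the function names, the sample sentence, and the window size are assumptions for demonstration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: the current target word predicts each surrounding context word."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW: the surrounding context words jointly predict the current target word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs


# Hypothetical tokenized complaint, purely for illustration.
tokens = ["patient", "reports", "cough", "and", "fever"]
print(skipgram_pairs(tokens, window=1)[:2])  # [('patient', 'reports'), ('reports', 'patient')]
```

A library such as gensim would then learn the actual vector representations from pairs of this kind.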
3.3. Data Cleaning and Word Segmentation
Numerous numerical values, such as record times and dates, appear in the subjective components. The first pre-processing step removes numeric characters from the text. Subsequently, we convert all English words to lowercase and remove all punctuation marks and special symbols to achieve more reliable word segmentation results. In the segmentation stage, we used the jieba segmentation suite (Jieba:
https://github.com/fxsjy/jieba (accessed on 3 September 2019)) which is widely used in the field of Chinese word segmentation. The native kit is developed based on simplified Chinese. In our research, subjective components are based on traditional Chinese. Accordingly, we have utilized jieba-tw (Jieba-tw:
https://github.com/APCLab/jieba-tw (accessed on 3 September 2019)) to solve this problem. After data cleaning and word segmentation, the total number of words in our text is 27,196. We also added English Stop Words (ESW) and Chinese Stop Words (CSW) to the stop word list in this study, along with custom stop-word lists: One Count Stop Words (OCSW), One Character Stop Words (OCharSW), TF-IDF (Term Frequency-Inverse Document Frequency) stop words (TSW), and IDF stop words. OCharSW deal with special symbols that consist of a single character; the word segmentation results serve as input for OCSW, which handle statistically trivial words that occur only once.
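The cleaning steps above can be sketched as follows. This is a simplified illustration under stated assumptions: `clean_text` and the sample note are hypothetical names, and the actual pipeline segments the cleaned traditional-Chinese text with jieba-tw afterwards.

```python
import re

def clean_text(text):
    """Pre-processing sketch: strip numerals, lowercase English letters,
    and drop punctuation marks and special symbols."""
    text = re.sub(r"\d+", " ", text)        # remove numeric characters (times, dates, ...)
    text = text.lower()                     # lowercase all English words
    text = re.sub(r"[^\w\s]", " ", text)    # remove punctuation and special symbols
    return re.sub(r"\s+", " ", text).strip()

# Segmentation would then use the jieba-tw suite, e.g.:
#   import jieba                            # with the jieba-tw fork installed
#   tokens = list(jieba.cut(clean_text(note)))
```

Note that the character-class pattern keeps Chinese characters intact, since `\w` matches CJK word characters in Python's Unicode mode.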
3.4. Term Frequency-Inverse Document Frequency
Additionally, we utilize the approach of TF-IDF [
39] to calculate the frequency of each word across all texts and observe the significance of each word in each medical record.
TF stands for Term Frequency, represented by TF(t, d) and depicted in Equation (1), where the subscript t denotes a specific word (term), d denotes a specific article (document), n(t, d) is the number of occurrences of word t in article d, and Σ_k n(k, d) is the sum of the occurrences of all words in article d:

TF(t, d) = n(t, d) / Σ_k n(k, d).    (1)

The TF value helps us understand the frequency of a word within an article. IDF stands for Inverse Document Frequency, represented by IDF(t) and depicted in Equation (2), where df(t) is the number of articles containing the term t and N is the total number of articles:

IDF(t) = log(N / df(t)).    (2)

The IDF value helps us understand how widespread a word is across all articles.
The TF-IDF value is the product of TF and IDF, as depicted in Equation (3):

TF-IDF(t, d) = TF(t, d) × IDF(t).    (3)

A larger value indicates a higher importance of the word in the article. Because the TF-IDF value of each word differs between medical records, we sum the value of each word across the medical records, take the average as the criterion, and rank the words by this value. From this ranking we find that the words with lower weights are either not directly related to the disease code or were not segmented properly.
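Under the definitions in Equations (1)–(3), a minimal computation can be sketched as follows. This is an illustrative implementation with a natural-log IDF; the helper name and the toy documents are assumptions, not the study's code.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score every term of every document by TF(t, d) * IDF(t)."""
    N = len(docs)
    df = Counter()                          # df[t]: number of documents containing t
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)                    # sum of occurrences of all words in the doc
        scores.append({t: (n / total) * math.log(N / df[t])
                       for t, n in counts.items()})
    return scores

docs = [["cough", "fever", "cough"], ["fever", "rash"]]
scores = tf_idf(docs)
```

A term such as "fever" that appears in every record scores zero, while record-specific terms rank higher; averaging each word's score over the records then exposes the low-weight words.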
3.5. Word Encoding and Word Embedding
After pre-processing, our medical record text data contain a total of 14,767 distinct words, and the longest medical record comprises 147 words. Subsequently, we build a dictionary that maps every word in the text to a corresponding number. As a result, each word is represented by a number, and each medical record is converted into a vector representation. After the text is transcoded, we utilize Word2vec to convert the medical record text into vectors, so that words with similar meanings are allocated to similar places in the high-dimensional space. All records are stretched to a common length so that the embedding layer can translate the encodings into vectors.
Medical records have a maximum word count of 147, which is used as the fixed length for all texts; short texts are therefore padded to meet this length. Finally, one-hot encoding is used to convert each word into a one-dimensional vector of length 14,767. As a result, the input dimension of the embedding layer is 14,767 and the output dimension is 300. Each medical record vector contains 147 word vectors, corresponding to the maximum length of the training data.
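The encoding and padding steps can be sketched as follows. This is a toy illustration under stated assumptions: in the study the dictionary holds 14,767 words and the fixed length is 147, and the integer ids produced here are what an embedding layer consumes in place of explicit one-hot vectors.

```python
def build_vocab(records):
    """Map each distinct word to an integer id (1-based; 0 is reserved for padding)."""
    vocab = {}
    for record in records:
        for word in record:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_and_pad(records, vocab, max_len):
    """Encode each record as word ids and right-pad with 0 to the fixed length."""
    padded = []
    for record in records:
        ids = [vocab[w] for w in record][:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded

records = [["cough", "fever"], ["fever"]]            # toy segmented records
vocab = build_vocab(records)
encoded = encode_and_pad(records, vocab, max_len=3)  # [[1, 2, 0], [2, 0, 0]]
```

Feeding integer ids to an embedding layer is equivalent to multiplying the one-hot vector by the embedding matrix, which is why the layer's input dimension equals the vocabulary size.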
3.6. Deep Learning Methods
We utilize the pre-processed data to implement LSTM and GRU.
Figure 4 depicts the model architecture, which covers both LSTM and GRU. The architectures of the two models are identical: the input of the network is a subjective component and the ICD-9 codes are the output. A total of 1871 ICD-9 codes are predicted using our model. The main difference between the two lies in the parameter settings of the LSTM and GRU units.
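To illustrate what a GRU unit computes at each of the 147 time steps, the following NumPy sketch implements one standard GRU step. It demonstrates only the unit's gating equations; the dimensions and random weights are assumptions for illustration, not the trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step: the reset gate r limits how much past state feeds
    the candidate, and the update gate z blends old state with the candidate."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))    # candidate hidden state
    return (1 - z) * h + z * h_cand            # new hidden state

rng = np.random.default_rng(0)
d_in, d_hid, steps = 300, 64, 147              # embedding dim 300; 147 padded steps
Wz, Wr, Wh = [rng.standard_normal((d_hid, d_in)) * 0.1 for _ in range(3)]
Uz, Ur, Uh = [rng.standard_normal((d_hid, d_hid)) * 0.1 for _ in range(3)]

h = np.zeros(d_hid)
for x in rng.standard_normal((steps, d_in)):   # one embedded word vector per step
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
# The final h summarizes the whole record and would feed the 1871-way output layer.
```

The LSTM variant adds a separate cell state and an output gate; frameworks such as Keras provide both units, so the architecture choice reduces to swapping one recurrent layer.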
6. Discussion
Our prediction model behaves as an assistance tool that helps patients learn the likely ICD-9 codes before approaching the hospital. The subjective component is a primary part of the SOAP note; the other components hold more detailed information, including medical results, the long course of medical treatment, and diagnosis methods. This study aims to provide a prediction model to help patients who want to go to the hospital; to achieve that, we have chosen the subjective components (feelings, opinions, and complaints). Moreover, the study shows the value of subjective components relative to other types of reports.
In the initial stage, we applied data cleaning and segmentation. Subsequently, we removed various stop words during the experiment.
Table 11 shows the effects of removing stop words in the experiment. According to the findings, recall increased slightly after utilizing English and Chinese stop words. Because the increase was not substantial, we chose to retain more of the original data and kept the Chinese stop words at this level.
We used the LSTM network as the first option for training the deep learning model in this study, and we experimented with several changes in the natural language processing steps applied to the training data. First, prior to word segmentation, we removed punctuation marks, special symbols, numbers, and stop words from the data during pre-processing. The OCSW was prioritized in our study; this approach helped us remove unimportant words that appear only once in the training data.
Table 11 illustrates the results for both English stop words (ESW) and Chinese stop words (CSW); recall increased slightly after using them. Next, we found many single characters and special symbols, such as ↓ and other marks, totaling 1312 characters in our data. We removed them with the help of OCharSW, and recall improved by 0.5% as a result.
Figure 8 shows a 95% prediction rate for Chapter 15 (certain conditions originating in the perinatal period).
Figure 7 depicts the prediction rate of the block (other conditions originating in the perinatal period (764–779)), which reached 95%; this block belongs to Chapter 15. The chapter has only two blocks, (maternal causes of perinatal morbidity and mortality (760–763)) and (764–779), with 11,685 and 16 medical records, respectively. Due to the disparity in their distribution, the prediction rate of the chapter and block remained at 95%, and further improvement was not achieved.
Figure 8 confirms that Chapter 5 (mental disorders) achieves an 81% prediction rate. Similarly,
Figure 7 depicts that the block (other psychoses (295–299)) gained a 78% prediction rate.
Chapter 5 covers “mental disorders”, and Chapter 15 describes “certain conditions originating in the perinatal period”. In our dataset, the patient self-reports in Chapters 5 and 15 consist of relatively short sentences composed of specific medical terminology, which leads to higher accuracy.
The preference for recall stems from the fact that the cost of failing to predict a patient’s disease is much higher than the cost of subjecting a healthy person to additional tests. This study is intended to help patients who need to go to the hospital, and therefore aims to predict the relevant ICD codes as comprehensively as possible.
Our study differs substantially from previous studies in the ICD-9 prediction process. Most studies focus on handling a minimal number of ICD codes, and some of them achieve a higher recall than this study. However, this research predicts 1871 ICD codes using only a subjective component, scoring a recall of 0.57 at the ICD chapter level. Previous studies utilized more medically specific data, such as discharge summaries, nursing notes, and diagnosis summaries, in which numerous medical tests and treatment processes are embedded. This study competes with them using only a basic complaint.
Table 12 shows that the input words of the medical record support the correct predictions. Long hospitalization stays, similarities in daily medical records, the same consulting doctors, and similar inner feelings of patients can be reasons for the input data similarity. The wrong prediction shows that completely different words cause the failure of a correct prediction; however, the prediction results are quite close. The wrongly predicted codes 774.1 and 770.1 belong to the same block (764–779), which indicates that the chapter and block are correctly predicted.
Among the top 20 disease codes in the research results, there are two disease codes that are very close in meaning: code 486 (pneumonia) and code 485 (bronchopneumonia).
Figure 5 shows that ICD-9 code 486 scored 50% and code 485 scored 53% in the confusion matrix of the Level 4 (full code) prediction. Repeated identical words are more likely to cause errors in the prediction of these two codes, especially when predicting code 485. The most obvious difference between the two disease codes is that records coded 485 usually contain the word “rhinorrhea”, which the model can exploit; this indicates that the word describes a common symptom or is commonly used when writing these medical records, an argument that is quite reasonable from the point of view of the disease represented by code 485.
Threats to Validity
Construct validity: This research aims to predict ICD codes using our prediction model. The subjective component, a part of SOAP, is used in this study to predict ICD codes. The model behaves as an assistance tool that helps patients learn the ICD codes before reaching the hospital. Our model predicts 1871 ICD codes.
Internal validity: Selection threats are handled in this study. Data cleaning, segmentation, and NLP approaches are involved in the data pre-processing used to build the prediction model. The study builds a model to solve the ICD prediction problem.
External validity: The study can be used to predict an ICD code from a simple patient self-report. This model can help any patient describe their feelings about a medical issue, and the approach generalizes to other patients.
Conclusion validity: This study uses the GRU model as a benchmark for defining model parameters in the experiment process. Ten-fold cross-validation is performed, and the results are discussed in this study.