1. Introduction
Bipolar disorder is a severe mood disorder characterized by alternating episodes of depression and mania [1,2]. During periods of mania, patients may exhibit unusually energetic, happy, or irritable behavior and have reduced sleep. During depression, patients may cry inexplicably, have a negative attitude toward life, and make poor eye contact with others. According to statistics, 6% of patients with bipolar disorder die by suicide and another 30–40% engage in self-harm. Many patients with bipolar disorder also suffer from other mental illnesses, such as substance abuse and anxiety disorders. According to academic research, people with bipolar disorder account for about 1% of the global population [3]. In the United States, approximately 3% of the population experiences bipolar symptoms at some point in their lives, with no significant gender differences [4]. The most common age of symptom onset is between 20 and 25 years, and the younger the age of onset, the worse the prognosis [5].
It is known that machine learning provides advanced techniques for better diagnosis of illnesses. A recent review explored studies based on machine learning models used to diagnose patients with bipolar disorder [6]. After preprocessing and screening, 33 articles that met the inclusion criteria were identified. Among them, various types of data and machine learning algorithms were used to develop models for the diagnosis of bipolar disorder. The accuracy ratios of these studies were very inconsistent, ranging from 0.98 to 0.64. Therefore, there is still potential for improvement in the predictive performance on this topic. Among previous studies based on machine learning models for detecting patients with bipolar disorder, only the study by Paulo J. C. Suen et al. used feature data from an electronic health record system, but their target dataset consisted of only 155 samples [7]. In addition, the feature data used in that study included personality characteristics, depression severity, anxiety level, affect scale, etc., all of which require clinical assessments by professional psychologists and psychiatrists. Two supervised learning methods were adopted to train classification models: logistic regression and the XGBoost algorithm. The results of that study showed an accuracy of 0.78–0.57, a true positive rate (TPR) of 0.75–0.50, a true negative rate (TNR) of 0.81–0.64, a positive predictive value (PPV) of 0.69–0.43, and a negative predictive value (NPV) of 0.86–0.70. On the other hand, two previous studies constructed diagnostic models for bipolar disorder using kernel-based approaches. In the research of Benson Mwangi et al., gray and white matter density maps were obtained from neuroimaging scans and were analyzed with the relevance vector machine algorithm. However, the dataset consisted of only 256 samples, and the accuracy ratios of the trained models for detecting bipolar disorder ranged from 0.703 to 0.649. Results for other evaluation metrics included a TNR of 0.742–0.711, a TPR of 0.664–0.586, a PPV of 0.714–0.671, and an NPV of 0.685–0.634 [8]. Likewise, in the study by Julia O. Linke et al., diffusion tensor imaging data were acquired from 118 participants and were used to train Gaussian process classifiers to identify bipolar disorder patients. Evaluation results of that study included a TPR of 0.682–0.611, a TNR of 0.842–0.591, and an accuracy of 0.754–0.601 [9]. In addition, many other studies have tried to identify bipolar disorder patients with machine learning methods [6]. In the study by M.I. El Gohary et al., models trained with a support vector machine were used to discriminate bipolar disorder patients from control samples based on recordings of the participants' electroencephalography rhythms. The evaluation results showed that their prediction model could reach an accuracy of 0.980–0.740, the best performance reported so far. However, only 230 samples were collected in that study, so the validity of the prediction model needs further verification [10]. In summary, since these previous studies used different datasets to construct models for identifying bipolar disorder patients, we cannot directly compare their evaluation results. However, almost all of their feature data must be obtained through special medical instruments or biochemical tests, such as immune-inflammatory signatures, blood samples, magnetic resonance imaging, electroencephalography, genomic data, etc. [6]. Therefore, the lack of easy access to analytical data presents a barrier to the development of machine learning models. Moreover, among these 33 studies, only 2 (6%) had a sample size greater than 2000; the analysis data of all remaining studies contained fewer than 1000 samples [6]. Theoretically, the smaller the analysis sample size, the more limited the representativeness of the contained data. This may be why the accuracy ratios of these studies are so inconsistent.
There is an intuitive way to improve the performance of machine learning: different models can be trained using the same dataset, and their prediction outcomes can then be integrated. To this end, the idea of "knowledge distillation" has been proposed and verified in various studies. In the practice of knowledge distillation, a sophisticated model or multiple models are first trained using any learning algorithm. The outcomes produced by this group of "teacher models" can be thought of as conditional distributions over the input data and may be referred to as "soft labels." These data distributions can be used as the learning targets for a "student model," which is trained using a simpler learning architecture [11,12]. Alternatively, the soft labels can be used as reference information to train the student model together with the original input data. This process may be seen as the student model "distilling" the "knowledge" provided by the group of teacher models [13].
Kernel density estimation (KDE), a nonparametric estimation approach in statistics, has been widely exploited to identify distributions in various types of datasets. A kernel density estimator generates an approximate probability density function (PDF) by computing a linear combination of weighted kernel functions placed at the locations of all data instances in the vector space [14,15,16]. Inspired by the aforementioned previous research, we attempted to address its limitations while developing machine learning models for the diagnosis of bipolar disorder. Firstly, the type of data analyzed in this study is medical history information, which is readily available from the electronic health record system. Secondly, the dataset of this study contains tens of thousands of samples, which is much larger than the sample sizes of existing studies, thus improving the representativeness of the constructed models. Moreover, based on the concept of knowledge distillation, the PDF values produced by the KDE method were transferred as soft labels to construct prediction models of bipolar disorder using various machine learning methods. According to the evaluation results, using the data distribution information generated by KDE did indeed improve the predictive performance of the diagnostic models for bipolar disorder. In addition, the branching attributes selected by the decision trees were mapped back to specific disease diagnoses, all of which are associated with bipolar disorder. To the best of our knowledge, this study is the first attempt to apply KDE to knowledge distillation for supervised machine learning.
2. Materials and Methods
2.1. The Input Data
In the early 2000s, the Laboratory for Computational Physiology at the Massachusetts Institute of Technology (MIT) began to implement the project Integrating Signals, Models and Reasoning in Critical Care. The main goal of this project was to build a large dataset for research based on intensive care, the result of which is the Medical Information Mart for Intensive Care (MIMIC) database. The contents of this database come from Beth Israel Deaconess Medical Center (BIDMC). MIMIC is a publicly shared medical database. It contains de-identified information from electronic medical records for thousands of adult patients admitted to medical/surgical intensive care units and emergency wards. The development of this database was approved by the ethical review boards of BIDMC and MIT. MIMIC has been used extensively by academic researchers around the world to help promote advances in clinical informatics, epidemiology, and machine learning [17].
In the database tables of MIMIC, all of the information of a single patient is linked through the field value of "subject_id." In this case-control study, the case group included patients with bipolar disorder and/or related symptoms. The following diagnostic codes were used when selecting case samples from the "diagnoses_icd" table: the ICD-9 codes 296.40–296.45, 296.50–296.56, 296.60–296.62, 295, and 298, and the ICD-10 codes F20, F29, and F31. Then, 10,000 people were randomly selected from these bipolar disorder patients to form the case group. The date on which bipolar disorder was first diagnosed for each patient, i.e., the field value of "admittime," was regarded as the index date. Finally, for each patient, the subject_id was used to retrieve all of his/her diagnosis records in the database.
On the other hand, the control samples did not have diagnoses of bipolar disorder or any associated symptoms in the database. They were matched with the patients in terms of age and gender, i.e., the field values of "gender" and "anchor_age" from the "patients" table. In addition, in the month of a patient's index date, the corresponding control sample needed to have a diagnosis record representing a similar health status. Based on the aforementioned matching conditions, this study selected control samples at ratios of 1 to 1 (i.e., 10,000 samples) and 1 to 3 (i.e., 30,000 samples). Finally, for each control sample, the subject_id was used to retrieve all of his/her diagnosis records from the database to form the input data.
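As an illustration of this cohort-selection step, the following minimal pandas sketch screens case samples by ICD code and derives index dates. The CSV file names and the assumption that ICD-9 codes are stored without decimal points are ours; this is a sketch of the selection logic, not the authors' actual program.

```python
import pandas as pd

# Hypothetical CSV exports of the MIMIC tables named in the text.
diag = pd.read_csv("diagnoses_icd.csv")  # subject_id, hadm_id, icd_code, icd_version
adm = pd.read_csv("admissions.csv")      # subject_id, hadm_id, admittime

# Case-defining codes (ICD-9 codes assumed stored without decimal points).
icd9 = tuple(f"296{s}" for s in
             ("40", "41", "42", "43", "44", "45", "50", "51", "52",
              "53", "54", "55", "56", "60", "61", "62")) + ("295", "298")
icd10 = ("F20", "F29", "F31")

codes = diag["icd_code"].astype(str)
is_case = (((diag["icd_version"] == 9) & codes.str.startswith(icd9)) |
           ((diag["icd_version"] == 10) & codes.str.startswith(icd10)))

# Randomly select 10,000 of the qualifying patients to form the case group.
case_ids = diag.loc[is_case, "subject_id"].drop_duplicates().sample(
    n=10000, random_state=0)

# Index date: the admittime of the first admission at which the diagnosis appears.
index_dates = (diag[is_case & diag["subject_id"].isin(case_ids)]
               .merge(adm, on=["subject_id", "hadm_id"])
               .groupby("subject_id")["admittime"].min())
```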
2.2. Kernel Density Estimation
Kernel density estimation (KDE) is the application of kernel smoothing to probability density estimation, i.e., a non-parametric method for estimating the probability density function of a random variable using kernels as weights [14,15]. KDE addresses a fundamental data-smoothing problem in which inferences about the population are made from a finite data sample [16]. For the basic definition of KDE, let (x1, x2, …, xn) be independent and identically distributed samples drawn from a distribution with an unknown density f. The kernel density estimator of f at any given point x is defined by Formula (1).

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K(x - x_i;\, h) \tag{1}$$

In Formula (1), K(x − xi; h) is the kernel function, whose outcomes are non-negative values. A range of kernel functions are in common use, such as the cosine, linear, and normal (Gaussian) kernels [14,15]. The positive variable h is called the bandwidth; it is a smoothing parameter and exhibits a strong influence on the resulting estimate.
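As a concrete illustration of Formula (1), the following minimal Python sketch estimates a one-dimensional density with scikit-learn's KernelDensity, using the Gaussian kernel and the bandwidth of 0.2 reported in Section 2.5; the toy samples are ours.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Toy one-dimensional samples standing in for (x1, ..., xn) in Formula (1).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=(500, 1))

# Gaussian kernel with bandwidth h = 0.2, the setting used in Section 2.5.
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(samples)

# score_samples returns log f_hat(x); exponentiating gives the PDF values.
query = np.linspace(-3.0, 3.0, 7).reshape(-1, 1)
print(np.exp(kde.score_samples(query)))
```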
2.3. Embedding Vector
In machine learning applications, categorical data must be converted to a special format before subsequent analyses can be performed. The idea is that an embedding vector represents a categorical data item (such as a word in a text) in the form of a multi-dimensional vector. Each element of the vector is a real number, and the contents of the vector can reveal the properties of the original data item [18]. The embedding vector can be generated by the parameter optimization mechanism of a specific neural network architecture [19,20]. The basic concept of the loss function required in the learning process is defined in Formula (2).

$$P(w_{i-m}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+m} \mid w_i) \tag{2}$$

Formula (2) represents the conditional probability of correctly judging the context (i.e., the m words before and after wi, which constitute the contents of the sliding window as wi−m, …, wi−1, wi+1, …, wi+m) given the word vector wi as the input premise. This probability value is increased as much as possible through the parameter optimization mechanism. The conditional probability values of all of the words in the full text (e.g., a total of N words) are then summed, and the logarithm function is used to simplify the computation. The resulting loss function is shown in Formula (3).

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log P(w_{i-m}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+m} \mid w_i) \tag{3}$$
When implementing the program suite of this loss function, the data structure of the Huffman tree can be used to improve the computational performance.
The "word2vec" algorithm proposed by Google in 2013 is currently the mainstream embedding vector algorithm [19,20]. The algorithm combines two learning mechanisms: skip-gram and CBOW (continuous bag of words). In the skip-gram calculation, the word vector wi is used as the input premise, and predictions of the m word vectors before and after wi, which constitute the contents of the sliding window as wi−m, …, wi−1, wi+1, …, wi+m, are produced. Conversely, in the CBOW computation, the 2m word vectors within the sliding window, i.e., wi−m, …, wi−1, wi+1, …, wi+m, are used as the input premises, and the prediction of the word vector wi is output.
2.4. Machine Learning Algorithms
The support vector machine (SVM) is a supervised learning model that analyzes data for classification or regression [21]. Supposing the analysis data can be viewed as vector points in a multi-dimensional space, SVM tries to construct a hyperplane as the discriminator for data categorization. However, since the data points may not be linearly separable in the original space, they can be projected into another multi-dimensional space in which a good categorization is achieved by the hyperplane with the largest functional margin to any data point of any class. Given a dataset of n points (x1, y1), …, (xn, yn), where xi is a multi-dimensional vector and yi is the class label, the hyperplane can be written as the following conditional formula.

$$y_i \left( \mathbf{w} \cdot \varphi(x_i) + b \right) \geq 1$$

In this formula, w is the normal vector of the hyperplane, which is constructed from data points called support vectors; φ(xi) is the projected data point, and b is the computational bias. Satisfying the condition of this formula means that the class label of xi is correctly predicted. Since both w and φ(xi) lie in the projected multi-dimensional space, the inner product between them can be replaced by a specific kernel function, such as the linear kernel or the radial basis function (RBF) kernel [22].
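The following toy sketch shows the kernel substitution in practice with scikit-learn's SVC, comparing the linear and RBF kernels under the parameter settings quoted in Section 2.5; the two-class data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data: points xi with labels yi.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)), rng.normal(1.0, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# The kernel stands in for the inner product of w and phi(xi) in the
# projected space; the settings mirror those reported in Section 2.5.
clf_linear = SVC(kernel="linear", gamma="auto").fit(X, y)
clf_rbf = SVC(kernel="rbf", gamma="auto").fit(X, y)
print(clf_linear.score(X, y), clf_rbf.score(X, y))
```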
The decision tree is a hierarchical model that uses a tree-like structure. In this model, each internal node represents a test on an attribute, and each branch represents an outcome of the test. At the bottom of the structure, each leaf node represents a class label, which is the decision made after evaluating all of the attribute features [23]. The path from the root node to a leaf represents a specific decision rule, and the conditions along the path form a conjunction of "if–then" clauses [24]. The decision tree is a white-box model because the decision rules it produces are easy to understand and interpret. Among the various node-branching functions, the Gini impurity is commonly used and was chosen for this study. Based on the relative frequencies of class labels in the dataset, the Gini impurity measures how often a data item would be incorrectly labeled if it were labeled randomly and independently. For a dataset of items with J class labels and relative frequencies pi, i ∈ {1, 2, …, J}, the probability of correctly recognizing the class label of a data item, assuming it is of class i, is pi. Conversely, the probability of misclassifying that item is 1 − pi. Therefore, the computation formula for the Gini impurity IG(p) is defined as follows.

$$I_G(p) = \sum_{i=1}^{J} p_i (1 - p_i) = 1 - \sum_{i=1}^{J} p_i^2$$

IG(p) reaches its minimum value of zero when all data items in the node fall into a single class.
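A small worked computation of the Gini formula may help; the node label counts below are illustrative.

```python
import numpy as np

def gini_impurity(class_counts):
    """I_G(p) = 1 - sum_i p_i^2 for the class-label counts of one node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([50, 0]))   # pure node: 0.0
print(gini_impurity([25, 25]))  # balanced binary node: the maximum, 0.5
print(gini_impurity([40, 10]))  # 1 - (0.8**2 + 0.2**2) = 0.32
```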
An artificial neural network is a machine learning algorithm that imitates the human nervous system, and its definition formula is as follows [25,26].

$$Y = \sigma(W \cdot X + B)$$

Because the neural network can have multiple input and output neurons, they are assembled into the "input layer" and the "output layer," respectively. Matrix X represents the input values of a set of attributes, and matrix Y simulates the output neurons holding the computation results. The weight matrix W simulates the axons, which connect the input/output neurons and are responsible for transmitting messages; in an application problem, it represents the respective influences of the different attribute features. The matrix B of bias values simulates synapses and represents the degree to which the output neurons are activated. The symbol σ represents the activation function, which accepts a weighted sum of input values and performs a special calculation: if the resulting value is greater than the threshold, the output neuron is activated and the message is transmitted. In addition, "hidden layers" can be added to the network architecture, containing nodes that mimic internal neurons. Although hidden layers make the network structure more complicated, they allow the network to handle more kinds of application problems and to simulate the interactions of more complex attribute features.
2.5. The Analysis Procedure
This study used the concept of knowledge distillation to construct predictive models of bipolar disorder. After the patients and control samples were screened from the MIMIC database, all of their diagnosis records in the database were selected as the input data. In the MIMIC database, an average of 20 different disease diagnoses were recorded for each sample. Using the aforementioned word2vec algorithm, these disease diagnoses were converted into 8-dimensional embedding vectors. Therefore, the input data of each sample was stored in a 20 × 8 matrix structure. The research team then planned two analysis procedures as follows.
Referring to Figure 1a, in the first procedure, KDE was used to estimate the probability density function representing the distribution of the input data X. After the X data were input into the density function, the soft label information Xpdf was produced, which represented the likelihood values of the data distribution of input X. Next, Xpdf was used as the input attributes of the training dataset, and set Y contained the class labels as the learning targets. In this study, supervised learning methods such as the support vector machine, decision tree, and artificial neural network were used to construct predictive models for bipolar disorder.
Referring to Figure 1b, in the second analysis procedure, the KDE method was likewise used to convert the input data X into the soft label information Xpdf. Next, both X and Xpdf were used as the input attributes of the training dataset, and Y was the set of class labels for learning. Finally, the support vector machine, decision tree, and artificial neural network were used to develop predictive models for bipolar disorder.
The application programs for this research work were all written in the Python language. The class "sklearn.neighbors.KernelDensity" of the toolkit "scikit-learn" was used to generate the probability density functions representing the distribution of the input data. Both the Gaussian and the exponential kernels were set via the class parameter "kernel" for estimating the population distribution, and the smoothing parameter "bandwidth" was empirically set to 0.2. The module "gensim.models.Word2Vec" implements the word2vec family of algorithms and was used in this study to produce the embedding vectors. When creating the class instance, the parameter "min_count," which represents the minimum frequency of occurrence of a word, was set to 1; the size of the output vector, i.e., the parameter "vector_size," was set to 8; the parameter "epochs," which represents the number of iterations over the training corpus, was set to 9; and the "sg" parameter of the training algorithm was set to 1, which selects the skip-gram method. All of these parameter settings were selected empirically. On the other hand, the class "sklearn.tree.DecisionTreeClassifier" was used to construct the decision tree learning models. When building the decision tree instance, the class parameter "criterion," which sets the function measuring the quality of a split, was set to "gini" in this study, and we used the default settings for all remaining parameters. Similarly, the class "sklearn.svm.SVC" was used to construct the support vector machine (SVM) learning models. When creating the SVM instance, except for the parameter "kernel," which was set to "linear," and the parameter "gamma," which was set to "auto," we used the default settings for the remaining class parameters. Finally, the application programming interface "TensorFlow.Keras" was used to construct the artificial neural network learning models. When building the network instance, we used the activation function "relu" for the hidden layers and "sigmoid" for the output layer. For optimizing the connection weights of the network architecture, we chose the "adam" algorithm with the loss function "binary_crossentropy."
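To tie these settings together, the sketch below implements the second analysis procedure (Figure 1b) end to end: class-conditional densities are estimated with KDE, their log-density values serve as the soft labels Xpdf, and a decision tree is trained on the concatenation of X and Xpdf. The synthetic feature matrix (standing in for the flattened 20 × 8 embeddings) and the per-class density construction are our assumptions; the text does not specify exactly how Xpdf is assembled.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flattened 20 x 8 embedding matrices (160 features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (1000, 160)),
               rng.normal(0.3, 1.0, (1000, 160))])
y = np.array([1] * 1000 + [0] * 1000)  # 1 = bipolar case, 0 = control

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One KDE per class on the training data (Gaussian kernel, bandwidth 0.2,
# as in Section 2.5); the per-class construction mirrors Figure 2.
kdes = [KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X_tr[y_tr == c])
        for c in (0, 1)]

def soft_labels(data):
    # Log-density of each sample under each class density: this plays Xpdf.
    return np.column_stack([k.score_samples(data) for k in kdes])

# Second procedure (Figure 1b): train on the concatenation [X, Xpdf].
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(np.hstack([X_tr, soft_labels(X_tr)]), y_tr)
print("test accuracy:", clf.score(np.hstack([X_te, soft_labels(X_te)]), y_te))
```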
3. Results
The datasets of this study were composed of cases of patients with bipolar disorder and matched control samples, at matching ratios of 1:1 and 1:3. The distributions of these data were computed using KDE to produce the corresponding probability density functions as the soft label information for subsequent knowledge distillation. When using a machine learning algorithm to construct a prediction model for bipolar disorder, a randomly selected 80% of the data samples was used for model training and validation, and the remaining 20% was used as the test set.
In the following paragraphs of this paper, we define a specific sequence to express the architecture of a neural network. For an architecture containing three hidden layers with v1, v2, and v3 nodes, respectively, we write NN (v1, v2, v3, 1); architectures with two hidden layers are written analogously. Since the training dataset only contained cases of patients with bipolar disorder and control samples without any mental illness, all of the learning models constructed in this study were binary predictors. In other words, these learning models were used to predict whether an input data sample came from a bipolar disorder patient. Therefore, the final 1 in the sequence represents the single node of the output layer. Three types of network architecture were evaluated in this study: NN (80, 10, 1), NN (160, 40, 1), and NN (80, 20, 10, 1). All of these architectures were tested and verified empirically.
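For example, NN (80, 10, 1) could be built with TensorFlow.Keras roughly as follows, using the activation, optimizer, and loss settings reported in Section 2.5; the input dimension of 160 (a flattened 20 × 8 embedding matrix) is our assumption.

```python
import tensorflow as tf

# NN (80, 10, 1): hidden layers of 80 and 10 nodes plus a single output node.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(80, activation="relu", input_shape=(160,)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Optimizer and loss as reported in Section 2.5.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```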
Because the learning models in this study were all binary predictors of bipolar disorder, we adopted the terminology of the confusion matrix: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The following metrics were utilized to evaluate the performances of the prediction models trained by the various machine learning algorithms.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$PPV = \frac{TP}{TP + FP}, \qquad NPV = \frac{TN}{TN + FN}$$

$$TPR = \frac{TP}{TP + FN}, \qquad TNR = \frac{TN}{TN + FP}$$
For the dataset of patients and control samples with a matching ratio of 1:1, the respective probability density functions estimated by KDE are presented in Figure 2 in the format of a curve chart. Observing the content of Figure 2, we can see that the respective probability density functions of the patients and control samples were quite different. In other words, the two groups exhibited very different data distributions in the diagnostic records used as characteristic attributes.
Next, we tested whether the data distribution information estimated by KDE was helpful for constructing the learning models. In our first analysis procedure (Figure 1a), the soft label information Xpdf, which represented the likelihood values of the data distribution of input X, was used as the attributes for training and validating the prediction models. The evaluation results for the test set are shown in Table 1.
In our second analysis procedure (Figure 1b), both X and Xpdf were used as the input attributes for training and validating the prediction models. The evaluation results for the test set are shown in Table 2.
Finally, in order to verify the effectiveness of the soft label information Xpdf, only the data X were used as attributes for training and validating the prediction models. The evaluation results for the test set are shown in Table 3.
Comparing the results shown in Table 1 and Table 3, it can be seen that using only the soft label information Xpdf as the input attributes did not always improve the performance of the predictive models. On the other hand, in order to quantitatively evaluate whether the prediction performance of the learning models constructed using both X and Xpdf (Table 2) was better than that of the models constructed using X only (Table 3), we adopted the concept of the odds ratio (OR). We focused on the performance of the prediction models in correctly identifying positive samples, that is, bipolar disorder patients. Therefore, the definition formulas of these evaluation metrics are described as follows.

$$OR_{\mathrm{Accuracy}} = \frac{(TP_1 + TN_1)/(FP_1 + FN_1)}{(TP_2 + TN_2)/(FP_2 + FN_2)}, \qquad OR_{PPV} = \frac{TP_1/FP_1}{TP_2/FP_2}, \qquad OR_{TPR} = \frac{TP_1/FN_1}{TP_2/FN_2}$$

Among these formulas, the variables TPj, FPj, TNj, and FNj represent, respectively, the numbers of true positive, false positive, true negative, and false negative predictions output by model j. The OR and corresponding 95% confidence interval (CI) values of the evaluation results, with the learning model constructed using both X and Xpdf regarded as model 1 and the model constructed using X only regarded as model 2, are presented in Table 4. It can be observed that most of the learning models constructed using both X and Xpdf performed better in terms of accuracy, PPV, and TPR than the models constructed using X only.
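As a sketch of how such an OR and its 95% CI can be computed from the confusion-matrix counts of two models, the helper below applies the standard Wald approximation on the log-odds scale; the counts in the usage line are illustrative, and the exact CI method used here is not stated in the text.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR = (a/b) / (c/d), with a Wald 95% CI computed on the log scale.

    For TPR, for instance: a = TP1, b = FN1, c = TP2, d = FN2.
    """
    or_value = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se)
    upper = math.exp(math.log(or_value) + z * se)
    return or_value, (lower, upper)

# Illustrative counts: model 1 finds 850 of 1000 positives, model 2 finds 700.
print(odds_ratio_ci(850, 150, 700, 300))
```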
In order to confirm that the data distributions generated by KDE played a role in knowledge distillation, we randomly selected patients and matched control samples to form the dataset and repeated this process 10 times. Each time, we used KDE to generate the soft label data Xpdf, and the Xpdf data were then utilized to train a decision tree. Finally, we examined the decision rules accompanying each tree structure and counted the features in Xpdf most frequently chosen as branching attributes (a sketch of this counting procedure follows the lists below). The disease diagnoses corresponding to these branching attributes are listed below in descending order of selection frequency.
For decision rules leading to a positive label of bipolar disorder, the most frequent branching attributes included hypertension, depressive disorder, anxiety disorder, suicidal ideations, type II diabetes mellitus, hyperlipidemia, esophageal reflux, chest pain, nicotine dependence, asthma, hypercholesterolemia, hypothyroidism, and alcohol abuse.
For decision rules leading to a negative label of bipolar disorder, the most frequent branching attributes included hypertension, hyperlipidemia, type II diabetes mellitus, chest pain, alcohol abuse, esophageal reflux, atrial fibrillation, hypercholesterolemia, depressive disorder, atherosclerosis/coronary heart disease, abdominal pain, urinary tract infection, hypothyroidism, nicotine dependence, headache, and syncope and collapse.
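The following sketch shows one way to perform the counting step just described with scikit-learn, reading the split features of each fitted tree from its tree_ attribute; the resampled data here are synthetic placeholders for the Xpdf matrices.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def branching_feature_counts(clf):
    """Count how often each feature index is used as a split in a fitted tree."""
    features = clf.tree_.feature  # negative values mark leaf nodes
    return Counter(int(f) for f in features if f >= 0)

# Illustrative: repeat over 10 resampled datasets, as described in the text.
rng = np.random.default_rng(0)
totals = Counter()
for _ in range(10):
    X = rng.normal(size=(500, 20))           # stand-in for an Xpdf matrix
    y = rng.integers(0, 2, size=500)
    tree = DecisionTreeClassifier(criterion="gini", max_depth=5).fit(X, y)
    totals.update(branching_feature_counts(tree))

# Feature indices most frequently chosen as branching attributes.
print(totals.most_common(5))
```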
For the dataset of patients and control samples with a matching ratio of 1:3, the respective probability density functions estimated by KDE are presented in Figure 3 in the format of a curve chart. Again, it can be seen that the patients and control samples exhibited very different data distributions in the diagnostic records.
For this dataset, the evaluation results on the test set for the prediction models of bipolar disorder trained using the various learning algorithms are presented in Table 5, Table 6 and Table 7. Comparing the results shown in Table 5 and Table 7, the prediction models using only the soft label information Xpdf as the input attributes consistently performed worse than the models trained using the input data X. However, comparing the results shown in Table 6 and Table 7, using both X and Xpdf as the input attributes for training the prediction models improved all evaluation metrics.
For cases and controls with a matching ratio of 1:3, the OR and corresponding 95% CI values between the learning models constructed using both X and Xpdf (Table 6) and the models constructed using X only (Table 7) are presented in Table 8. Similarly, it can be observed that most of the learning models constructed using both X and Xpdf performed better in terms of accuracy, PPV, and TPR than the models constructed using X only. In addition, we found that when the matching ratio of cases and controls was increased to 1:3, the learning model constructed using SVM and the X input attributes showed an obvious tendency to predict samples as negative (Table 7). Therefore, although this model achieved the best PPV of 0.719, it also had the worst TPR of 0.148. Using the soft label information Xpdf can significantly alleviate this problem caused by the unbalanced proportion of answer categories. Although the PPV values of the SVM models constructed using both X and Xpdf were reduced to 0.640–0.639, their TPR was improved to 0.593–0.583 (Table 6). This phenomenon verifies that Xpdf does provide information useful for identifying bipolar disorder patients.
Finally, for the dataset of patients and controls with a matching ratio of 1:3, the decision tree analysis described above was executed again. As before, we examined the decision rules accompanying the produced tree structures and counted the features in Xpdf most frequently chosen as branching attributes. The disease diagnoses corresponding to these branching attributes are listed below in descending order of selection frequency.
For decision rules leading to a positive label of bipolar disorder, the most frequent branching attributes included hypertension, depressive disorder, anxiety disorder, suicidal ideations, type II diabetes mellitus, esophageal reflux, hyperlipidemia, nicotine dependence, hypercholesterolemia, asthma, chest pain, hypothyroidism, and atherosclerosis/coronary heart disease.
For decision rules leading to a negative label of bipolar disorder, the most frequent branching attributes included hypertension, hyperlipidemia, type II diabetes mellitus, esophageal reflux, chest pain, depressive disorder, alcohol abuse, hypercholesterolemia, atherosclerosis/coronary heart disease, atrial fibrillation, nicotine dependence, hypothyroidism, headache, urinary tract infection, abdominal pain, and syncope and collapse.
4. Discussion
In the evaluation results of this study, the predictive performance of the models trained only with the soft label information Xpdf was not always better than that of the models trained with only the input data X (Table 1 vs. Table 3 and Table 5 vs. Table 7). On the other hand, regardless of the matching ratio of patients and control samples, we found that as long as the soft label information Xpdf was combined with the input data X to train the prediction models, the evaluation indicators PPV and TPR for identifying positive test samples improved. At the same time, the indicators NPV and TNR for identifying negative samples also improved (Table 2 vs. Table 3 and Table 6 vs. Table 7). In order to quantitatively evaluate whether the prediction performance of the learning models constructed using both X and Xpdf was better than that of the models constructed using X only, we computed the OR and corresponding 95% CI values for the performance measures of accuracy, PPV, and TPR. For patients and controls with a matching ratio of 1:1, the OR values of accuracy ranged from 2.045 to 0.996. Similarly, the OR values of PPV ranged from 1.410 to 0.983, and those of TPR ranged from 4.427 to 0.839 (Table 4). This verified that most of the learning models constructed using both X and Xpdf performed better in terms of accuracy, PPV, and TPR than the models constructed using X only. In other words, additionally using the soft label information Xpdf can improve the accuracy of the prediction models in identifying bipolar disorder patients. At the same time, increased PPV values represent fewer false positive samples, whereas increased TPR values represent the identification of more potential positive samples. On the other hand, for cases and controls with a matching ratio of 1:3, the OR values of accuracy ranged from 2.048 to 1.071. Similarly, the OR values of PPV ranged from 4.181 to 0.691, and those of TPR ranged from 8.362 to 0.889 (Table 8). Consequently, we can still conclude that most of the learning models constructed using both X and Xpdf performed better than the models constructed using X only.
In this study, we used medical history information readily available from the electronic health record system to try to overcome the limitation of previous studies that needed special instruments to obtain data. In addition, we introduced the concept of knowledge distillation and combined KDE with other machine learning algorithms to train diagnostic models for bipolar disorder. For patients and controls with a matching ratio of 1:1, the evaluation results of our diagnostic models yielded an accuracy of 0.810–0.659, a PPV of 0.806–0.700, an NPV of 0.846–0.626, a TPR of 0.854–0.524, and a TNR of 0.808–0.705 (Table 1, Table 2 and Table 3). The mean values of these metrics were an accuracy of 0.774, a PPV of 0.764, an NPV of 0.786, a TPR of 0.787, and a TNR of 0.761. Moreover, for patients and controls with a matching ratio of 1:3, the evaluation results of our prediction models yielded an accuracy of 0.894–0.746, a PPV of 0.862–0.493, an NPV of 0.924–0.775, a TPR of 0.773–0.502, and a TNR of 0.981–0.807 (Table 5, Table 6 and Table 7). The mean values of these metrics were an accuracy of 0.808, a PPV of 0.629, an NPV of 0.871, a TPR of 0.610, and a TNR of 0.875. Referring to the review of 33 studies based on machine learning models used to diagnose patients with bipolar disorder, the evaluation measure of accuracy is reported in 24 studies; the values range from 0.98 to 0.64, and the mean value is 0.8206. Moreover, the measure of TPR is reported in 15 studies; the values range from 0.875 to 0.664, and the mean value is 0.7826. Finally, the measure of TNR is reported in 13 studies; the values range from 0.971 to 0.742, and the mean value is 0.854 [6]. These past studies used different types of data and various machine learning algorithms to construct predictive models for bipolar disorder, and the sizes of the datasets they collected also varied greatly. Therefore, we cannot directly compare the measured values to evaluate predictive performance. However, for the performance metrics of accuracy, TPR, and TNR, the prediction models we constructed obtained mean values close to the results of past studies, and there is obvious overlap in the value ranges of these performance measures. It is worth noting that the analysis data of previous studies required special medical instruments to obtain, whereas our study instead used medical history information readily available from the electronic health record system and achieved similar prediction performance. This verifies that effective identification models for bipolar disorder patients can be constructed using medical history data and various machine learning algorithms. Furthermore, in view of the fact that most previous studies had sample sizes of less than 1000, we collected tens of thousands of data samples to improve the representativeness of the constructed prediction models.
In order for knowledge distillation to improve the prediction performance of the trained model, the soft label information must provide accurate distribution conditions of the input data. G. Hinton et al. argued that the probability distribution values produced by a deep learning model can be transferred to a shallow "distilled" learning model [13]. In addition, the research work of A. Korattikara Balan et al. found that the final outcomes of the student network can be thought of as approximating the conditional probabilities provided by the teacher group [11]. Since it is known that KDE has been applied to estimating the conditional probability distributions of input data when using a naive Bayes classifier [16,27], our study was inspired to incorporate KDE into knowledge distillation to construct prediction models for bipolar disorder. Furthermore, when KDE is used for data analysis, the bandwidth setting has a great influence on the accurate estimation of data distributions, and numerous studies have discussed the criteria for setting this parameter [14,15,28]. A novel KDE method has been proposed to minimize the bias part of the mean square error and to elevate the bandwidths of the kernel functions to alleviate the effects of variance. It has been verified that this novel KDE can estimate the distributions of input data more accurately than many traditional KDE methods [29,30,31,32]. Therefore, one of our future works will focus on using this novel KDE for knowledge distillation to construct more accurate predictive models.
In order to further verify the effectiveness of the soft label information Xpdf generated by KDE, we examined the decision rules of the tree structures constructed with Xpdf. Regardless of the matching ratio of patients and control samples, we found that identical disease diagnoses were selected as the branching attributes in the analysis results. The contents of Xpdf were not categorical disease descriptions but likelihood values of the probability density functions generated by KDE. Therefore, the features selected as branching attributes in the decision rules needed to be mapped back to the categorical disease descriptions. Since identical disease diagnoses were always selected as the branching attributes, Xpdf did provide correct distribution information for the input data. In addition, through a survey of the literature, we found various associations between bipolar disorder and the disease diagnoses selected by the decision trees. It is known that 6% of patients with bipolar disorder die by suicide and another 30–40% engage in self-harm [1]. Many patients with bipolar disorder also suffer from other mental illnesses, such as anxiety disorders, schizophrenia, substance abuse, etc. Furthermore, one typical symptom of the depressive phase of bipolar disorder is fatigue [1]. Moreover, some diseases have a higher incidence in patients with bipolar disorder than in the general population, including metabolic syndrome, migraine, obesity, and type II diabetes [5]. In addition, compared to the general population, patients with bipolar disorder have twice the risk of dying from coronary heart disease [1]. Meanwhile, hypertension, hyperlipidemia, hypercholesterolemia, chest pain, etc., are typical risk factors for and symptoms of coronary heart disease.
A recent cross-sectional study concluded that a history of asthma is common among patients with bipolar disorder [33]. Some medical illnesses have clinical presentations similar to symptoms of bipolar disorder, such as migraine headache, hypothyroidism, and hyperthyroidism [34]. Another study, conducted in Sweden, found higher odds of bipolar disorder in patients with gastroesophageal reflux disease [35]. Furthermore, a recent genome-wide pleiotropic association study found that the pleiotropic genetic determinants shared between gastrointestinal tract diseases and bipolar disorder are extensively distributed across the genome [36]. Moreover, ketamine has been used in the treatment of bipolar disorder, and it has been reported that longstanding ketamine abuse may cause urinary tract infection [37]. The analysis performed by Adam L Urback et al. concluded that bipolar disorder is associated with cerebrovascular dysfunction, pointing to areas of the brain that are predisposed to cerebrovascular diseases [38]. The research work of Paul J Harrison et al. showed that bipolar disorder may increase the risk of developing cerebrovascular disease and stroke [39]. A follow-up assessment of bipolar disorder patients conducted by Sermin Kesebir et al. found that a family history of diabetes mellitus was strongly associated with bipolar disorder and that a family history of thyroid disease was correlated with co-occurring anxiety disorders. Finally, that study also observed a comorbid association between bipolar disorder and cerebrovascular disease [40].