1. Introduction
Bipolar disorder is a severe mood disorder characterized by alternating episodes of depression and mania [1,2]. During periods of mania, patients may exhibit unusually energetic, happy, or irritable behavior and have reduced sleep. During depression, patients may cry inexplicably, have a negative attitude toward life, and make poor eye contact with others. According to statistics, 6% of patients with bipolar disorder die by suicide and another 30–40% engage in self-harm. Many patients with bipolar disorder also suffer from other mental illnesses, such as substance abuse and anxiety disorders. According to academic research, people with bipolar disorder account for about 1% of the global population [3]. In the United States, approximately 3% of the population experiences bipolar symptoms at some point in their lives, with no significant gender differences [4]. The most common age of symptom onset is between 20 and 25 years, and the younger the age of onset, the worse the prognosis [5].
It is known that machine learning provides advanced techniques for better diagnosis of illnesses. A recent review explored studies based on machine learning models used to diagnose patients with bipolar disorder [6]. After preprocessing and screening, 33 articles that met the inclusion criteria were identified. Among them, various types of data and machine learning algorithms were used to develop models for the diagnosis of bipolar disorder. The accuracy ratios of these studies were very inconsistent, ranging from 0.98 to 0.64. Therefore, there is still potential for improvement in the predictive performance on this topic. Among previous studies based on machine learning models for detecting patients with bipolar disorder, only the study by Paulo J. C. Suen et al. used feature data from an electronic health record system, but their target dataset consisted of only 155 samples [7]. In addition, the feature data used in that study included personality characteristics, depression severity, anxiety level, affect scale, etc., all of which require clinical assessments by professional psychologists and psychiatrists. Two supervised learning methods were adopted to train classification models: logistic regression and the XGBoost algorithm. The results of that study showed an accuracy of 0.78–0.57, a true positive rate (TPR) of 0.75–0.50, a true negative rate (TNR) of 0.81–0.64, a positive predictive value (PPV) of 0.69–0.43, and a negative predictive value (NPV) of 0.86–0.70. On the other hand, two previous studies constructed diagnostic models for bipolar disorder using kernel-based approaches. In the research of Benson Mwangi et al., gray and white matter density maps were obtained from neuroimaging scans and were analyzed with the relevance vector machine algorithm. However, the dataset consisted of only 256 samples, and the accuracy ratios of the trained models for detecting bipolar disorder ranged from 0.703 to 0.649. Results for other evaluation metrics included a TNR of 0.742–0.711, a TPR of 0.664–0.586, a PPV of 0.714–0.671, and an NPV of 0.685–0.634 [8]. Likewise, in the study by Julia O. Linke et al., diffusion tensor imaging data were acquired from 118 participants and were used to train Gaussian process classifiers to identify bipolar disorder patients. Evaluation results of that study included a TPR of 0.682–0.611, a TNR of 0.842–0.591, and an accuracy of 0.754–0.601 [9]. In addition, many other studies have tried to identify bipolar disorder patients with machine learning methods [6]. In the study by M.I. El Gohary et al., models trained with a support vector machine were used to discriminate bipolar disorder patients from control samples based on recordings of the participants' electroencephalography rhythms. The evaluation results showed that their prediction model could reach an accuracy of 0.980–0.740, the best performance reported so far. However, only 230 samples were collected in that study, so the validity of the prediction model needs further verification [10]. In summary, since these previous studies used different datasets to construct models for identifying bipolar disorder patients, we cannot directly compare their evaluation results. However, almost all of their feature data must be obtained through special medical instruments or biochemical tests, such as immune-inflammatory signatures, blood samples, magnetic resonance imaging, electroencephalography, genomic data, etc. [6]. Therefore, the lack of easy access to analytical data presents a barrier to the development of machine learning models. Moreover, among these 33 studies, only 2 (6%) had a sample size greater than 2000; the analysis data of all remaining studies contained fewer than 1000 samples [6]. Theoretically, the smaller the analysis sample size, the more limited the representativeness of the contained data. This may be why the accuracy ratios of these studies are so inconsistent.
There is an intuitive way to improve the performance of machine learning: different models can be trained using the same dataset, and their prediction outcomes can then be integrated. To this end, the idea of "knowledge distillation" has been proposed and verified in various studies. In the practice of knowledge distillation, a sophisticated model or multiple models are first trained using any learning algorithm. The outcomes produced by this group of "teacher models" can be thought of as conditional distributions over the input data and may be referred to as "soft labels." These data distributions can be used as the learning targets for a "student model," which is trained using a simpler learning architecture [11,12]. Alternatively, the soft labels can be used as reference information to train the student model together with the original input data. This process may be seen as the student model "distilling" the "knowledge" provided by the group of teacher models [13].
Kernel density estimation (KDE), a nonparametric estimation approach in statistics, has been widely exploited to identify distributions in various types of datasets. A kernel density estimator generates an approximate probability density function (PDF) by computing a linear combination of weighted kernel functions placed at the locations of all data instances in the vector space [14,15,16]. Inspired by the aforementioned previous research, we attempted to address its limitations while developing machine learning models for the diagnosis of bipolar disorder. Firstly, the type of data analyzed in this study is medical history information, which is readily available from the electronic health record system. Secondly, the dataset of this study contains tens of thousands of samples, which is much larger than the sample sizes of existing studies, thus improving the representativeness of the constructed models. Moreover, based on the concept of knowledge distillation, the PDF values produced by the KDE method were transferred as soft labels to construct prediction models of bipolar disorder using various machine learning methods. According to the evaluation results, using the data distribution information generated by KDE did indeed improve the predictive performance of the diagnostic models for bipolar disorder. In addition, the branching attributes selected by the decision trees were mapped back to specific disease diagnoses, all of which are associated with bipolar disorder. To the best of our knowledge, this study is the first attempt to apply KDE to knowledge distillation for supervised machine learning.
2. Materials and Methods
2.1. The Input Data
In the early 2000s, the Laboratory for Computational Physiology at the Massachusetts Institute of Technology (MIT) began to implement the project Integrating Signals, Models and Reasoning in Critical Care. The main goal of this project was to build a large dataset for research based on intensive care, the result of which is the Medical Information Mart for Intensive Care (MIMIC) database. The contents of this database come from Beth Israel Deaconess Medical Center (BIDMC). MIMIC is a publicly shared medical database. It contains de-identified information from electronic medical records for thousands of adult patients admitted to medical/surgical intensive care units and emergency wards. The development of this database was approved by the ethical review boards of BIDMC and MIT. MIMIC has been used extensively by academic researchers around the world to help promote advances in clinical informatics, epidemiology, and machine learning [17].
In the database tables of MIMIC, all of the information of a single patient is linked through the field value of "subject_id." In this case-control study, the case group included patients with bipolar disorder and/or related symptoms. The following diagnostic codes were used when selecting case samples from the "diagnoses_icd" table: the ICD-9 codes 296.40–296.45, 296.50–296.56, 296.60–296.62, 295, and 298, and the ICD-10 codes F20, F29, and F31. Then, 10,000 people were randomly selected from these bipolar disorder patients to form the case group. The date on which bipolar disorder was first diagnosed for each patient, i.e., the field value of "admittime," was regarded as the index date. Finally, for each patient, the subject_id was used to retrieve all of his/her diagnosis records in the database.
On the other hand, the control samples did not have diagnoses of bipolar disorder or any associated symptoms in the database. They were matched with the patients in terms of age and gender, i.e., the field values of "gender" and "anchor_age" from the "patients" table. In addition, in the month of a patient's index date, the corresponding control sample needed to have a diagnosis record representing a similar health status. Based on the aforementioned matching conditions, this study selected control samples at ratios of 1 to 1 (i.e., 10,000 samples) and 1 to 3 (i.e., 30,000 samples). Finally, for each control sample, the subject_id was used to retrieve all of his/her diagnosis records from the database to form the input data.
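As an illustration of this cohort-selection step, the following minimal pandas sketch screens case samples by ICD code and derives index dates. The CSV file names and the assumption that ICD-9 codes are stored without decimal points are ours; this is a sketch of the selection logic, not the authors' actual program.

```python
import pandas as pd

# Hypothetical CSV exports of the MIMIC tables named in the text.
diag = pd.read_csv("diagnoses_icd.csv")  # subject_id, hadm_id, icd_code, icd_version
adm = pd.read_csv("admissions.csv")      # subject_id, hadm_id, admittime

# Case-defining codes (ICD-9 codes assumed stored without decimal points).
icd9 = tuple(f"296{s}" for s in
             ("40", "41", "42", "43", "44", "45", "50", "51", "52",
              "53", "54", "55", "56", "60", "61", "62")) + ("295", "298")
icd10 = ("F20", "F29", "F31")

codes = diag["icd_code"].astype(str)
is_case = (((diag["icd_version"] == 9) & codes.str.startswith(icd9)) |
           ((diag["icd_version"] == 10) & codes.str.startswith(icd10)))

# Randomly select 10,000 of the qualifying patients to form the case group.
case_ids = diag.loc[is_case, "subject_id"].drop_duplicates().sample(
    n=10000, random_state=0)

# Index date: the admittime of the first admission at which the diagnosis appears.
index_dates = (diag[is_case & diag["subject_id"].isin(case_ids)]
               .merge(adm, on=["subject_id", "hadm_id"])
               .groupby("subject_id")["admittime"].min())
```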
2.2. Kernel Density Estimation
Kernel density estimation (KDE) is the application of kernel smoothing to probability density estimation, i.e., a non-parametric method for estimating the probability density function of a random variable using kernels as weights [14,15]. KDE addresses a fundamental data-smoothing problem in which inferences about the population are made from a finite data sample [16]. For the basic definition of KDE, let (x1, x2, …, xn) be independent and identically distributed samples drawn from a distribution with an unknown density f. The kernel density estimator of f at any given point x is defined by Formula (1).

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K(x - x_i;\, h) \tag{1}$$

In Formula (1), K(x − xi; h) is the kernel function, whose outcomes are non-negative values. A range of kernel functions are in common use, such as the cosine, linear, and normal (Gaussian) kernels [14,15]. The positive variable h is called the bandwidth; it is a smoothing parameter and exhibits a strong influence on the resulting estimate.
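As a concrete illustration of Formula (1), the following minimal Python sketch estimates a one-dimensional density with scikit-learn's KernelDensity, using the Gaussian kernel and the bandwidth of 0.2 reported in Section 2.5; the toy samples are ours.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Toy one-dimensional samples standing in for (x1, ..., xn) in Formula (1).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=(500, 1))

# Gaussian kernel with bandwidth h = 0.2, the setting used in Section 2.5.
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(samples)

# score_samples returns log f_hat(x); exponentiating gives the PDF values.
query = np.linspace(-3.0, 3.0, 7).reshape(-1, 1)
print(np.exp(kde.score_samples(query)))
```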
2.3. Embedding Vector
In machine learning applications, categorical data must be converted to a special format before subsequent analyses can be performed. The idea is that an embedding vector represents a categorical data item (such as a word in a text) in the form of a multi-dimensional vector. Each element of the vector is a real number, and the contents of the vector can reveal the properties of the original data item [18]. The embedding vector can be generated by the parameter optimization mechanism of a specific neural network architecture [19,20]. The basic concept of the loss function required in the learning process is defined in Formula (2).

$$P(w_{i-m}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+m} \mid w_i) \tag{2}$$

Formula (2) represents the conditional probability of correctly judging the context (i.e., the m words before and after wi, which constitute the contents of the sliding window as wi−m, …, wi−1, wi+1, …, wi+m) given the word vector wi as the input premise. This probability value is increased as much as possible through the parameter optimization mechanism. The conditional probability values of all of the words in the full text (e.g., a total of N words) are then summed, and the logarithm function is used to simplify the computation. The resulting loss function is shown in Formula (3).

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log P(w_{i-m}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+m} \mid w_i) \tag{3}$$
When implementing the program suite of this loss function, the data structure of the Huffman tree can be used to improve the computational performance.
The "word2vec" algorithm proposed by Google in 2013 is currently the mainstream embedding vector algorithm [19,20]. The algorithm combines two learning mechanisms: skip-gram and CBOW (continuous bag of words). In the skip-gram calculation, the word vector wi is used as the input premise, and predictions of the m word vectors before and after wi, which constitute the contents of the sliding window as wi−m, …, wi−1, wi+1, …, wi+m, are produced. Conversely, in the CBOW computation, the 2m word vectors within the sliding window, i.e., wi−m, …, wi−1, wi+1, …, wi+m, are used as the input premises, and the prediction of the word vector wi is output.
2.4. Machine Learning Algorithms
The support vector machine (SVM) is a supervised learning model that analyzes data for classification or regression [21]. Supposing the analysis data can be viewed as vector points in a multi-dimensional space, SVM tries to construct a hyperplane as the discriminator for data categorization. However, since the data points may not be linearly separable in the original space, they can be projected into another multi-dimensional space in which a good categorization is achieved by the hyperplane with the largest functional margin to any data point of any class. Given a dataset of n points (x1, y1), …, (xn, yn), where xi is a multi-dimensional vector and yi is the class label, the hyperplane can be written as the following conditional formula.

$$y_i \left( \mathbf{w} \cdot \varphi(x_i) + b \right) \geq 1$$

In this formula, w is the normal vector of the hyperplane, which is constructed from data points called support vectors; φ(xi) is the projected data point, and b is the computational bias. Satisfying the condition of this formula means that the class label of xi is correctly predicted. Since both w and φ(xi) lie in the projected multi-dimensional space, the inner product between them can be replaced by a specific kernel function, such as the linear kernel or the radial basis function (RBF) kernel [22].
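The following toy sketch shows the kernel substitution in practice with scikit-learn's SVC, comparing the linear and RBF kernels under the parameter settings quoted in Section 2.5; the two-class data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data: points xi with labels yi.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)), rng.normal(1.0, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# The kernel stands in for the inner product of w and phi(xi) in the
# projected space; the settings mirror those reported in Section 2.5.
clf_linear = SVC(kernel="linear", gamma="auto").fit(X, y)
clf_rbf = SVC(kernel="rbf", gamma="auto").fit(X, y)
print(clf_linear.score(X, y), clf_rbf.score(X, y))
```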
The decision tree is a hierarchical model that uses a tree-like structure. In this model, each internal node represents a test on an attribute, and each branch represents an outcome of the test. At the bottom of the structure, each leaf node represents a class label, which is the decision made after evaluating all of the attribute features [23]. The path from the root node to a leaf represents a specific decision rule, and the conditions along the path form a conjunction of "if–then" clauses [24]. The decision tree is a white-box model because the decision rules it produces are easy to understand and interpret. Among the various node-branching functions, the Gini impurity is commonly used and was chosen for this study. Based on the relative frequencies of class labels in the dataset, the Gini impurity measures how often a data item would be incorrectly labeled if it were labeled randomly and independently. For a dataset of items with J class labels and relative frequencies pi, i ∈ {1, 2, …, J}, the probability of correctly recognizing the class label of a data item, assuming it is of class i, is pi. Conversely, the probability of misclassifying that item is 1 − pi. Therefore, the computation formula for the Gini impurity IG(p) is defined as follows.

$$I_G(p) = \sum_{i=1}^{J} p_i (1 - p_i) = 1 - \sum_{i=1}^{J} p_i^2$$

IG(p) reaches its minimum value of zero when all data items in the node fall into a single class.
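A small worked computation of the Gini formula may help; the node label counts below are illustrative.

```python
import numpy as np

def gini_impurity(class_counts):
    """I_G(p) = 1 - sum_i p_i^2 for the class-label counts of one node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([50, 0]))   # pure node: 0.0
print(gini_impurity([25, 25]))  # balanced binary node: the maximum, 0.5
print(gini_impurity([40, 10]))  # 1 - (0.8**2 + 0.2**2) = 0.32
```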
An artificial neural network is a machine learning algorithm that imitates the human nervous system, and its definition formula is as follows [25,26].

$$Y = \sigma(W \cdot X + B)$$

Because the neural network can have multiple input and output neurons, they are assembled into the "input layer" and the "output layer," respectively. Matrix X represents the input values of a set of attributes, and matrix Y simulates the output neurons holding the computation results. The weight matrix W simulates the axons, which connect the input/output neurons and are responsible for transmitting messages; in an application problem, it represents the respective influences of the different attribute features. The matrix B of bias values simulates synapses and represents the degree to which the output neurons are activated. The symbol σ represents the activation function, which accepts a weighted sum of input values and performs a special calculation: if the resulting value is greater than the threshold, the output neuron is activated and the message is transmitted. In addition, "hidden layers" can be added to the network architecture, containing nodes that mimic internal neurons. Although hidden layers make the network structure more complicated, they allow the network to handle more kinds of application problems and to simulate the interactions of more complex attribute features.
2.5. The Analysis Procedure
This study used the concept of knowledge distillation to construct predictive models of bipolar disorder. After the patients and control samples were screened from the MIMIC database, all of their diagnosis records in the database were selected as the input data. In the MIMIC database, an average of 20 different disease diagnoses were recorded for each sample. Using the aforementioned word2vec algorithm, these disease diagnoses were converted into 8-dimensional embedding vectors. Therefore, the input data of each sample was stored in a 20 × 8 matrix structure. The research team then planned two analysis procedures as follows.
Referring to Figure 1a, in the first procedure, KDE was used to estimate the probability density function representing the distribution of the input data X. After the X data were input into the density function, the soft label information Xpdf was produced, which represented the likelihood values of the data distribution of input X. Next, Xpdf was used as the input attributes of the training dataset, and set Y contained the class labels as the learning targets. In this study, supervised learning methods such as the support vector machine, decision tree, and artificial neural network were used to construct predictive models for bipolar disorder.
Referring to Figure 1b, in the second analysis procedure, the KDE method was likewise used to convert the input data X into the soft label information Xpdf. Next, both X and Xpdf were used as the input attributes of the training dataset, and Y was the set of class labels for learning. Finally, the support vector machine, decision tree, and artificial neural network were used to develop predictive models for bipolar disorder.
The application programs for this research work were all written in the Python language. The class "sklearn.neighbors.KernelDensity" of the toolkit "scikit-learn" was used to generate the probability density functions representing the distribution of the input data. Both the Gaussian and the exponential kernels were set via the class parameter "kernel" for estimating the population distribution, and the smoothing parameter "bandwidth" was empirically set to 0.2. The module "gensim.models.Word2Vec" implements the word2vec family of algorithms and was used in this study to produce the embedding vectors. When creating the class instance, the parameter "min_count," which represents the minimum frequency of occurrence of a word, was set to 1; the size of the output vector, i.e., the parameter "vector_size," was set to 8; the parameter "epochs," which represents the number of iterations over the training corpus, was set to 9; and the "sg" parameter of the training algorithm was set to 1, which selects the skip-gram method. All of these parameter settings were selected empirically. On the other hand, the class "sklearn.tree.DecisionTreeClassifier" was used to construct the decision tree learning models. When building the decision tree instance, the class parameter "criterion," which sets the function measuring the quality of a split, was set to "gini" in this study, and we used the default settings for all remaining parameters. Similarly, the class "sklearn.svm.SVC" was used to construct the support vector machine (SVM) learning models. When creating the SVM instance, except for the parameter "kernel," which was set to "linear," and the parameter "gamma," which was set to "auto," we used the default settings for the remaining class parameters. Finally, the application programming interface "TensorFlow.Keras" was used to construct the artificial neural network learning models. When building the network instance, we used the activation function "relu" for the hidden layers and "sigmoid" for the output layer. For optimizing the connection weights of the network architecture, we chose the "adam" algorithm with the loss function "binary_crossentropy."
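To tie these settings together, the sketch below implements the second analysis procedure (Figure 1b) end to end: class-conditional densities are estimated with KDE, their log-density values serve as the soft labels Xpdf, and a decision tree is trained on the concatenation of X and Xpdf. The synthetic feature matrix (standing in for the flattened 20 × 8 embeddings) and the per-class density construction are our assumptions; the text does not specify exactly how Xpdf is assembled.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flattened 20 x 8 embedding matrices (160 features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (1000, 160)),
               rng.normal(0.3, 1.0, (1000, 160))])
y = np.array([1] * 1000 + [0] * 1000)  # 1 = bipolar case, 0 = control

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One KDE per class on the training data (Gaussian kernel, bandwidth 0.2,
# as in Section 2.5); the per-class construction mirrors Figure 2.
kdes = [KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X_tr[y_tr == c])
        for c in (0, 1)]

def soft_labels(data):
    # Log-density of each sample under each class density: this plays Xpdf.
    return np.column_stack([k.score_samples(data) for k in kdes])

# Second procedure (Figure 1b): train on the concatenation [X, Xpdf].
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(np.hstack([X_tr, soft_labels(X_tr)]), y_tr)
print("test accuracy:", clf.score(np.hstack([X_te, soft_labels(X_te)]), y_te))
```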
3. Results
The datasets of this study were composed of cases of patients with bipolar disorder and matched control samples, at matching ratios of 1:1 and 1:3. The distributions of these data were computed using KDE to produce the corresponding probability density functions as the soft label information for subsequent knowledge distillation. When using a machine learning algorithm to construct a prediction model for bipolar disorder, a randomly selected 80% of the data samples was used for model training and validation, and the remaining 20% was used as the test set.
In the following paragraphs of this paper, we define a specific sequence to express the architecture of a neural network. For an architecture containing three hidden layers with v1, v2, and v3 nodes, respectively, we write NN (v1, v2, v3, 1); architectures with two hidden layers are written analogously. Since the training dataset only contained cases of patients with bipolar disorder and control samples without any mental illness, all of the learning models constructed in this study were binary predictors. In other words, these learning models were used to predict whether an input data sample came from a bipolar disorder patient. Therefore, the final 1 in the sequence represents the single node of the output layer. Three types of network architecture were evaluated in this study: NN (80, 10, 1), NN (160, 40, 1), and NN (80, 20, 10, 1). All of these architectures were tested and verified empirically.
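For example, NN (80, 10, 1) could be built with TensorFlow.Keras roughly as follows, using the activation, optimizer, and loss settings reported in Section 2.5; the input dimension of 160 (a flattened 20 × 8 embedding matrix) is our assumption.

```python
import tensorflow as tf

# NN (80, 10, 1): hidden layers of 80 and 10 nodes plus a single output node.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(80, activation="relu", input_shape=(160,)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Optimizer and loss as reported in Section 2.5.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```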
Because the learning models in this study were all binary predictors of bipolar disorder, we adopted the terminology of the confusion matrix: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The following metrics were utilized to evaluate the performances of the prediction models trained by the various machine learning algorithms.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$PPV = \frac{TP}{TP + FP}, \qquad NPV = \frac{TN}{TN + FN}$$

$$TPR = \frac{TP}{TP + FN}, \qquad TNR = \frac{TN}{TN + FP}$$
For the dataset of patients and control samples with a matching ratio of 1:1, the respective probability density functions estimated by KDE are presented in Figure 2 in the format of a curve chart. Observing the content of Figure 2, we can see that the respective probability density functions of the patients and control samples were quite different. In other words, the two groups exhibited very different data distributions in the diagnostic records used as characteristic attributes.
Next, we tested whether the data distribution information estimated by KDE was helpful for constructing the learning models. In our first analysis procedure (Figure 1a), the soft label information Xpdf, which represented the likelihood values of the data distribution of input X, was used as the attributes for training and validating the prediction models. The evaluation results for the test set are shown in Table 1.
In our second analysis procedure (Figure 1b), both X and Xpdf were used as the input attributes for training and validating the prediction models. The evaluation results for the test set are shown in Table 2.
Finally, in order to verify the effectiveness of the soft label information Xpdf, only the data X were used as attributes for training and validating the prediction models. The evaluation results for the test set are shown in Table 3.
Comparing the results shown in Table 1 and Table 3, it can be seen that using only the soft label information Xpdf as the input attributes did not always improve the performance of the predictive models. On the other hand, in order to quantitatively evaluate whether the prediction performance of the learning models constructed using both X and Xpdf (Table 2) was better than that of the models constructed using X only (Table 3), we adopted the concept of the odds ratio (OR). We focused on the performance of the prediction models in correctly identifying positive samples, that is, bipolar disorder patients. Therefore, the definition formulas of these evaluation metrics are described as follows.

$$OR_{\mathrm{Accuracy}} = \frac{(TP_1 + TN_1)/(FP_1 + FN_1)}{(TP_2 + TN_2)/(FP_2 + FN_2)}, \qquad OR_{PPV} = \frac{TP_1/FP_1}{TP_2/FP_2}, \qquad OR_{TPR} = \frac{TP_1/FN_1}{TP_2/FN_2}$$

Among these formulas, the variables TPj, FPj, TNj, and FNj represent, respectively, the numbers of true positive, false positive, true negative, and false negative predictions output by model j. The OR and corresponding 95% confidence interval (CI) values of the evaluation results, with the learning model constructed using both X and Xpdf regarded as model 1 and the model constructed using X only regarded as model 2, are presented in Table 4. It can be observed that most of the learning models constructed using both X and Xpdf performed better in terms of accuracy, PPV, and TPR than the models constructed using X only.
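As a sketch of how such an OR and its 95% CI can be computed from the confusion-matrix counts of two models, the helper below applies the standard Wald approximation on the log-odds scale; the counts in the usage line are illustrative, and the exact CI method used here is not stated in the text.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR = (a/b) / (c/d), with a Wald 95% CI computed on the log scale.

    For TPR, for instance: a = TP1, b = FN1, c = TP2, d = FN2.
    """
    or_value = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se)
    upper = math.exp(math.log(or_value) + z * se)
    return or_value, (lower, upper)

# Illustrative counts: model 1 finds 850 of 1000 positives, model 2 finds 700.
print(odds_ratio_ci(850, 150, 700, 300))
```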
In order to confirm that the data distributions generated by KDE played a role in knowledge distillation, we randomly selected patients and matched control samples to form the dataset and repeated this process 10 times. Each time, we used KDE to generate the soft label data Xpdf, and the Xpdf data were then utilized to train a decision tree. Finally, we examined the decision rules accompanying each tree structure and counted the features in Xpdf most frequently chosen as branching attributes (a sketch of this counting procedure follows the lists below). The disease diagnoses corresponding to these branching attributes are listed below in descending order of selection frequency.
For decision rules leading to a positive label of bipolar disorder, the most frequent branching attributes included hypertension, depressive disorder, anxiety disorder, suicidal ideations, type II diabetes mellitus, hyperlipidemia, esophageal reflux, chest pain, nicotine dependence, asthma, hypercholesterolemia, hypothyroidism, and alcohol abuse.
For decision rules leading to a negative label of bipolar disorder, the most frequent branching attributes included hypertension, hyperlipidemia, type II diabetes mellitus, chest pain, alcohol abuse, esophageal reflux, atrial fibrillation, hypercholesterolemia, depressive disorder, atherosclerosis/coronary heart disease, abdominal pain, urinary tract infection, hypothyroidism, nicotine dependence, headache, and syncope and collapse.
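The following sketch shows one way to perform the counting step just described with scikit-learn, reading the split features of each fitted tree from its tree_ attribute; the resampled data here are synthetic placeholders for the Xpdf matrices.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def branching_feature_counts(clf):
    """Count how often each feature index is used as a split in a fitted tree."""
    features = clf.tree_.feature  # negative values mark leaf nodes
    return Counter(int(f) for f in features if f >= 0)

# Illustrative: repeat over 10 resampled datasets, as described in the text.
rng = np.random.default_rng(0)
totals = Counter()
for _ in range(10):
    X = rng.normal(size=(500, 20))           # stand-in for an Xpdf matrix
    y = rng.integers(0, 2, size=500)
    tree = DecisionTreeClassifier(criterion="gini", max_depth=5).fit(X, y)
    totals.update(branching_feature_counts(tree))

# Feature indices most frequently chosen as branching attributes.
print(totals.most_common(5))
```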
For the dataset of patients and control samples with a matching ratio of 1:3, the respective probability density functions estimated by KDE are presented in Figure 3 in the format of a curve chart. Again, it can be seen that the patients and control samples exhibited very different data distributions in the diagnostic records.
For this dataset, the evaluation results on the test set for the prediction models of bipolar disorder trained using the various learning algorithms are presented in Table 5, Table 6 and Table 7. Comparing the results shown in Table 5 and Table 7, the prediction models using only the soft label information Xpdf as the input attributes consistently performed worse than the models trained using the input data X. However, comparing the results shown in Table 6 and Table 7, using both X and Xpdf as the input attributes for training the prediction models improved all evaluation metrics.
For cases and controls with a matching ratio of 1:3, the OR and corresponding 95% CI values between the learning models constructed using both X and Xpdf (Table 6) and the models constructed using X only (Table 7) are presented in Table 8. Similarly, it can be observed that most of the learning models constructed using both X and Xpdf performed better in terms of accuracy, PPV, and TPR than the models constructed using X only. In addition, we found that when the matching ratio of cases and controls was increased to 1:3, the learning model constructed using SVM and the X input attributes showed an obvious tendency to predict samples as negative (Table 7). Therefore, although this model achieved the best PPV of 0.719, it also had the worst TPR of 0.148. Using the soft label information Xpdf can significantly alleviate this problem caused by the unbalanced proportion of answer categories. Although the PPV values of the SVM models constructed using both X and Xpdf were reduced to 0.640–0.639, their TPR was improved to 0.593–0.583 (Table 6). This phenomenon verifies that Xpdf does provide information useful for identifying bipolar disorder patients.
Finally, for the dataset of patients and controls with a matching ratio of 1:3, the decision tree analysis described above was executed again. As before, we examined the decision rules accompanying the produced tree structures and counted the features in Xpdf most frequently chosen as branching attributes. The disease diagnoses corresponding to these branching attributes are listed below in descending order of selection frequency.
For decision rules leading to a positive label of bipolar disorder, the most frequent branching attributes included hypertension, depressive disorder, anxiety disorder, suicidal ideations, type II diabetes mellitus, esophageal reflux, hyperlipidemia, nicotine dependence, hypercholesterolemia, asthma, chest pain, hypothyroidism, and atherosclerosis/coronary heart disease.
For decision rules leading to a negative label of bipolar disorder, the most frequent branching attributes included hypertension, hyperlipidemia, type II diabetes mellitus, esophageal reflux, chest pain, depressive disorder, alcohol abuse, hypercholesterolemia, atherosclerosis/coronary heart disease, atrial fibrillation, nicotine dependence, hypothyroidism, headache, urinary tract infection, abdominal pain, and syncope and collapse.
4. Discussion
In the evaluation results of this study, the predictive performance of the models trained only with the soft label information Xpdf was not always better than that of the models trained with only the input data X (Table 1 vs. Table 3 and Table 5 vs. Table 7). On the other hand, regardless of the matching ratio of patients and control samples, we found that as long as the soft label information Xpdf was combined with the input data X to train the prediction models, the evaluation indicators PPV and TPR for identifying positive test samples improved. At the same time, the indicators NPV and TNR for identifying negative samples also improved (Table 2 vs. Table 3 and Table 6 vs. Table 7). In order to quantitatively evaluate whether the prediction performance of the learning models constructed using both X and Xpdf was better than that of the models constructed using X only, we computed the OR and corresponding 95% CI values for the performance measures of accuracy, PPV, and TPR. For patients and controls with a matching ratio of 1:1, the OR values of accuracy ranged from 2.045 to 0.996. Similarly, the OR values of PPV ranged from 1.410 to 0.983, and those of TPR ranged from 4.427 to 0.839 (Table 4). This verified that most of the learning models constructed using both X and Xpdf performed better in terms of accuracy, PPV, and TPR than the models constructed using X only. In other words, additionally using the soft label information Xpdf can improve the accuracy of the prediction models in identifying bipolar disorder patients. At the same time, increased PPV values represent fewer false positive samples, whereas increased TPR values represent the identification of more potential positive samples. On the other hand, for cases and controls with a matching ratio of 1:3, the OR values of accuracy ranged from 2.048 to 1.071. Similarly, the OR values of PPV ranged from 4.181 to 0.691, and those of TPR ranged from 8.362 to 0.889 (Table 8). Consequently, we can still conclude that most of the learning models constructed using both X and Xpdf performed better than the models constructed using X only.
In this study, we used medical history information readily available from the electronic health record system to try to overcome the limitation of previous studies that needed special instruments to obtain data. In addition, we introduced the concept of knowledge distillation and combined KDE with other machine learning algorithms to train diagnostic models for bipolar disorder. For patients and controls with a matching ratio of 1:1, the evaluation results of our diagnostic models yielded an accuracy of 0.810–0.659, a PPV of 0.806–0.700, an NPV of 0.846–0.626, a TPR of 0.854–0.524, and a TNR of 0.808–0.705 (Table 1, Table 2 and Table 3). The mean values of these metrics were an accuracy of 0.774, a PPV of 0.764, an NPV of 0.786, a TPR of 0.787, and a TNR of 0.761. Moreover, for patients and controls with a matching ratio of 1:3, the evaluation results of our prediction models yielded an accuracy of 0.894–0.746, a PPV of 0.862–0.493, an NPV of 0.924–0.775, a TPR of 0.773–0.502, and a TNR of 0.981–0.807 (Table 5, Table 6 and Table 7). The mean values of these metrics were an accuracy of 0.808, a PPV of 0.629, an NPV of 0.871, a TPR of 0.610, and a TNR of 0.875. Referring to the review of 33 studies based on machine learning models used to diagnose patients with bipolar disorder, the evaluation measure of accuracy is reported in 24 studies; the values range from 0.98 to 0.64, and the mean value is 0.8206. Moreover, the measure of TPR is reported in 15 studies; the values range from 0.875 to 0.664, and the mean value is 0.7826. Finally, the measure of TNR is reported in 13 studies; the values range from 0.971 to 0.742, and the mean value is 0.854 [6]. These past studies used different types of data and various machine learning algorithms to construct predictive models for bipolar disorder, and the sizes of the datasets they collected also varied greatly. Therefore, we cannot directly compare the measured values to evaluate predictive performance. However, for the performance metrics of accuracy, TPR, and TNR, the prediction models we constructed obtained mean values close to the results of past studies, and there is obvious overlap in the value ranges of these performance measures. It is worth noting that the analysis data of previous studies required special medical instruments to obtain, whereas our study instead used medical history information readily available from the electronic health record system and achieved similar prediction performance. This verifies that effective identification models for bipolar disorder patients can be constructed using medical history data and various machine learning algorithms. Furthermore, in view of the fact that most previous studies had sample sizes of less than 1000, we collected tens of thousands of data samples to improve the representativeness of the constructed prediction models.
In order for knowledge distillation to improve the prediction performance of the trained model, the soft label information must provide accurate distribution conditions of the input data. G. Hinton et al. argued that the probability distribution values produced by a deep learning model can be transferred to a shallow "distilled" learning model [13]. In addition, the research work of A. Korattikara Balan et al. found that the final outcomes of the student network can be thought of as approximating the conditional probabilities provided by the teacher group [11]. Since it is known that KDE has been applied to estimating the conditional probability distributions of input data when using a naive Bayes classifier [16,27], our study was inspired to incorporate KDE into knowledge distillation to construct prediction models for bipolar disorder. Furthermore, when KDE is used for data analysis, the bandwidth setting has a great influence on the accurate estimation of data distributions, and numerous studies have discussed the criteria for setting this parameter [14,15,28]. A novel KDE method has been proposed to minimize the bias part of the mean square error and to elevate the bandwidths of the kernel functions to alleviate the effects of variance. It has been verified that this novel KDE can estimate the distributions of input data more accurately than many traditional KDE methods [29,30,31,32]. Therefore, one of our future works will focus on using this novel KDE for knowledge distillation to construct more accurate predictive models.
In order to further verify the effectiveness of the soft label information Xpdf generated by KDE, we examined the decision rules of the tree structures constructed with Xpdf. Regardless of the matching ratio of patients and control samples, we found that identical disease diagnoses were selected as the branching attributes in the analysis results. The contents of Xpdf were not categorical disease descriptions but likelihood values of the probability density functions generated by KDE. Therefore, the features selected as branching attributes in the decision rules needed to be mapped back to the categorical disease descriptions. Since identical disease diagnoses were always selected as the branching attributes, Xpdf did provide correct distribution information for the input data. In addition, through a survey of the literature, we found various associations between bipolar disorder and the disease diagnoses selected by the decision trees. It is known that 6% of patients with bipolar disorder die by suicide and another 30–40% engage in self-harm [1]. Many patients with bipolar disorder also suffer from other mental illnesses, such as anxiety disorders, schizophrenia, substance abuse, etc. Furthermore, one typical symptom of the depressive phase of bipolar disorder is fatigue [1]. Moreover, some diseases have a higher incidence in patients with bipolar disorder than in the general population, including metabolic syndrome, migraine, obesity, and type II diabetes [5]. In addition, compared to the general population, patients with bipolar disorder have twice the risk of dying from coronary heart disease [1]. Meanwhile, hypertension, hyperlipidemia, hypercholesterolemia, chest pain, etc., are typical risk factors for and symptoms of coronary heart disease.
A recent cross-sectional study concluded that a history of asthma is common among patients with bipolar disorder [33]. Some medical illnesses have clinical presentations similar to symptoms of bipolar disorder, such as migraine headache, hypothyroidism, and hyperthyroidism [34]. Another study, conducted in Sweden, found higher odds of bipolar disorder in patients with gastroesophageal reflux disease [35]. Furthermore, a recent genome-wide pleiotropic association study found that the pleiotropic genetic determinants shared between gastrointestinal tract diseases and bipolar disorder are extensively distributed across the genome [36]. Moreover, ketamine has been used in the treatment of bipolar disorder, and it has been reported that longstanding ketamine abuse may cause urinary tract infection [37]. The analysis performed by Adam L Urback et al. concluded that bipolar disorder is associated with cerebrovascular dysfunction, pointing to areas of the brain that are predisposed to cerebrovascular diseases [38]. The research work of Paul J Harrison et al. showed that bipolar disorder may increase the risk of developing cerebrovascular disease and stroke [39]. A follow-up assessment of bipolar disorder patients conducted by Sermin Kesebir et al. found that a family history of diabetes mellitus was strongly associated with bipolar disorder and that a family history of thyroid disease was correlated with co-occurring anxiety disorders. Finally, that study also observed a comorbid association between bipolar disorder and cerebrovascular disease [40].