A Neighborhood-Similarity-Based Imputation Algorithm for Healthcare Data Sets: A Comparative Study
Abstract
1. Introduction
- Normal imputation: When the data are numerical, simple techniques, such as the mean or modal value of a feature, can be used to fill in the missing data. For categorical data (i.e., features with a defined and limited range of possible values), the most frequently occurring (modal) value of the feature can be used.
- Class-based imputation: Instead of replacing missing data with a value calculated from all existing feature values, as above, the replacement is based on some internal classification: the replacement value is determined from the known feature values of a restricted subclass of records.
- Model-based imputation: A hybrid approach in which the feature with the missing value is treated as the class (target), and all the remaining features are used to train a model that predicts the missing value. A minimal sketch of all three approaches follows this list.
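To make these three approaches concrete, the following minimal sketch (ours, for illustration; the column names and toy values are hypothetical and do not come from the paper) fills a numeric feature with its mean, a categorical feature with its mode, a class-restricted feature with a subclass mean, and a remaining feature with a model prediction:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy records (column names and values are illustrative only).
df = pd.DataFrame({
    "age":    [34.0, 51.0, None, 29.0, 46.0],
    "smoker": ["no", "yes", "no", None, "yes"],
    "bp":     [118.0, 142.0, 131.0, 110.0, None],
})

# 1. Normal imputation: feature-wide mean for numeric data,
#    modal value for categorical data.
df["age"] = df["age"].fillna(df["age"].mean())
df["smoker"] = df["smoker"].fillna(df["smoker"].mode()[0])

# 2. Class-based imputation: fill from a restricted subclass of records,
#    here the mean blood pressure among records with the same smoker status.
bp_by_class = df.groupby("smoker")["bp"].transform(lambda s: s.fillna(s.mean()))

# 3. Model-based imputation: treat the incomplete feature as the target
#    and train a predictor on the complete records.
known = df[df["bp"].notna()]
model = LinearRegression().fit(known[["age"]], known["bp"])
mask = df["bp"].isna()
df.loc[mask, "bp"] = model.predict(df.loc[mask, ["age"]])
print(df)
```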
- Reducing the speed degradation of the algorithm as the size of the data set increases.
- Selecting imputed values in a more localized way, rather than potentially drawing on all similar values in the data set.
- Reducing the negative impact of outlying values, as a consequence of this more localized selection.
- Providing a solution that can be extended for use with textual and categorical data, as well as numeric data.
2. Background and Related Work
2.1. Imputation by Mean/Mode/Median and Others
- The kNN algorithm is relatively slow, and its performance degrades as the size of the data set increases.
- kNN suffers from the curse of dimensionality [15]: as the number of feature values (dimensions) per record increases, the amount of data required to predict a new data point grows exponentially.
- kNN measures the closeness of a pair of records quite simply, for example using Euclidean or Manhattan distances.
- kNN requires homogeneity: all the features must be measured on the same scale, since the distance is taken as an absolute measure (see the scaling step in the sketch after this list).
- kNN does not work well with imbalanced data. Given two potential classifications, the algorithm will naturally be biased towards a result taken from the larger data subset, potentially leading to more misclassifications.
- kNN is sensitive to outlying values, as the choice of closest neighbors is based on an absolute measure of distance.
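For reference, a kNN-based imputation baseline can be assembled from off-the-shelf components. The sketch below is our illustration (not the implementation benchmarked later in this paper) and uses scikit-learn's KNNImputer; the explicit scaling step addresses the homogeneity issue noted above:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy numeric matrix with missing values encoded as NaN (values are ours).
X = np.array([
    [1.0,    120.0, 31.4],
    [3.0,    np.nan, 28.0],
    [2.0,    135.0, np.nan],
    [np.nan, 118.0, 30.1],
    [4.0,    150.0, 35.2],
])

# Scale to a common range first; otherwise the kNN distance lets the
# feature with the largest magnitude dominate neighbor selection.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # NaNs are ignored when fitting

imputer = KNNImputer(n_neighbors=2)  # small k, suited to this tiny example
X_imputed = scaler.inverse_transform(imputer.fit_transform(X_scaled))
print(np.round(X_imputed, 2))
```

Scaling every feature to [0, 1] before computing distances prevents a feature with a large magnitude (such as serum insulin in the Pima data set, which ranges up to 846) from dominating the neighborhood.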
2.2. Simple Statistical Imputation Techniques
- Identify missing values in the source data set.
- Iterate through the data set. For each record with missing values, replace each missing value with a statistical measure computed from the values of the same field in the records where that field is present.
- Once all the records have been completed, stop if the data set meets the criteria for its intended use; otherwise, repeat Step 2. A direct rendering of this loop follows the list.
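The following sketch renders these steps directly, assuming a pandas DataFrame and a per-column choice of statistic (our illustration, not the paper's code):

```python
import pandas as pd

def simple_impute(df: pd.DataFrame, stat: str = "mean") -> pd.DataFrame:
    """Replace each missing value with a column-level statistic computed
    from the records where that column is present."""
    out = df.copy()
    for col in out.columns:               # Step 2: iterate feature by feature
        observed = out[col].dropna()      # values from records where present
        if stat == "mean":
            fill = observed.mean()
        elif stat == "median":
            fill = observed.median()
        else:                             # "mode"
            fill = observed.mode()[0]
        out[col] = out[col].fillna(fill)
    return out                            # Step 3: all records now complete
```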
2.3. Multistage Techniques
3. Proposed Algorithm
3.1. Proposal Main Steps
- Apply our imputation technique to fill in each missing attribute in turn (where i corresponds to the ith feature in each patient record) for the current record r, creating a complete record in D. This will become the basis of the later comparisons. Incomplete records r are those in which at least one of the feature values is missing.
- Use the k-fold (with k = 10) [27,28] technique to partition D into non-intersecting subsets. In turn, each subset (fold) is treated as the test fold, and the remaining folds are used as training folds. For each record in the test fold, we apply a comparison function, in our case the cosine similarity, to obtain a numerical measure of how similar the test record is to the current record in the training folds. An ordered similarity table, S, stores each training record together with its similarity to the current test record. This is repeated until the test record has been compared against all the records in all the training folds. After each change to the contents of S, it is re-sorted so that the most similar training record appears as the first item in the list. This could be more complicated depending on the comparison function used, but in our case, the sort order is merely used to maintain the n most similar items (defining the neighborhood) in S, ordered by cosine similarity. The contents of S are cleared once all the training set records have been compared, ready for the subsequent cycle. A sketch of this cycle follows.
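The following sketch illustrates one cycle of this procedure under our reading of the steps above (the helper names and toy data are ours, not the authors' code): each test record is compared against every training record with cosine similarity, and only the n most similar entries are kept in the table S:

```python
import numpy as np
from sklearn.model_selection import KFold

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """The comparison function: cosine similarity between two records."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def n_most_similar(test_record: np.ndarray,
                   training_records: np.ndarray, n: int = 5):
    """Build the similarity table S for one test record, keeping only the
    n most similar training records (the neighborhood)."""
    S = [(cosine_similarity(test_record, t), i)
         for i, t in enumerate(training_records)]
    S.sort(reverse=True)   # most similar training record appears first
    return S[:n]           # S is rebuilt from scratch for the next record

# 10-fold partition of the completed data set D into test/training folds.
D = np.random.rand(100, 8)   # hypothetical: 100 records, 8 features
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(D):
    for test_record in D[test_idx]:
        neighborhood = n_most_similar(test_record, D[train_idx], n=5)
```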
3.2. Similarity Model Behavior
3.3. Empirical Bayes Correction
Motivation
4. Performance Evaluation
4.1. Implementation Overview
- If there is a unique modal value in C, then use this value as the imputed feature value.
- For those modal values which occur in C with equal highest frequency, if one of these modal values has the same feature value as the actual feature value of the most similar complete record in the neighborhood, then select this modal value as the new imputed feature value for the current incomplete record.
- Determine whether one of the values in C lies closer to the median value of the candidate set than the others. If such a value is found, select it as the imputed feature value.
- If none of the previous rules has been satisfied, then select the mean value of C. A sketch of this selection cascade follows the list.
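Under our reading of these rules, the selection cascade can be sketched as follows (the function and parameter names are our own, and Rule 3 is interpreted here as applying to the tied modal values):

```python
import statistics

def select_imputed_value(C, most_similar_value):
    """Apply the four-rule cascade to the candidate set C.

    most_similar_value is the corresponding feature value of the most
    similar complete record (used only to break modal ties).
    """
    modes = statistics.multimode(C)
    if len(modes) == 1:                       # Rule 1: unique modal value
        return modes[0]
    if most_similar_value in modes:           # Rule 2: tie-break using the
        return most_similar_value             # most similar complete record
    median = statistics.median(C)
    # Rule 3 (our reading): a tied candidate strictly closest to the median.
    ranked = sorted(modes, key=lambda v: abs(v - median))
    if abs(ranked[0] - median) < abs(ranked[1] - median):
        return ranked[0]
    return statistics.mean(C)                 # Rule 4: fall back to the mean
```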
4.2. Evaluation of RMSE
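Throughout this section, RMSE takes its standard definition: for n imputed values $\hat{x}_i$ compared with the corresponding true values $x_i$,

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{x}_i-x_i\right)^{2}}$$

Lower values indicate imputed values closer to the ground truth.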
4.3. Simulated Dataset
4.4. Pima Indians Diabetes Data Set
4.4.1. Comparison with Popular Imputation Methods
- Remove Incomplete Records (Listwise Deletion): Any records in D that have one or more missing feature values are removed from the data set prior to processing, leaving a smaller but complete data set D. This technique should not be used arbitrarily as a means of direct comparison with the other techniques in this paper, since factors such as the initial completeness of D need to be assessed first; it is included here only because of its general popularity (Figure 2).
- Replace Missing Data With Mean Attribute Value: Any missing feature values are replaced with the average value calculated from the corresponding feature values in all the complete records in the data set.
- Replace Missing Data With Modal Attribute Value: Any missing feature values are replaced with the most common value gathered from the corresponding feature values from all the complete records in the data set.
- Replace Missing Data Using Empirical Bayes Algorithm: This method statistically infers missing feature values using a prior distribution estimated from the known values in the data set (an illustrative sketch follows this list).
- Replace Missing Data With N-Similarity Algorithm: Any missing feature values are replaced with the best candidate value calculated from the corresponding feature values in the N-most-similar complete records in the data set.
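The paper's empirical Bayes correction is described in Section 3.3; as a general illustration only (under our assumptions: a normal-normal model in which a local estimate is shrunk towards a feature-wide prior estimated from the known values), such a correction can be sketched as:

```python
import numpy as np

def eb_shrink(local_mean: float, local_n: int,
              observed: np.ndarray) -> float:
    """Empirical Bayes correction under an assumed normal-normal model:
    shrink a local (e.g., neighborhood) estimate towards the prior mean,
    with the prior estimated from all observed values of the feature."""
    mu0 = observed.mean()        # prior mean, estimated from known values
    tau2 = observed.var(ddof=1)  # prior variance (estimated hyperparameter)
    sigma2 = observed.var(ddof=1)  # sampling-variance proxy (our assumption)
    # Posterior mean: precision-weighted blend of local and prior estimates.
    w = (local_n / sigma2) / (local_n / sigma2 + 1.0 / tau2)
    return w * local_mean + (1.0 - w) * mu0

observed = np.array([4.1, 3.8, 5.0, 4.4, 4.7])   # known feature values
print(eb_shrink(local_mean=6.2, local_n=3, observed=observed))
```

The shrinkage weight w grows with the amount of local evidence, so well-supported neighborhood estimates move less towards the prior mean.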
4.4.2. Results, Limitations, and Discussion
4.4.3. Benchmarking with kNN
5. Conclusions
6. Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
RMSE | Root Mean Squared Error
NSIM | Our Neighbourhood SIMilarity algorithm
NSIM-EB | Our Neighbourhood SIMilarity algorithm with Empirical Bayes Correction
kNN | k-Nearest Neighbours classification algorithm
TP | True Positive
FP | False Positive
TN | True Negative
FN | False Negative
TPR | True Positive Rate
FPR | False Positive Rate
MAV | Mean Average Value
MDAV | Modal Average Value
MCC | Matthews Correlation Coefficient
BMI | Body Mass Index
BP | Blood Pressure
DPf | Diabetes Pedigree Function
References
- Tang, J.; Zhang, X.; Yin, W.; Zou, Y.; Wang, Y. Missing data imputation for traffic flow based on combination of fuzzy neural network and rough set theory. J. Intell. Transp. Syst. Technol. Plan. Oper. 2019, 5, 439–454. [Google Scholar] [CrossRef]
- Agrawal, R.; Prabakaran, S. Big data in digital healthcare: Lessons learnt and recommendations for general practice. Heredity 2020, 124, 525–534. [Google Scholar] [CrossRef] [PubMed]
- Adam, K. Big Data Analysis And Storage. In Proceedings of the 2015 International Conference on Operations Excellence and Service Engineering, Orlando, FL, USA, 10–11 September 2015; pp. 648–658. [Google Scholar]
- Ford, E.; Rooney, P.; Hurley, P.; Oliver, S.; Bremner, S.; Cassell, J. Can the Use of Bayesian Analysis Methods Correct for Incompleteness in Electronic Health Records Diagnosis Data? Development of a Novel Method Using Simulated and Real-Life Clinical Data. Public Health 2020, 8, 54. [Google Scholar] [CrossRef] [PubMed]
- Lai, X.; Wu, X.; Zhang, L.; Lu, W. Imputations of missing values using a tracking-removed autoencoder trained with incomplete data. Neurocomputing 2019, 366, 54–65. [Google Scholar] [CrossRef]
- Singhal, S. Defining, Analysing, and Implementing Imputation Techniques. 2021. Available online: https://www.analyticsvidhya.com/blog/2021/06/defining-analysing-and-implementing-imputation-techniques/ (accessed on 22 November 2023).
- Beretta, L.; Santaniello, A. Nearest neighbor imputation algorithms: A critical evaluation. BMC Med. Inform. Decis. Mak. 2016, 16, 197–208. [Google Scholar] [CrossRef] [PubMed]
- Fouad, K.M.; Ismail, M.M.; Azar, A.T.; Arafa, M.M. Advanced methods for missing values imputation based on similarity learning. PeerJ Comput. Sci. 2021, 7, e619. [Google Scholar] [CrossRef]
- Huang, G. Missing data filling method based on linear interpolation and lightgbm. J. Phys. Conf. Ser. 2021. [Google Scholar] [CrossRef]
- Peppanen, J.; Zhang, X.; Grijalva, S.; Reno, M.J. Handling bad or missing smart meter data through advanced data imputation. In Proceedings of the 2016 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT), Ljubljana, Slovenia, 9–12 October 2016; pp. 1–5. [Google Scholar] [CrossRef]
- Jakobsen, J.C.; Gluud, C.; Wetterslev, J.; Winkel, P. When and how should multiple imputation be used for handling missing data in randomised clinical trials—A practical guide with flowcharts. BMC Med. Res. Methodol. 2017, 17, 162. [Google Scholar] [CrossRef]
- Hayati Rezvan, P.; Lee, K.J.; Simpson, J.A. The rise of multiple imputation: A review of the reporting and implementation of the method in medical research. BMC Med. Res. Methodol. 2015, 15, 30. [Google Scholar] [CrossRef]
- Nguyen, C.; Carlin, J.; Lee, K. Practical strategies for handling breakdown of multiple imputation procedures. Emergent Themes Epidemiol. 2021, 18, 5. [Google Scholar] [CrossRef]
- Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN Model-Based Approach in Classification. In Confederated International Conferences “On The Move To Meaningful Internet Systems 2003”; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2003; Volume 2888, pp. 986–996. [Google Scholar] [CrossRef]
- Pohl, S.; Becker, B. Performance of Missing Data Approaches Under Nonignorable Missing Data Conditions. Methodology 2018, 16, 147–165. [Google Scholar] [CrossRef]
- Ali, N.; Neagu, D.; Trundle, P. Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets. SN Appl. Sci. 2019, 1, 1559. [Google Scholar] [CrossRef]
- Abu Alfeilat, H.A.; Hassanat, A.B.A.; Lasassmeh, O.; Tarawneh, A.S.; Alhasanat, M.B.; Eyal Salman, H.S.; Prasath, V.S. Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review. Big Data 2019, 7, 221–248. [Google Scholar] [CrossRef]
- Khan, S.; Hoque, A. SICE: An improved missing data imputation technique. J. Big Data 2020, 7, 37. Available online: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00313-w (accessed on 22 November 2023). [CrossRef] [PubMed]
- Misztal, M. Imputation of Missing Data Using R. Acta Univ. Lodz. Folia Oeconomica 2012, 269, 131–144. [Google Scholar]
- Kowarik, A.; Templ, M. Imputation with the R Package VIM. J. Stat. Softw. 2016, 74, 1–16. [Google Scholar] [CrossRef]
- Choi, J.; Dekkers, O.; Le Cessie, S. A comparison of different methods to handle missing data in the context of propensity score analysis. Eur. J. Epidemiol. 2019, 34, 23–36. [Google Scholar] [CrossRef]
- Cetin-Berber, D.; Sari, H. Imputation Methods to Deal With Missing Responses in Computerized Adaptive Multistage Testing. Educ. Psychol. Meas. 2018, 79, 495–511. [Google Scholar] [CrossRef]
- Alwohaibi, M.; Alzaqebah, M. A hybrid multi-stage learning technique based on brain storming optimization algorithm for breast cancer recurrence prediction. J. King Saud Univ. Comput. Inf. Sci. 2021, 34, 5192–5203. [Google Scholar] [CrossRef]
- Kabir, G.; Tesfamariam, S.; Hemsing, J.; Rehan, S. Handling incomplete and missing data in water network database using imputation methods. Sustain. Resilient Infrastruct. 2020, 5, 365–377. [Google Scholar] [CrossRef]
- Mujahid, M.; Rustam, F.; Shafique, R.; Chunduri, V.; Villar, M.G.; Ballester, J.B.; Diez, I.D.L.T.; Ashraf, I. Analyzing Sentiments Regarding ChatGPT Using Novel BERT: A Machine Learning Approach. Information 2023, 14, 474. [Google Scholar]
- Mujahid, M.; Rehman, A.; Alam, T.; Alamri, F.S.; Fati, S.M.; Saba, T. An Efficient Ensemble Approach for Alzheimer’s Disease Detection Using an Adaptive Synthetic Technique and Deep Learning. Diagnostics 2023, 13, 2489. [Google Scholar] [CrossRef] [PubMed]
- Nti, I.; Nyarko-Boateng, O.; Aning, J. Performance of Machine Learning Algorithms with Different K Values in K-Fold Cross Validation; MECS Press: Hong Kong, China, 2021. [Google Scholar] [CrossRef]
- Brownlee, J. How to Configure k-Fold Cross-Validation; Machine Learning Mastery: San Juan, PR, USA, 2020. [Google Scholar]
- Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019; Volume 793. [Google Scholar]
- Carlin, B.; Louis, T. Bayes and Empirical Bayes Methods for Data Analysis, 2nd ed.; Chapman and Hall CRC: Boca Raton, FL, USA, 2000. [Google Scholar] [CrossRef]
- Zhou, X.; Wang, X.; Dougherty, E.R. Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics 2003, 19, 2302–2307. [Google Scholar] [CrossRef] [PubMed]
- Cheng, P.E. Nonparametric Estimation of Mean Functionals with Data Missing at Random. J. Am. Stat. Assoc. 1994, 89, 81–87. [Google Scholar] [CrossRef]
- Root Mean Squared Error Definition. 2022. Available online: https://www.sciencedirect.com/topics/engineering/root-mean-squared-error (accessed on 22 November 2023).
- Crookston, N.L.; Finley, A.O. yaImpute: An R package for kNN imputation. J. Stat. Softw. 2008, 23, 1–16. [Google Scholar] [CrossRef]
- PIMA Indian Diabetes Database. 2016. Available online: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (accessed on 22 November 2023).
- Lin, W.C.; Chih-Fong, T. Missing value imputation: A review and analysis of the literature (2006–2017). Artif. Intell. Rev. 2020, 53, 1487–1509. [Google Scholar] [CrossRef]
- Dong, Y.; Peng, C.Y.J. Principled missing data methods for researchers. SpringerPlus 2013, 2, 222. [Google Scholar] [CrossRef]
- Huang, L.; Wang, C.; Rosenberg, N.A. The Relationship between Imputation Error and Statistical Power in Genetic Association Studies in Diverse Populations. Am. J. Hum. Genet. 2009, 85, 692–698. [Google Scholar] [CrossRef]
- Pepinsky, T.B. A Note on Listwise Deletion versus Multiple Imputation. Political Anal. 2018, 26, 480–488. [Google Scholar] [CrossRef]
- Lall, R. How multiple imputation makes a difference. Political Anal. 2016, 24, 414–433. [Google Scholar] [CrossRef]
- Allison, P. Listwise Deletion: It’s NOT Evil. 2014. Available online: https://statisticalhorizons.com/listwise-deletion-its-not-evil/ (accessed on 22 November 2023).
- Schork, J. Imputation Methods (Top 5 Popularity Ranking). Statistics Globe. 2019. Available online: https://statisticsglobe.com/imputation-methods-for-handling-missing-data/ (accessed on 22 November 2023).
Metric | N = 1 | N = 2 | N = 3 | N = 4 | N = 5 | N = 6 | N = 7 | N = 8 | N = 9 | N = 10
---|---|---|---|---|---|---|---|---|---|---
Accuracy | 55.64% | 73.37% | 58.01% | 69.84% | 58.84% | 70.82% | 60.13% | 66.81% | 60.74% | 67.41%
Correlation | 76.97% | 88.91% | 89.65% | 89.55% | 89.57% | 89.06% | 89.30% | 89.63% | 89.33% | 88.99%
Precision | 31.12% | 58.41% | 30.79% | 55.59% | 32.28% | 59.07% | 32.30% | 46.98% | 33.53% | 48.58%
Recall | 33.01% | 55.13% | 28.71% | 39.62% | 25.50% | 29.73% | 23.53% | 32.57% | 24.23% | 28.60%
Specificity | 66.18% | 81.47% | 71.34% | 83.76% | 74.22% | 89.57% | 77.04% | 82.43% | 77.49% | 85.12%
TPR | 23.03% | 38.12% | 20.34% | 28.12% | 18.19% | 20.44% | 16.48% | 23.07% | 17.10% | 19.80%
FPR | 33.82% | 18.53% | 28.66% | 16.24% | 25.78% | 10.43% | 22.96% | 17.57% | 22.51% | 14.88%
Average MCC | −0.0495 | 0.4582 | 0.0792 | 0.3419 | 0.0842 | 0.3069 | 0.0383 | 0.2326 | 0.0123 | 0.1771
M | Method |   |   |   |   |   |  
---|---|---|---|---|---|---|---
1 | NSIM | 1.382 | 1.402 | 1.414 | 1.378 | 1.345 | 1.333
1 | NSIM-EB | 0.996 | 1.054 | 1.068 | 1.052 | 1.052 | 1.035
1 | kNNs | 1.399 | 1.422 | 1.508 | 1.453 | 1.456 | 1.386
50 | NSIM | 1.421 | 1.401 | 1.398 | 1.402 | 1.401 | 1.380
50 | NSIM-EB | 1.047 | 1.035 | 1.041 | 1.042 | 1.042 | 1.027
50 | kNNs | 1.417 | 1.413 | 1.407 | 1.409 | 1.405 | 1.399
100 | NSIM | 1.420 | 1.396 | 1.385 | 1.386 | 1.385 | 1.373
100 | NSIM-EB | 1.046 | 1.031 | 1.034 | 1.038 | 1.038 | 1.014
100 | kNNs | 1.413 | 1.418 | 1.411 | 1.415 | 1.410 | 1.417
Feature | Data Type | Value Range (Zero Indicates Missing Value) |
---|---|---|
Number of Times Pregnant | Positive Integer | 0…17 |
Plasma Glucose Concentration | Real | 0…199 |
Diastolic Blood Pressure | Real | 0…122 |
Triceps Skinfold Thickness | Real | 0…99 |
Serum Insulin Levels | Real | 0…846 |
Body Mass Index | Real | 0…67.1 |
Diabetes Pedigree Function | Real | 0.078…2.42 |
Age | Positive Integer | 21…81 |
Classification | Binary | 1 = positive diagnosis, 0 = negative diagnosis |
Metric | Remove Incomplete Records (Listwise Deletion) | Replace Missing Data with MAV | Replace Missing Data with MDAV | Average N-Similarity Algorithm (N = 1…10)
---|---|---|---|---
Number Of Perfect Tests | 10 | 10 | 10 | 10
Accuracy | 54.76% | 54.85% | 54.88% | 64.16% (+9.33%)
Correlation | 92.48% | 94.92% | 95.00% | 88.06% (−6.07%)
Precision | 36.94% | 31.31% | 31.32% | 42.86% (+9.67%)
Recall | 31.26% | 36.65% | 37.35% | 32.06% (−3.03%)
Specificity | 68.96% | 63.28% | 62.82% | 78.86% (+13.84%)
True Positive Rate (TPR) | 22.23% | 25.17% | 25.21% | 22.47% (−1.73%)
False Positive Rate (FPR) | 31.04% | 36.72% | 37.18% | 21.12% (−13.86%)
Average MCC | 0.0891 | 0.0160 | −0.0413 |
M | Method | Pregnancy | Glucose | BP | Triceps | Insulin | BMI | DPf | Age
---|---|---|---|---|---|---|---|---|---
1 | NSIM | 0.875 | 0.963 | 1.051 | 0.937 | 0.942 | 0.892 | 1.103 | 0.848
1 | NSIM-EB | 0.737 | 0.770 | 0.777 | 0.816 | 0.726 | 0.782 | 0.791 | 0.752
1 | kNNs | 0.872 | 0.948 | 1.101 | 1.043 | 0.896 | 0.968 | 1.013 | 0.884
5 | NSIM | 1.134 | 1.114 | 1.275 | 1.128 | 1.148 | 1.065 | 1.288 | 1.089
5 | NSIM-EB | 0.900 | 0.899 | 0.962 | 0.948 | 0.882 | 0.899 | 0.937 | 0.894
5 | kNNs | 1.343 | 1.328 | 1.265 | 1.315 | 1.230 | 1.322 | 1.272 | 1.289
10 | NSIM | 1.172 | 1.149 | 1.335 | 1.167 | 1.235 | 1.096 | 1.372 | 1.125
10 | NSIM-EB | 0.942 | 0.928 | 0.992 | 0.958 | 0.956 | 0.927 | 0.984 | 0.891
10 | kNNs | 1.382 | 1.356 | 1.331 | 1.406 | 1.293 | 1.360 | 1.349 | 1.263
15 | NSIM | 1.177 | 1.154 | 1.350 | 1.181 | 1.249 | 1.109 | 1.388 | 1.140
15 | NSIM-EB | 0.944 | 0.940 | 1.004 | 0.971 | 0.961 | 0.936 | 1.030 | 0.917
15 | kNNs | 1.419 | 1.370 | 1.376 | 1.418 | 1.333 | 1.359 | 1.404 | 1.379
20 | NSIM | 1.187 | 1.167 | 1.356 | 1.185 | 1.269 | 1.121 | 1.393 | 1.160
20 | NSIM-EB | 0.959 | 0.942 | 1.006 | 0.969 | 0.995 | 0.950 | 1.014 | 0.928
20 | kNNs | 1.399 | 1.359 | 1.378 | 1.397 | 1.336 | 1.367 | 1.345 | 1.372