Appendix A.1
Mean imputation is a simple and commonly used method to handle missing data. This algorithm replaces missing values in each variable with the mean value of the observed data for the same variable. The mean imputation algorithm can be described as follows:
Algorithm A1 Mean Imputation Algorithm (for benchmarking only).
1: for each variable with missing data do
2:   Calculate the mean of the observed values for the current variable.
3:   Replace missing values in the current variable with the calculated mean.
4: end for
Mean imputation offers a quick and easy way to handle missing data. However, it may not always be the best method, as it can lead to underestimated variances and biased estimates, especially when the data is not missing completely at random. More advanced methods like multiple imputation are often preferred for handling missing data in statistical analyses.
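As a concrete illustration, column-wise mean imputation takes only a few lines of Python (standard library only; the function name `mean_impute` is ours, not from any package):

```python
from statistics import mean

def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mu = mean(observed)
    return [mu if v is None else v for v in column]
```

Note that every imputed cell receives the same value, which is exactly why the variance of the imputed variable is underestimated.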
kNN imputation in the R package VIM REFERENCE uses Gower’s distance, which is suitable for mixed data that can include ordinal, continuous, and categorical variables.
Gower’s distance is particularly useful because it can handle different types of variables: it rescales each variable to the [0, 1] interval and then computes the distance as a sum of the scaled differences over all variables.
For each missing value in the dataset, the algorithm finds the k nearest observations based on Gower’s distance. If there are k such neighbours, it then imputes the missing value: for continuous or ordinal data, it uses the median of the neighbours’ values; for categorical data, it uses the mode (most common value) of the neighbours. If fewer than k neighbours are available, the algorithm skips that missing value and moves on to the next; see Algorithm A2.
Algorithm A2 kNN Imputation Algorithm using Gower’s Distance.
1: procedure kNN(X, k)
2:   for each i where x_i is missing do
3:     N ← the k points in X with the smallest Gower’s distances to i whose values for the variable are not missing
4:     if there are k such neighbours then
5:       if the variable is continuous or ordinal then
6:         x_i ← median(N)
7:       else if the variable is categorical then
8:         x_i ← mode(N)
9:       end if
10:      else
11:        Continue to the next i
12:      end if
13:    end for
14:    return X
15: end procedure
Please note that the actual implementation in the VIM package is more complex as it handles different edge cases and is optimised in terms of performance.
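To make the idea concrete, the following is a from-scratch Python sketch of kNN imputation with Gower’s distance, not the VIM implementation; missing values are encoded as None, variable types are passed as "num" or "cat", and the names `gower_distance` and `knn_impute` are ours:

```python
from statistics import median

def gower_distance(a, b, ranges, types):
    """Mean of per-variable scaled differences; pairs with a missing value are skipped."""
    diffs = []
    for x, y, r, t in zip(a, b, ranges, types):
        if x is None or y is None:
            continue
        if t == "cat":
            diffs.append(0.0 if x == y else 1.0)
        else:  # continuous / ordinal: absolute difference scaled to [0, 1]
            diffs.append(abs(x - y) / r if r else 0.0)
    return sum(diffs) / len(diffs)

def knn_impute(rows, types, k):
    """Impute each missing cell from its k Gower-nearest donors:
    median for numeric variables, mode for categorical ones."""
    # range of the observed values per numeric variable, for the [0, 1] scaling
    ranges = []
    for j, t in enumerate(types):
        if t == "cat":
            ranges.append(None)
        else:
            obs = [r[j] for r in rows if r[j] is not None]
            ranges.append(max(obs) - min(obs))
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                donors = sorted(
                    (r for r in rows if r is not row and r[j] is not None),
                    key=lambda r: gower_distance(row, r, ranges, types))[:k]
                if len(donors) < k:
                    continue  # fewer than k donors: skip, as in Algorithm A2
                vals = [r[j] for r in donors]
                if types[j] == "cat":
                    out[i][j] = max(set(vals), key=vals.count)  # mode
                else:
                    out[i][j] = median(vals)
    return out
```

Donors are always drawn from the originally observed values, so the order in which missing cells are visited does not affect the result.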
Multiple Imputation by Chained Equations (MICE), also known as Fully Conditional Specification (FCS), is another popular method for handling missing data. This method works by performing multiple imputations for the missing values, creating several different complete datasets. The results from these datasets can then be pooled to create a single, more robust estimate. Algorithm A3 represents a simplified version of the MICE algorithm.
Algorithm A3 Multiple Imputation by Chained Equations (MICE).
1: procedure MICE(X, m, T)
2:   Initialize the m copies X^(1), …, X^(m) with simple imputations (e.g., mean imputation)
3:   for j ← 1 to m do
4:     for t ← 1 to T do
5:       for each variable V with missing values in X^(j) do
6:         Predict V given the other variables in X^(j) (create a prediction model)
7:         Replace missing values in V in X^(j) with predictions from the model
8:       end for
9:     end for
10:  end for
11:  return X^(1), …, X^(m)
12: end procedure
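A minimal Python sketch of one chain, for two numeric variables only, illustrates the cycling: each variable's missing entries are re-predicted from the other until the values stabilise. This is a single deterministic chain (m = 1) and omits the noise/posterior draws that proper multiple imputation adds; the names `init_mean`, `linreg`, and `chained_impute` are ours:

```python
from statistics import mean

def init_mean(col):
    """Fill None with the column mean; also return the missing indices."""
    mu = mean(v for v in col if v is not None)
    return [mu if v is None else v for v in col], [i for i, v in enumerate(col) if v is None]

def linreg(xs, ys):
    """Least-squares intercept and slope for y ≈ a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sum((u - mx) ** 2 for u in xs)
    return my - b * mx, b

def chained_impute(x, y, n_iter=30):
    """One chained-equations pass over two numeric variables."""
    xf, xmis = init_mean(x)
    yf, ymis = init_mean(y)
    for _ in range(n_iter):
        a, b = linreg(yf, xf)          # model x given y
        for i in xmis:
            xf[i] = a + b * yf[i]
        a, b = linreg(xf, yf)          # model y given x
        for i in ymis:
            yf[i] = a + b * xf[i]
    return xf, yf
```

With perfectly linear data, the imputed entries converge to the values implied by the regression line.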
“Midastouch”, on the other hand, also fits a linear regression model to predict missing values, but it modifies the donor-selection criteria; see Algorithm A5 [18].
In the Multiple Imputation by Chained Equations (MICE) framework, the default method for categorical variables is Bayesian polytomous logistic regression, which is suitable for both ordered and unordered categorical variables.
The Bayesian polytomous logistic regression creates a probabilistic model for each category level and uses these probabilities to impute missing values. This approach accounts for the uncertainty of the imputed values and naturally handles the categorical nature of the variable.
The exact method might slightly differ based on the number of categories, order of categories (for ordinal data), and other factors. For example, for binary variables (a special case of categorical variables with two levels), the MICE algorithm uses logistic regression as a default.
Algorithm A4 MICE using Predictive Mean Matching (PMM).
1: procedure MICE_PMM(X, m, T)
2:   Initialize the m copies X^(1), …, X^(m) with simple imputations (e.g., mean imputation)
3:   for j ← 1 to m do
4:     for t ← 1 to T do
5:       for each variable V with missing values in X^(j) do
6:         Predict V given the other variables using a linear regression model in X^(j)
7:         For each missing value in V in X^(j), find the set S of observed values in V whose predicted values are closest to the missing case's predicted value
8:         Replace the missing value with a random selection from set S
9:       end for
10:    end for
11:  end for
12:  return X^(1), …, X^(m)
13: end procedure
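The PMM matching step can be sketched in Python for a single numeric target y with one complete predictor x. This is one pass of a single imputation, not the full multiply-imputed procedure, and the name `pmm_impute` and the donor count default are ours:

```python
import random
from statistics import mean

def pmm_impute(x, y, donors=3, seed=0):
    """Regress y on x, then replace each missing y with an observed y whose
    predicted value is among the closest to the missing case's prediction."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    xs, ys = [x[i] for i in obs], [y[i] for i in obs]
    mx, my = mean(xs), mean(ys)
    b = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sum((u - mx) ** 2 for u in xs)
    a0 = my - b * mx
    pred = {i: a0 + b * x[i] for i in range(len(y))}
    out = list(y)
    for i in mis:
        # set S: observed cases with predictions closest to pred[i]
        S = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:donors]
        out[i] = y[rng.choice(S)]  # draw a donor and borrow its observed value
    return out
```

Because imputed values are always real observed values, PMM cannot produce impossible values (e.g., negative ages), which is one reason it is a popular default for numeric data.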
Algorithm A5 Midastouch Imputation.
1: procedure Midastouch(X, y)
2:   Identify y_obs (observed values) and y_mis (missing values) of y in X
3:   Draw a bootstrap sample from the donor pool of y_obs, called y*_obs
4:   Estimate a beta matrix on y*_obs using the leave-one-out principle
5:   Compute type II predicted values for y_obs and y_mis using the beta matrix, producing predicted y_obs (n_obs × 1) and predicted y_mis (n_mis × n_obs)
6:   Calculate the distance between each predicted y_obs and the corresponding predicted y_mis
7:   Convert the distances to drawing probabilities
8:   for each missing value in y_mis do
9:     Draw a donor from the entire donor pool according to the drawing probabilities
10:    Replace the missing value with the observed value of the selected donor in y
11:  end for
12:  return X
13: end procedure
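The following is a heavily simplified, single-predictor Python sketch of the midastouch idea: bootstrap the donor pool, fit a leave-one-out regression per donor, and turn donor-to-case prediction distances into drawing probabilities. The function name `midastouch_impute` and the inverse-distance weighting are our simplifications; the mice implementation uses a more elaborate probability scheme:

```python
import random
from statistics import mean

def midastouch_impute(x, y, seed=0):
    """Each donor j gets its own leave-one-out model; donors whose predictions
    lie closer to the missing case's prediction are drawn with higher probability."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    boot = [rng.choice(obs) for _ in obs]  # bootstrap the donor pool

    def fit(idx):
        """Least-squares y ~ x on the cases in idx."""
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        mx, my = mean(xs), mean(ys)
        sxx = sum((u - mx) ** 2 for u in xs)
        if sxx == 0:                # degenerate bootstrap: fall back to a flat model
            return my, 0.0
        b = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sxx
        return my - b * mx, b

    out = list(y)
    for i in mis:
        dist = {}
        for j in obs:
            loo = [k for k in boot if k != j] or obs  # leave donor j out
            a0, b = fit(loo)
            dist[j] = abs((a0 + b * x[j]) - (a0 + b * x[i]))
        # closer donors get larger drawing probability
        w = {j: 1.0 / (d + 1e-9) for j, d in dist.items()}
        donor = rng.choices(list(w), weights=list(w.values()))[0]
        out[i] = y[donor]
    return out
```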
Random Forest is a powerful machine learning algorithm that can also be used for imputation of missing data. The R package ranger provides an efficient implementation of Random Forest, which can be used for imputation.
Algorithm A6 is a simplified version of the Random Forest imputation algorithm using ranger.
This approach takes advantage of the strengths of Random Forest, including its ability to handle non-linear relationships and interactions between variables.
Please note that in practice, additional steps might be necessary for tuning the Random Forest parameters (NumTrees and mTry), assessing the quality of the imputations, and handling different types of variables (continuous, ordinal, categorical).
Algorithm A6 Random Forest Imputation using Ranger.
1: procedure missRanger(X, NumTrees, mTry)
2:   for each variable V with missing values in X do
3:     Create a copy of X, called X'
4:     Replace missing values in V in X' with median(V) (or mode for categorical variables)
5:     Build a Random Forest model with NumTrees trees and mTry variables tried at each split, using X' to predict V
6:     Predict missing values in V in X using the Random Forest model, and replace missing values with predictions
7:   end for
8:   return X
9: end procedure
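The Algorithm A6 loop can be illustrated in Python for a single numeric target with one predictor. The real ranger forests grow deep trees over many variables; here bagged one-split regression stumps stand in for the forest, and the names `stump` and `rf_impute` are ours:

```python
import random
from statistics import mean, median

def stump(xs, ys):
    """Fit a one-split regression stump: pick the split with the smallest
    squared error and return a predict function."""
    best = (float("inf"), None)
    for s in sorted(set(xs))[:-1]:
        left = [v for u, v in zip(xs, ys) if u <= s]
        right = [v for u, v in zip(xs, ys) if u > s]
        ml, mr = mean(left), mean(right)
        err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if err < best[0]:
            best = (err, (s, ml, mr))
    if best[1] is None:                # constant feature: predict the mean
        m = mean(ys)
        return lambda v: m
    s, ml, mr = best[1]
    return lambda v: ml if v <= s else mr

def rf_impute(x, y, num_trees=25, seed=0):
    """Algorithm A6 for one numeric target y with one predictor x:
    fill with the median, then overwrite with bagged-tree predictions."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    out = list(y)
    for i in mis:
        out[i] = median(y[j] for j in obs)     # step 1: simple initial fill
    trees = []
    for _ in range(num_trees):                 # bagging: one bootstrap per tree
        boot = [rng.choice(obs) for _ in obs]
        trees.append(stump([x[j] for j in boot], [y[j] for j in boot]))
    for i in mis:
        out[i] = mean(t(x[i]) for t in trees)  # forest prediction = mean over trees
    return out
```

Averaging over bootstrapped trees is what lets the forest capture the non-linear relationships mentioned above while keeping the variance of the predictions low.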