Hot Deck Imputation in Python

Question

I have been trying to find Python code that would allow me to replace missing values in a dataframe's column. The focus of my analysis is in biostatistics so I am not comfortable with replacing values using means/medians/modes. I would like to apply the "Hot Deck Imputation" method.

I cannot find any Python functions or packages online that take a column of a dataframe and fill its missing values using the "Hot Deck Imputation" method.

I did, however, see this GitHub project and did not find it useful.

The following is an example of some of my data (assume this is a pandas dataframe):

| age | sex | bmi  | anesthesia score | pain level |
|-----|-----|------|------------------|------------|
| 78  | 1   | 40.7 | 3                | 0          |
| 55  | 1   | 25.3 | 3                | 0          |
| 52  | 0   | 25.4 | 3                | 0          |
| 77  | 1   | 44.9 | 3                | 3          |
| 71  | 1   | 26.3 | 3                | 0          |
| 39  | 0   | 28.2 | 2                | 0          |
| 82  | 1   | 27   | 2                | 1          |
| 70  | 1   | 37.9 | 3                | 0          |
| 71  | 1   | NA   | 3                | 1          |
| 53  | 0   | 24.5 | 2                | NA         |
| 68  | 0   | 34.7 | 3                | 0          |
| 57  | 0   | 30.7 | 2                | 0          |
| 40  | 1   | 22.4 | 2                | 0          |
| 73  | 1   | 34.2 | 2                | 0          |
| 66  | 1   | NA   | 3                | 1          |
| 55  | 1   | 42.6 | NA               | NA         |
| 53  | 0   | 37.5 | 3                | 3          |
| 65  | 0   | 31.6 | 2                | 2          |
| 36  | 0   | 29.6 | 1                | 0          |
| 60  | 0   | 25.7 | 2                | NA         |
| 70  | 1   | 30   | NA               | NA         |
| 66  | 1   | 28.3 | 2                | 0          |
| 63  | 1   | 29.4 | 3                | 2          |
| 70  | 1   | 36   | 3                | 2          |

I would like to apply a Python function that would allow me to input a column as a parameter and return the column with the missing values replaced with imputed values using the "Hot Deck Imputation" method.

I am using this for statistical modeling, with models such as linear and logistic regression fitted through `statsmodels.api`. I am not using this for machine learning.

Any help would be much appreciated!

Would [`bfill`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bfill.html) or [`ffill`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ffill.html) do? Both should be a type of "Hot Deck Imputation" (LOCF). – Sayandip Dutta, Dec 31 '19 at 09:12
How are the two methods different, and how do I know that they use Hot Deck Imputation? – Zakariah Siyaji, Dec 31 '19 at 09:14
`ffill` uses last observation carried forward (LOCF) Hot Deck Imputation. – Prayson W. Daniel, Dec 31 '19 at 09:57
Is there possibly a more precise way of filling the missing values, excluding means/modes/medians? – Zakariah Siyaji, Dec 31 '19 at 10:01

Prayson W. Daniel · Accepted Answer · 2020-08-06T07:14:42.043

You can use `ffill`, which performs last observation carried forward (LOCF), a simple form of Hot Deck Imputation.

    # ...
    # DataFrame.ffill replaces fillna(method='ffill'), which newer pandas versions deprecate
    df.ffill(inplace=True)
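LOCF takes its donor from whichever row happens to come before the gap, so the result depends on row order. A random hot deck instead draws each donor at random from the observed values, optionally restricted to rows in the same imputation class. The following is a minimal sketch of that idea, not a library function: the name `hot_deck_impute` and its parameters `by` and `random_state` are hypothetical, and it assumes the dataframe has a unique index.

```python
import numpy as np
import pandas as pd


def hot_deck_impute(df, column, by=None, random_state=None):
    """Return a copy of df[column] in which each missing value is replaced
    by a randomly drawn observed value (a donor) from the same column.

    If `by` is given, donors come only from rows in the same group.
    Groups without any observed value are left missing.
    Assumes df has a unique index (hypothetical helper, illustration only).
    """
    rng = np.random.default_rng(random_state)
    result = df[column].copy()

    # One block of index labels per imputation class (or the whole frame)
    groups = [df.index] if by is None else df.groupby(by).groups.values()

    for labels in groups:
        values = result.loc[labels]
        donors = values.dropna().to_numpy()
        missing_labels = values.index[values.isna()]
        if len(donors) and len(missing_labels):
            # Sample donors with replacement, one per missing entry
            result.loc[missing_labels] = rng.choice(donors, size=len(missing_labels))

    return result


# Small illustrative frame using values from the question's table
df = pd.DataFrame({
    "sex": [1, 1, 0, 1, 0, 0, 1],
    "bmi": [40.7, 25.3, 25.4, np.nan, 24.5, np.nan, 26.3],
})
df["bmi_imputed"] = hot_deck_impute(df, "bmi", by="sex", random_state=0)
print(df)
```

Because donors are sampled, every imputed value is one that was actually observed in the same class, and fixing `random_state` makes the draw reproducible. Which variables define the classes (here `sex`) is a modeling decision that depends on the data.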

Scikit-learn's `impute` module offers KNN, mean, median, most-frequent and other imputation methods (https://scikit-learn.org/stable/modules/impute.html).

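    # Requires scikit-learn >= 0.22
    from sklearn.impute import KNNImputer

    imputer = KNNImputer(n_neighbors=2, weights="uniform")

    # fit_transform expects 2-D input, hence the double brackets;
    # ravel() flattens the (n, 1) result back into a single column
    DF['imputed_x'] = imputer.fit_transform(DF[['bmi']]).ravel()

    print(DF['imputed_x'])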
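One caveat when passing a single column: `KNNImputer` measures distance on the other features of each row, and with only `bmi` supplied there are none, so as far as I can tell the missing entries fall back to the column mean. Passing several columns lets the neighbors be chosen by similarity. The sketch below illustrates this on a small frame built from values in the question's table; the choice of `age` and `sex` as matching features is only an assumption for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [78, 55, 52, 77, 71, 39, 71, 66],
    "sex": [1, 1, 0, 1, 1, 0, 1, 1],
    "bmi": [40.7, 25.3, 25.4, 44.9, 26.3, 28.2, np.nan, np.nan],
})

# Single column: no other feature to compute distances on,
# so the missing entries receive the mean of the observed values
single = KNNImputer(n_neighbors=2).fit_transform(df[["bmi"]]).ravel()

# Several columns: neighbors are found using age and sex,
# and bmi is filled with the average over the nearest donors
features = ["age", "sex", "bmi"]
imputed = KNNImputer(n_neighbors=2).fit_transform(df[features])
df["bmi_knn"] = imputed[:, features.index("bmi")]

print(single)
print(df)
```

Note that `KNNImputer` averages the neighbors' values, so the filled value need not be one that was actually observed. Setting `n_neighbors=1` copies the value of the single nearest donor, which is closer in spirit to a nearest-neighbor hot deck. Distances here are computed on the raw columns, so a feature with a large range such as `age` will dominate unless the features are scaled first.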

edited Aug 06 '20 at 07:14

answered Dec 31 '19 at 10:02

Prayson W. Daniel


Yes, give me a second. I will add examples – Prayson W. Daniel Dec 31 '19 at 10:07
I tried the code that you shared: ```impute = KNNImputer(n_neighbors=2, weights="uniform")``` ```DF['imputed_x'] = DF['bmi'].apply(lambda y: impute.fit_tranform(y),axis=0)``` ```print(DF['imputed_x'])``` The following was the error that I got: `TypeError: <lambda>() got an unexpected keyword argument 'axis'` – Zakariah Siyaji Dec 31 '19 at 10:32
Let me test it. – Prayson W. Daniel Dec 31 '19 at 10:38
Fixed. Note the `df[['x']]`: the double brackets select a DataFrame, not a Series. – Prayson W. Daniel Dec 31 '19 at 10:44
Thank you so much for all of the help! Can you please explain how to determine the best value for "n_neighbors"? – Zakariah Siyaji Dec 31 '19 at 10:50
That depends on your data. Try other imputers too, like `SimpleImputer` with different strategies (median, most frequent, ...), and see what you get. – Prayson W. Daniel Dec 31 '19 at 10:52
