3

I'm trying to learn how to implement MICE in imputing missing values for my datasets. I've heard about fancyimpute's MICE, but I also read that sklearn's IterativeImputer class can accomplish similar results. From sklearn's docs:

Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True

I've seen "seeds" being used in different pipelines, but I never understood them well enough to implement them in my own code. I was wondering if anyone could explain and provide an example on how to implement seeds for a MICE imputation using sklearn's IterativeImputer? Thanks!

Glenn G.

2 Answers

4

The behavior of IterativeImputer depends on its random state. This random state, set through the random_state parameter, is also called a "seed".

As stated in the documentation, we can get multiple imputations by setting sample_posterior to True and varying the random seed, i.e. the random_state parameter.
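To see what a seed does in isolation: it fixes the pseudo-random number generator, so two runs with the same random_state produce identical imputations. A small sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]

# Same seed -> identical results; different seeds -> (usually) different ones.
a = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X)
b = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X)
assert np.allclose(a, b)  # reproducible: the seed fixes the randomness
```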

Here is an example of how to use it:

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

X_train = [[1, 2],
           [3, 6],
           [4, 8],
           [np.nan, 3],
           [7, np.nan]]
X_test = [[np.nan, 2],
          [np.nan, np.nan],
          [np.nan, 6]]

for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    imp.fit(X_train)
    print(f"imputation {i}:")
    print(np.round(imp.transform(X_test)))

It outputs:

imputation 0:
[[ 1.  2.]
 [ 5. 10.]
 [ 3.  6.]]
imputation 1:
[[1. 2.]
 [0. 1.]
 [3. 6.]]
imputation 2:
[[1. 2.]
 [1. 2.]
 [3. 6.]]

We can observe the three different imputations.
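If all you need is a single filled-in dataset, one simple (if crude) way to pool is to average the imputations element-wise; note that proper multiple-imputation pooling (Rubin's rules) instead pools estimates from models fit on each imputed dataset. A sketch reusing the data above:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
X_test = [[np.nan, 2], [np.nan, np.nan], [np.nan, 6]]

# Collect the three imputed versions of X_test.
imputations = []
for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    imputations.append(imp.fit(X_train).transform(X_test))

# Element-wise mean across the imputations -> one pooled dataset.
pooled = np.mean(imputations, axis=0)
```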

Stanislas Morbieu
  • Would it be correct to pool the three imputations into a single set? If so, how would you accomplish this? I'm probably misunderstanding your explanation, but it looks like I would be creating 3 different datasets, each representing a different imputation seed. – Glenn G. Oct 29 '19 at 22:17
  • 2
    It is indeed creating 3 different datasets. How to use it depends on your final task (classification, regression, etc. or just to infer the missing values of your features?). I would suggest to ask another question, and it is probably better on Cross Validated than Stack Overflow. – Stanislas Morbieu Oct 29 '19 at 22:23
  • 1
    @GlennG. were you able to figure out how to pool the datasets into a single dataset? I am also currently in the same position, and would like to fill the missing values in my features. – Vandan Revanur Feb 01 '20 at 17:25
1

One way to stack the data is to adapt @Stanislas' code a bit, like so:

mvi = {}  # a dict is just my preference; a list works too
# mvi collects each imputed array, keyed by seed 0 through 2

for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    mvi[i] = np.round(imp.fit_transform(X_train))

Then combine the imputations into a single dataset using either:

import pandas as pd

# a. pandas concat (the arrays must be wrapped in DataFrames first), or
pd.concat([pd.DataFrame(v) for v in mvi.values()], axis=0)

# b. np.stack
dfs = np.stack(list(mvi.values()), axis=0)

pd.concat creates a 2D DataFrame, whereas np.stack creates a 3D array that you can reshape into 2D. The breakdown of the numpy 3D array is as follows:

  • axis 0: num of iterated dataframes
  • axis 1: len of original df (num of rows)
  • axis 2: num of columns in original dataframe

Create a 2D array from the 3D array

You can use numpy reshape like so:

np.reshape(dfs, newshape=(dfs.shape[0]*dfs.shape[1], -1))

which essentially multiplies axis 0 by axis 1 to stack the dataframes into one tall dataframe. The -1 at the end means "infer whatever axis is left over", in this case the columns.
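To sanity-check the shape arithmetic, here is a small self-contained sketch with dummy arrays standing in for three imputed copies of a 5-row, 2-column dataset:

```python
import numpy as np

# Hypothetical stand-in for three imputed datasets: filled with 0s, 1s, 2s.
arrs = [np.full((5, 2), float(i)) for i in range(3)]

dfs = np.stack(arrs, axis=0)  # shape (3, 5, 2): (imputations, rows, columns)
flat = np.reshape(dfs, (dfs.shape[0] * dfs.shape[1], -1))  # shape (15, 2)

# Rows 0-4 come from imputation 0, rows 5-9 from imputation 1, etc.
assert dfs.shape == (3, 5, 2)
assert flat.shape == (15, 2)
assert (flat[:5] == 0).all() and (flat[5:10] == 1).all()
```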

GSA