3

I'm trying to learn how to implement MICE in imputing missing values for my datasets. I've heard about fancyimpute's MICE, but I also read that sklearn's IterativeImputer class can accomplish similar results. From sklearn's docs:

Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True

I've seen "seeds" being used in different pipelines, but I never understood them well enough to implement them in my own code. I was wondering if anyone could explain and provide an example on how to implement seeds for a MICE imputation using sklearn's IterativeImputer? Thanks!

Glenn G.

2 Answers

4

The behavior of IterativeImputer depends on its random state. This random state, set through the random_state parameter, is also called a "seed".

As stated in the documentation, we can get multiple imputations by setting sample_posterior to True and varying the random seed, i.e. the random_state parameter.
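To see what a seed does in isolation: it fixes the pseudo-random number generator, so two runs with the same random_state produce identical imputations. A small sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]

# Same seed -> identical results; different seeds -> (usually) different ones.
a = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X)
b = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X)
assert np.allclose(a, b)  # reproducible: the seed fixes the randomness
```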

Here is an example of how to use it:

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

X_train = [[1, 2],
           [3, 6],
           [4, 8],
           [np.nan, 3],
           [7, np.nan]]
X_test = [[np.nan, 2],
          [np.nan, np.nan],
          [np.nan, 6]]

for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    imp.fit(X_train)
    print(f"imputation {i}:")
    print(np.round(imp.transform(X_test)))

It outputs:

imputation 0:
[[ 1.  2.]
 [ 5. 10.]
 [ 3.  6.]]
imputation 1:
[[1. 2.]
 [0. 1.]
 [3. 6.]]
imputation 2:
[[1. 2.]
 [1. 2.]
 [3. 6.]]

We can observe the three different imputations.
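If all you need is a single filled-in dataset, one simple (if crude) way to pool is to average the imputations element-wise; note that proper multiple-imputation pooling (Rubin's rules) instead pools estimates from models fit on each imputed dataset. A sketch reusing the data above:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
X_test = [[np.nan, 2], [np.nan, np.nan], [np.nan, 6]]

# Collect the three imputed versions of X_test.
imputations = []
for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    imputations.append(imp.fit(X_train).transform(X_test))

# Element-wise mean across the imputations -> one pooled dataset.
pooled = np.mean(imputations, axis=0)
```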

Stanislas Morbieu
  • Would it be correct to pool the three imputations into a single set? If so, how would you accomplish this? I'm probably misunderstanding your explanation, but it looks like I would be creating 3 different datasets, each representing a different imputation seed. – Glenn G. Oct 29 '19 at 22:17
  • 2
    It is indeed creating 3 different datasets. How to use it depends on your final task (classification, regression, etc. or just to infer the missing values of your features?). I would suggest to ask another question, and it is probably better on Cross Validated than Stack Overflow. – Stanislas Morbieu Oct 29 '19 at 22:23
  • 1
    @GlennG. were you able to figure out how to pool the datasets into a single dataset? I am also currently in the same position, and would like to fill the missing values in my features. – Vandan Revanur Feb 01 '20 at 17:25
1

One way to stack the data is to adapt @Stanislas' code a bit, like so:

mvi = {}  # a dict is just my preference; a list works too
# mvi collects each imputed array, keyed by seed 0 through 2

for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    mvi[i] = np.round(imp.fit_transform(X_train))

Then combine the imputations into a single dataset using either:

import pandas as pd

# a. pandas concat (the arrays must be wrapped in DataFrames first), or
pd.concat([pd.DataFrame(v) for v in mvi.values()], axis=0)

# b. np.stack
dfs = np.stack(list(mvi.values()), axis=0)

pd.concat creates a 2D DataFrame, whereas np.stack creates a 3D array that you can reshape into 2D. The breakdown of the numpy 3D array is as follows:

  • axis 0: num of iterated dataframes
  • axis 1: len of original df (num of rows)
  • axis 2: num of columns in original dataframe

Create a 2D array from the 3D array

You can use numpy reshape like so:

np.reshape(dfs, newshape=(dfs.shape[0]*dfs.shape[1], -1))

which essentially multiplies axis 0 by axis 1 to stack the dataframes into one tall dataframe. The -1 at the end means "infer whatever axis is left over", in this case the columns.
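To sanity-check the shape arithmetic, here is a small self-contained sketch with dummy arrays standing in for three imputed copies of a 5-row, 2-column dataset:

```python
import numpy as np

# Hypothetical stand-in for three imputed datasets: filled with 0s, 1s, 2s.
arrs = [np.full((5, 2), float(i)) for i in range(3)]

dfs = np.stack(arrs, axis=0)  # shape (3, 5, 2): (imputations, rows, columns)
flat = np.reshape(dfs, (dfs.shape[0] * dfs.shape[1], -1))  # shape (15, 2)

# Rows 0-4 come from imputation 0, rows 5-9 from imputation 1, etc.
assert dfs.shape == (3, 5, 2)
assert flat.shape == (15, 2)
assert (flat[:5] == 0).all() and (flat[5:10] == 1).all()
```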

GSA