
Sklearn implements an imputer called the IterativeImputer. I believe it works by predicting the missing feature values in a round-robin fashion, using an estimator.

It has an argument called sample_posterior but I can't seem to figure out when I should use it.

sample_posterior boolean, default=False

Whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation. Estimator must support return_std in its predict method if set to True. Set to True if using IterativeImputer for multiple imputations.

I looked at the source code, but it still wasn't clear to me. Should I use this if I have multiple features that I am going to fill using the iterative imputer, or should I use this if I plan to use the imputer multiple times, like for a training and then a validation set?

Campbell Hutcheson

1 Answer


Even with multiple features, and a training and validation/test set, you don't need sample_posterior. The "multiple imputations" part of the docstring means generating more than one completed dataset, each with the missing values filled in differently; see e.g. the Wikipedia article on multiple imputation.

Normally, IterativeImputer imputes the missing values of a feature using the predictions of a model built on the other features (iteratively, round robin, etc.). If you use a model that produces not just a single prediction but an output distribution (the posterior), then you can sample from that distribution at random, hence sample_posterior. By running it multiple times with different random seeds, those random draws differ, and you get multiple imputed datasets; a minimal sketch is below. The documentation on that isn't great, but there's a (somewhat aged) PR for an extended example, and a toy example on SO.
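For concreteness, here's a rough sketch of that idea (the data are made up, and it assumes the default BayesianRidge estimator, which supports return_std): with sample_posterior=True and a different random_state per run, each run fills the missing entry with a different draw from the predictive posterior, giving several completed datasets.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Toy data with one missing value (illustrative only).
    X = np.array([[1.0, 2.0],
                  [3.0, 6.0],
                  [4.0, 8.0],
                  [np.nan, 10.0]])

    # Multiple imputation: several imputers, each sampling from the
    # posterior with a different seed, give several completed datasets.
    imputed_datasets = []
    for seed in range(5):
        imp = IterativeImputer(sample_posterior=True, random_state=seed)
        imputed_datasets.append(imp.fit_transform(X))

    # The imputed value differs across runs, reflecting the model's
    # uncertainty; with sample_posterior=False (the default) every run
    # would return the same single (mean) prediction.
    for Xi in imputed_datasets:
        print(Xi[3, 0])

You would then fit your downstream model on each completed dataset and pool the results, which is the point of multiple imputation.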

Ben Reiniger
  • The PR link was helpful - and there's some more in the sklearn docs here: https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation – Brian Bien Dec 28 '21 at 13:07