
I am currently learning about sklearn's imputers, and I found that there is one strategy they don't implement.

I would like to build a pipeline that either deletes the columns with any missing values or deletes all the rows with missing values.

Why do I want this?

Because I would like to do a grid search and find the effect of each imputation method on my RMSE or classification score.

Is there a way I can do this with sklearn pipeline? Or should I create my own imputer?

If this has been asked before, feel free to suggest closing the question and point me to the correct resource.

For more context, I have 21 features and 1000 data points. Only one column has missing values, and 50% of the values in that column are missing. I just want to explore the effect of the missing value imputation method on my classifier's accuracy and F1 score.

Espoir Murhabazi
  • So you want to compare missingRowsRemoved vs missingColumnsRemoved vs imputationMethod1 vs imputationMethod2 etc? Is this right? – RSale Jan 06 '22 at 18:26
  • Yeah.. that is right @RSale – Espoir Murhabazi Jan 06 '22 at 18:27
  • This needs more context. What kind of data do you use. What kind of problem are you solving? You are doing a grid search on what? – RSale Jan 06 '22 at 18:28
  • Imputing is an art and choosing the right methods depends entirely on the data you have. – RSale Jan 06 '22 at 18:37
  • It is basically numerical data, and only one column has missing values. I am not doing a grid search yet; I am just exploring the effect of missing values on the accuracy score. – Espoir Murhabazi Jan 06 '22 at 18:58
  • Such a transformer wouldn't really be an "imputer". I don't know of a common package that provides such either. The "drop any column containing any missings" would be fairly easy to build out as a custom transformer. The "drop any row containing any missings" would be much more difficult, since sklearn assumes throughout that the rows stay in a fixed order and are neither dropped nor added. You might be able to make use of the `imblearn` package and its resampling pipelines, but it would be a little hacky. – Ben Reiniger Jan 06 '22 at 18:58
  • Ah, it looks like I missed the whole point of an imputer. Yes, you are right, it wouldn't be an imputer. – Espoir Murhabazi Jan 06 '22 at 19:01

1 Answer


I would suggest using the autoimpute library. In my opinion, it's probably the best tool currently available for dealing with datasets that have missing values.

It has a function that does exactly what you asked for: it deletes rows with any missing values.

from autoimpute.imputations import MiceImputer, SingleImputer, listwise_delete

listwise_delete(df, inplace=True, verbose=False)
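For comparison, both deletion strategies from the question can also be sketched in plain pandas with `DataFrame.dropna`. This is only a minimal illustration; the toy `df` below is made up and is not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only column "b" has missing values.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [np.nan, 5.0, np.nan, 7.0],
        "y": [0, 1, 0, 1],
    }
)

# Drop every row that contains at least one missing value (listwise deletion).
rows_dropped = df.dropna(axis=0)

# Drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

print(rows_dropped.shape)
print(list(cols_dropped.columns))
```

Note that dropping rows this way happens before the data enters any pipeline, so the target column is filtered together with the features.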

In general, I find sklearn's imputer quite limited, and autoimpute fills a lot of the gaps. More specifically, it lets you:

  • Explicitly set columns that you would like to treat as variables in calculating the imputed values
  • Set different imputation algorithms for every column or a set of columns
si_dict_col = SingleImputer(
    strategy={"gender": "categorical", "salary": "pmm", "weight": "pmm"},
    predictors={"gender": ["salary", "weight", "looks"], "salary": ["weight", "gender"]},
)

  • Use built-in methods to visualize the results of different imputation methods
plot_imp_scatter(data_het_miss, "x", "y", "least squares")

It also follows sklearn's patterns and can be substituted for sklearn's own imputer function in the pipeline.
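To compare "drop the incomplete columns" against imputation inside a grid search, one option is to make the whole missing-value step a tunable pipeline parameter. The following is a minimal sketch using sklearn only: `DropMissingColumns` is a hypothetical custom transformer (not part of sklearn or autoimpute), and the synthetic data merely mimics the shape described in the question (1000 rows, 21 features, one column about half missing).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DropMissingColumns(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drop every column that had a NaN during fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Remember which columns were complete in the training data.
        self.keep_mask_ = ~np.isnan(X).any(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X[:, self.keep_mask_]


# Synthetic stand-in for the question's data (an assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 21))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(1000) < 0.5, 20] = np.nan  # roughly half of the last column is missing

pipe = Pipeline(
    [
        ("missing", SimpleImputer()),  # placeholder, replaced by the grid below
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Each candidate is a complete strategy for handling the missing values.
param_grid = {
    "missing": [
        DropMissingColumns(),
        SimpleImputer(strategy="mean"),
        SimpleImputer(strategy="median"),
    ]
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)

for params, score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    print(params["missing"], round(score, 3))
```

Dropping rows does not fit this pattern, because a transformer in a sklearn pipeline cannot change the number of samples or filter `y`; that comparison has to be done by deleting the rows before fitting.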

user4718221