sklearn imputer drop column with missing values

Question

I am currently learning about sklearn's imputers, and I found that there is one strategy they don't implement.

I would like to build a pipeline that either deletes all columns with any missing values or deletes all rows with any missing values.

Why do I want this?

Because I would like to do a grid search and find the effect of each imputation method on my RMSE or classification score.

Is there a way I can do this with an sklearn pipeline? Or should I create my own imputer?

If this has been asked before, feel free to suggest closing the question and pointing me to the correct resource.

For more context, I have 21 features and 1000 data points. Only one column has missing values, and 50% of the values in that column are missing. I just want to explore the effect of the missing value imputation method on my classifier's accuracy and F1 score.

So you want to compare missingRowsRemoved vs missingColumnsRemoved vs imputationMethod1 vs imputationMethod2 etc? Is this right? — RSale, Jan 06 '22 at 18:26
This needs more context. What kind of data are you using? What kind of problem are you solving? What are you running the grid search on? — RSale, Jan 06 '22 at 18:28
Imputing is an art and choosing the right methods depends entirely on the data you have. — RSale, Jan 06 '22 at 18:37
It is basically numerical data, and only one column has missing values. I am not doing a grid search yet; I am just exploring the effect of missing values on the accuracy score. — Espoir Murhabazi, Jan 06 '22 at 18:58
Such a transformer wouldn't really be an "imputer". I don't know of a common package that provides such either. The "drop any column containing any missings" would be fairly easy to build out as a custom transformer. The "drop any row containing any missings" would be much more difficult, since sklearn assumes throughout that the rows stay in a fixed order and are neither dropped nor added. You might be able to make use of the `imblearn` package and its resampling pipelines, but it would be a little hacky. — Ben Reiniger, Jan 06 '22 at 18:58
Ahh, it looks like I missed the whole point of an imputer. Yeah, you are right, it wouldn't be an imputer. — Espoir Murhabazi, Jan 06 '22 at 19:01
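
The "drop any column containing any missings" transformer mentioned in the comment above could look roughly like the sketch below. `DropMissingColumns` is a hypothetical name, not an sklearn class, and the data is synthetic (shaped like the 1000 x 21 set described in the question, with one half-missing column). Swapping the whole pipeline step through `param_grid` lets a grid search compare column deletion against `SimpleImputer` strategies:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DropMissingColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drop columns that contained NaN during fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Keep only the columns that were complete in the training data.
        self.keep_mask_ = ~np.isnan(X).any(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X[:, self.keep_mask_]


# Synthetic stand-in for the question's data: 1000 rows, 21 features,
# with half of the values in one column set to NaN.
X, y = make_classification(n_samples=1000, n_features=21, random_state=0)
rng = np.random.default_rng(0)
missing_rows = rng.choice(1000, size=500, replace=False)
X[missing_rows, 3] = np.nan

pipe = Pipeline([
    ("missing", SimpleImputer()),  # placeholder, replaced by the grid below
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "missing": [
        DropMissingColumns(),
        SimpleImputer(strategy="mean"),
        SimpleImputer(strategy="median"),
    ]
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)

# One mean cross-validated F1 score per way of handling the missing column.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params["missing"], round(score, 3))
```

This sketch only removes columns that were incomplete at fit time, so a column that is complete in training but contains NaN at prediction time would still pass NaN through. Dropping rows is deliberately left out because, as the comment says, sklearn expects transformers to leave the rows intact.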

user4718221 · Answer 1 · 2022-01-09T08:26:56.983

I would suggest using the autoimpute library. It is probably the best tool currently available for dealing with datasets that have missing values.

It has a function that does what you asked for rows: it deletes every row with any missing value.

from autoimpute.imputations import MiceImputer, SingleImputer, listwise_delete

listwise_delete(df, inplace=True, verbose=False)
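
For comparison, the same row deletion, plus the column-wise variant from the question, can be done with plain pandas outside of any pipeline. This is a minimal sketch on a made-up toy frame (`df` and its column names are illustrative, not the question's data):

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" has one missing value; the other columns are complete.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

rows_dropped = df.dropna(axis=0)  # drop every row that contains a NaN
cols_dropped = df.dropna(axis=1)  # drop every column that contains a NaN

print(rows_dropped.shape)  # (2, 3)
print(cols_dropped.shape)  # (3, 2)
```

Both calls return new frames and leave `df` unchanged, so the two variants can be fed to the same model and compared side by side.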

In general, sklearn's imputers are quite limited, and autoimpute fills a lot of the gaps. More specifically, it lets you:

Explicitly choose which columns are used as predictors when calculating the imputed values
Set a different imputation algorithm for each column or set of columns

si_dict_col = SingleImputer(
    strategy={"gender": "categorical", "salary": "pmm", "weight": "pmm"},
    predictors={"gender": ["salary", "weight", "looks"], "salary": ["weight", "gender"]},
)

There are built-in functions for visualizing the results of different imputation methods

from autoimpute.visuals import plot_imp_scatter
plot_imp_scatter(data_het_miss, "x", "y", "least squares")

It also follows sklearn's fit/transform pattern, so its imputers can be substituted for sklearn's own imputers in a pipeline.

@EspoirMurhabazi glad to be able to help! Does my post answer your question? — user4718221, Jan 10 '22 at 02:37
I haven't tried it yet. Once I have, I will mark it as the answer. — Espoir Murhabazi, Jan 10 '22 at 18:50
