I have a pandas dataframe that has some NaN values in a particular column:
1291 NaN
1841 NaN
2049 NaN
Name: some column, dtype: float64
And I have made the following pipeline in order to deal with it:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

scaler = StandardScaler(with_mean=True)
imputer = SimpleImputer(strategy='median')
logistic = LogisticRegression()

# Impute missing values first, then scale, then fit the classifier.
pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])
Now when I pass this pipeline to a RandomizedSearchCV, I get the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
It's actually quite a bit longer than that; I can post the entire error in an edit if necessary. Anyway, I am quite sure that this column is the only column that contains NaNs. Moreover, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline works just fine in my RandomizedSearchCV. I checked the documentation, but it seems that SimpleImputer is supposed to behave in (nearly) the exact same way as Imputer. What is the difference in behavior? How do I use an imputer in my pipeline without using the deprecated Imputer?
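Since the error message mentions infinity as well as NaN, the "only one column has NaNs" assumption can be checked roughly as follows. This is only a sketch: the DataFrame `df` and the column "other column" are made-up stand-ins for illustration, not my real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in data; the real frame and its columns are not shown here.
df = pd.DataFrame({
    "some column": [1.0, np.nan, 3.0, np.nan],
    "other column": [0.5, np.inf, 2.0, 4.0],
})

# Count NaNs and infinities separately, because the ValueError covers both.
nan_counts = df.isna().sum()
inf_counts = df.isin([np.inf, -np.inf]).sum()

print("Columns with NaN:")
print(nan_counts[nan_counts > 0])
print("Columns with +/-inf:")
print(inf_counts[inf_counts > 0])

# Sanity check on the NaN-only column: median imputation should leave no NaN behind.
imputed = SimpleImputer(strategy="median").fit_transform(df[["some column"]])
print("NaN left after imputing:", np.isnan(imputed).any())
```

If the second count is non-zero for any column on the real data, the NaN column would not be the only source of non-finite values.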