I have a question about stacking multiple scikit-learn SimpleImputers in a Pipeline:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
pipeline = Pipeline([
    ('si1', SimpleImputer(missing_values=np.nan,
                          strategy='constant',
                          fill_value=-1)),
    ('si2', SimpleImputer(missing_values=None,
                          strategy='constant',
                          fill_value=-1))
])
train = pd.DataFrame({'f1': [True, 1, 0], 'f2': [None, None, None]})
test1 = pd.DataFrame({'f1': [0, False, 0], 'f2': [np.nan, np.nan, np.nan]})
test2 = pd.DataFrame({'f1': [0, 0, 0], 'f2': [np.nan, np.nan, np.nan]})
pipeline.fit_transform(train)
pipeline.transform(test1)
pipeline.transform(test2)
The code works fine for transforming test1 (which contains a Boolean value), but fails for test2 with:
ValueError: 'X' and 'missing_values' types are expected to be both numerical. Got X.dtype=float64 and type(missing_values)=<class 'NoneType'>.
Apparently, the transformation works when the data contains a string or Boolean value, but fails when all values are numeric.
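To narrow this down, here is a minimal sketch that only inspects the dtypes of the two test frames. I'm assuming that the NumPy array produced by `to_numpy()` has roughly the same dtype the imputers end up seeing, which matches the `X.dtype=float64` in the error message but is not something I have confirmed in the scikit-learn source.

```python
import numpy as np
import pandas as pd

test1 = pd.DataFrame({'f1': [0, False, 0], 'f2': [np.nan, np.nan, np.nan]})
test2 = pd.DataFrame({'f1': [0, 0, 0], 'f2': [np.nan, np.nan, np.nan]})

# Per-column dtypes: mixing ints and a bool in 'f1' makes pandas
# store that column as object instead of a numeric dtype.
print(test1.dtypes)
print(test2.dtypes)

# Whole-frame array dtypes (assumption: close to what the imputers receive).
arr1 = test1.to_numpy()
arr2 = test2.to_numpy()
print(arr1.dtype, arr2.dtype)
```

So the only difference between the two frames seems to be that test1 is an object array while test2 is purely numeric.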
Another weird behavior is when I switch the order of the imputers inside the Pipeline:
pipeline = Pipeline([
    ('si2', SimpleImputer(missing_values=None,
                          strategy='constant',
                          fill_value=-1)),
    ('si1', SimpleImputer(missing_values=np.nan,
                          strategy='constant',
                          fill_value=-1))
])
In this case, the transformations for test1 and test2 both fail, with the following errors respectively:
ValueError: Input contains NaN
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I'm aware that this kind of transformation can easily be done with pandas.DataFrame.replace. But I'm confused by the behavior and would appreciate an explanation of what's going on in each of these scenarios.
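For completeness, this is roughly the pandas route I have in mind. Note that the sketch uses `DataFrame.fillna` rather than `replace`, since `fillna` treats both `None` and `np.nan` as missing; that swap is my own choice for illustration.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'f1': [True, 1, 0], 'f2': [None, None, None]})
test2 = pd.DataFrame({'f1': [0, 0, 0], 'f2': [np.nan, np.nan, np.nan]})

# fillna considers both None and np.nan to be missing values,
# so a single call covers both frames.
train_filled = train.fillna(-1)
test2_filled = test2.fillna(-1)

print(train_filled)
print(test2_filled)
```

This works for my data, but it does not explain why the two imputers behave so differently inside the Pipeline.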