I have a pandas dataframe that has some NaN values in a particular column:

1291   NaN
1841   NaN
2049   NaN
Name: some column, dtype: float64

And I have made the following pipeline in order to deal with it:

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True)
imputer = SimpleImputer(strategy='median')
logistic = LogisticRegression()

# Impute first so the scaler and the classifier never see NaN.
pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])

Now when I pass this pipeline to a RandomizedSearchCV, I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The full traceback is quite a bit longer than that; I can post it in an edit if necessary. I am quite sure that this column is the only one that contains NaNs. Moreover, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline works just fine in my RandomizedSearchCV. I checked the documentation, and SimpleImputer seems to be meant to behave in (nearly) the same way as Imputer. What is the difference in behavior? How do I use an imputer in my pipeline without using the deprecated Imputer?

Marcel
  • Do you get the same error if you run `SimpleImputer` independently (not from a pipeline)? – Yohanes Gultom Aug 08 '18 at 08:35
  • I get the same error when I pass `SimpleImputer(strategy='constant', fill_value=0)`. – fixxxer Oct 09 '18 at 04:17
  • Comment by @FrédérandOuweric: Did you check that the target variable doesn't contain NaN values? The Imputer only handles missing values in the input features. – iled Dec 02 '18 at 03:17
  • I had the same issue. It turned out I had to specify `missing_values=None` explicitly. I would expect this to be the default behaviour, actually. – sist Dec 17 '20 at 16:19
  • This issue seems to be solved here: https://github.com/scikit-learn/scikit-learn/issues/21112 – Henry Tseng Jan 04 '22 at 20:05
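
The checks suggested in the comments above can be sketched as follows. This is a minimal example on synthetic data, not the asker's dataset: the array shapes, the `logistic__C` candidates, and all variable names are assumptions made for illustration. It runs `SimpleImputer` on its own, confirms that the target holds no NaN (the imputer only ever sees the features), and then fits the same three-step pipeline inside `RandomizedSearchCV`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 rows, 3 features, NaN only in column 0.
X = rng.normal(size=(200, 3))
y = (X[:, 1] + X[:, 2] > 0).astype(int)
X[rng.choice(200, size=20, replace=False), 0] = np.nan

# Check 1: does SimpleImputer remove the NaNs when run on its own?
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
nan_left_in_features = bool(np.isnan(X_imputed).any())

# Check 2: the imputer never touches y, so NaN in the target would still raise.
nan_in_target = bool(np.isnan(y.astype(float)).any())

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True)),
    ("logistic", LogisticRegression()),
])

# Hypothetical search space; only the regularisation strength is varied.
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logistic__C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X, y)
best_c = search.best_params_["logistic__C"]
```

If both checks come back `False` and the search still fails on the real data, the NaN is probably reaching the estimator some other way, for example through the target or through a column that bypasses the imputer.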

2 Answers

SimpleImputer in make_pipeline

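from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ColumnSelector is not part of scikit-learn; it is assumed to be a
# user-defined or third-party transformer that picks DataFrame columns.
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ('Handle numeric columns', make_pipeline(
            ColumnSelector(columns=['Amount']),
            SimpleImputer(strategy='constant', fill_value=0),
            StandardScaler()
        )),
        ('Handle categorical data', make_pipeline(
            ColumnSelector(columns=['Type', 'Name', 'Changes']),
            SimpleImputer(strategy='constant', missing_values=' ', fill_value='missing_value'),
            # `sparse` was renamed to `sparse_output` in scikit-learn 1.2.
            OneHotEncoder(sparse=False)
        ))
    ])
)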
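
Because `ColumnSelector` is not defined in the snippet above, here is a self-contained sketch of the same idea that swaps in scikit-learn's own `ColumnTransformer` for the `FeatureUnion` plus selector combination. The toy DataFrame, the use of NaN (rather than `' '`) as the missing marker, and the `handle_unknown="ignore"` setting are assumptions for illustration; only the column names come from the answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Amount"]
categorical_cols = ["Type", "Name", "Changes"]

# Toy frame (assumption) reusing the column names from the answer.
df = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 20.0],
    "Type": ["a", "b", np.nan, "a"],
    "Name": ["x", np.nan, "y", "x"],
    "Changes": ["u", "v", "u", np.nan],
})
# Keep the categorical columns as plain object dtype.
df[categorical_cols] = df[categorical_cols].astype(object)

preprocess = ColumnTransformer(
    transformers=[
        ("numeric", make_pipeline(
            SimpleImputer(strategy="constant", fill_value=0),
            StandardScaler(),
        ), numeric_cols),
        ("categorical", make_pipeline(
            SimpleImputer(strategy="constant", fill_value="missing_value"),
            OneHotEncoder(handle_unknown="ignore"),
        ), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
```

Each branch imputes before any step that cannot handle missing values, so no NaN should remain in `features`.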

SimpleImputer in Pipeline

# TypeSelector is a user-defined transformer (not part of scikit-learn)
# that keeps only the columns of the given dtype.
('features', FeatureUnion([
    ('Numerics', Pipeline([
        ('Numeric Extractor', TypeSelector(np.number)),
        ('Impute Zero', SimpleImputer(strategy="constant", fill_value=0))
    ])),
    ('Cat Columns', Pipeline([
        ('Category Extractor', TypeSelector("category")),
        ('Impute Missing', SimpleImputer(strategy="constant", fill_value='missing'))
    ]))
]))
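
The answer does not show `TypeSelector`, so the following is only a guess at what such a helper might look like: a hypothetical transformer built on pandas' `DataFrame.select_dtypes`. The class body, the toy DataFrame, and the variable names are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TypeSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only DataFrame columns of a given dtype."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Nothing to learn; selection depends only on the column dtypes.
        return self

    def transform(self, X):
        return X.select_dtypes(include=[self.dtype])


# Toy frame (assumption): one numeric and one categorical column.
df = pd.DataFrame({
    "amount": [1.0, np.nan, 3.0],
    "kind": pd.Series(["a", None, "b"], dtype="category"),
})

numeric_part = TypeSelector(np.number).transform(df)
category_part = TypeSelector("category").transform(df)
```

With a helper like this, each `FeatureUnion` branch receives only the columns its imputer is meant to handle.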
hanzgs

I had the same issue but this addressed it:

imputer = SimpleImputer(strategy = 'median', fill_value = 0)
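
As far as the scikit-learn documentation describes it, `fill_value` is only used when `strategy='constant'`, so the call above should impute the column median exactly as it would without that argument. The small check below, on a made-up single-column array, is a sketch for verifying this on your own installation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with one missing entry; the median of 1, 3, 10 is 3.
X = np.array([[1.0], [np.nan], [3.0], [10.0]])

with_fill = SimpleImputer(strategy="median", fill_value=0).fit_transform(X)
without_fill = SimpleImputer(strategy="median").fit_transform(X)
constant = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

# Value written into the missing slot by each configuration.
filled_by_median = with_fill[1, 0]
filled_by_constant = constant[1, 0]
```

If `with_fill` and `without_fill` match, any change in behaviour after adding `fill_value` to a median imputer likely came from something else in the setup.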
Bsquare ℬℬ
K.K.