
Issue

I am using the feature-engine library. When I create an sklearn Pipeline that uses SklearnTransformerWrapper to wrap a OneHotEncoder, I get the following error when running cross-validation:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
...
9 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.

Below are more details about the failures:
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

If I do things the "old way" with an sklearn ColumnTransformer, I do not get the error. I also don't get errors if I either: A) Score without cross-validation or B) Don't use the categorical features (i.e. remove the one-hot encoding).

Is this an issue with SklearnTransformerWrapper or am I using it the wrong way?

Code

Here is the Pipeline setup with SklearnTransformerWrapper that fails. It runs successfully if we don't use the categorical features, or if we don't do cross-validation (see comments in code):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression

from feature_engine.wrappers import SklearnTransformerWrapper
from feature_engine.selection import DropFeatures


pipeline_new = Pipeline(steps=[
    ("scale_b_c", SklearnTransformerWrapper(
            transformer=StandardScaler(), 
            variables=["b", "c"]
        )
    ),
    
    # Comment out this step for cross-validation to not fail
    ("encode_a_d", SklearnTransformerWrapper(
            # In scikit-learn 1.2+, `sparse` is named `sparse_output`
            transformer=OneHotEncoder(drop="first", sparse=False),
            variables=["a", "d"]
        )
    ),
    
    ("cleanup", DropFeatures(["a", "d"])),
    ("model", LinearRegression())
])

# Defined later (putting main example up front)
# Set cv to False to successfully score entire training set
do_test(df, pipeline_new, cv=True)

Here is the "old-style" pipeline that uses ColumnTransformer instead; it works correctly:

from sklearn.compose import ColumnTransformer


pipeline_old = Pipeline(steps=[
    (
        "xform", ColumnTransformer([
            ("cat", OneHotEncoder(drop="first"), ["a", "d"]),
            ("num", StandardScaler(), ["b", "c"])
        ])
    ),
    ("model", LinearRegression())
])

# Defined later (putting main example up front)
do_test(df, pipeline_old, cv=True)

Supporting code: implementation of the do_test() test function:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# do_test() implementation
def do_test(df, pipeline, cv=True):
    X = df.drop(columns=["y"])
    y = df[["y"]]
       
    if cv:
        return cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)
    else:
        pipeline.fit(X, y)
        y_pred = pipeline.predict(X)        
        return mean_squared_error(y, y_pred)

Supporting code: sample data creation.

import pandas as pd
import numpy as np

# Create sample data
n = 20000
df = pd.DataFrame({
    "a": [["alpha", "beta", "gamma", "delta"][np.random.randint(4)] for i in range(n)],
    "b": [np.random.random() * 100 for i in range(n)],
    "c": [np.random.random() * 200 for i in range(n)],
    "d": [["east", "west"][np.random.randint(2)] for i in range(n)],
})

def make_y(x):
    add_1 = 100 if x.a in ["alpha", "beta"] else 200
    add_2 = 100 if x.d in ["east"] else 300

    return 2 * x.b + 3 * x.c + 2 * add_1 + 5 * add_2 + np.random.normal(10)

df["y"] = df.apply(make_y, axis=1)

Note: I am not doing train/test separation, in order to keep the question shorter.

desertnaut
sparc_spread

2 Answers


It is simple enough to verify that the "encode_a_d" step in the pipeline with SklearnTransformerWrapper produces NaNs during cross-validation:

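from sklearn.model_selection import KFold

X = df.drop(columns=["y"])  # same feature frame as in do_test()
kf = KFold(n_splits=10)

for train_index, test_index in kf.split(X):
    # kf.split() yields positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    X_train_pipe = pipeline_new["encode_a_d"].fit_transform(pipeline_new["scale_b_c"].fit_transform(X_train))
    X_test_pipe = pipeline_new["encode_a_d"].fit_transform(pipeline_new["scale_b_c"].fit_transform(X_test))
    # True means the transformed fold contains at least one NaN
    print(X_train_pipe.isnull().any().any(), X_test_pipe.isnull().any().any())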

It seems that the step doubles the number of rows: it puts NaNs in features ['b', 'c'] on the rows where the one-hot-encoded features formed from ['a', 'd'] have their usual values, and vice versa. As to why this happens, I have no idea. feature-engine may be at fault, but from my experience it may well be some wicked trickery on cross_val_score's part.
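The row doubling and the complementary NaN pattern can be checked directly on any transformed frame. The helper below is a hypothetical utility (`nan_report` is not part of pandas, scikit-learn or feature-engine), and it is shown on a small hand-built frame rather than on real pipeline output:

```python
import numpy as np
import pandas as pd


def nan_report(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Summarise how a transform changed the row count and where NaNs appear."""
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "nan_per_column": after.isnull().sum().to_dict(),
        # Number of rows in which at least one value is missing.
        "rows_with_nan": int(after.isnull().any(axis=1).sum()),
    }


# Hand-built example of the pattern described above: two input rows become
# four, with "b" filled where the encoded column is NaN and vice versa.
before = pd.DataFrame({"b": [0.5, -0.5]}, index=[2, 3])
after = pd.DataFrame(
    {"b": [np.nan, np.nan, 0.5, -0.5], "a_beta": [1.0, 0.0, np.nan, np.nan]},
    index=[0, 1, 2, 3],
)

report = nan_report(before, after)
print(report)
```

With the loop above, `before` would be `X_train` and `after` would be `X_train_pipe`.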


The output that @AlwaysRightNeverLeft describes suggests an issue with the indexes: when cross-validating, each fold's dataframe keeps the row labels of the full dataset, so its index does not run from 0 to n-1, and when SklearnTransformerWrapper merges the one-hot encoded array with the original data, it effectively does an "outer join" on mismatched indexes.
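The index mismatch can be reproduced with pandas alone. The sketch below is not feature-engine's actual code. It assumes only that the wrapper builds a new DataFrame from the encoder's NumPy output and joins it to the input along the columns. The frame `X_fold` and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# A cross-validation fold keeps the row labels of the full dataset,
# so its index does not start at 0 (here: 3, 4, 5).
X_fold = pd.DataFrame({"b": [0.1, 0.2, 0.3]}, index=[3, 4, 5])

# Stand-in for the array returned by OneHotEncoder.transform().
encoded = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

# Wrapping the array without an index gives it a fresh RangeIndex (0, 1, 2).
enc_default = pd.DataFrame(encoded, columns=["a_beta", "a_gamma"])

# Column-wise concat aligns on the index with an outer join by default.
# No labels match, so the result has six rows and each row is part NaN.
misaligned = pd.concat([X_fold, enc_default], axis=1)
print(misaligned.shape)
print(misaligned.isnull().any().any())

# Reusing the fold's own index keeps the rows aligned.
enc_aligned = pd.DataFrame(
    encoded, columns=["a_beta", "a_gamma"], index=X_fold.index
)
aligned = pd.concat([X_fold, enc_aligned], axis=1)
print(aligned.shape)
print(aligned.isnull().any().any())
```

This would also explain why scoring without cross-validation succeeds: the full dataframe has the default index 0 to n-1, which happens to match the fresh index of the encoded frame.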

Ben Reiniger
  • Thanks for diagnosis here: https://github.com/feature-engine/feature_engine/issues/368#issuecomment-1026315846 Taking a look now – sparc_spread Feb 01 '22 at 21:04