
I am doing the following:

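from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# num_cols, cat_cols, df, target and cvFoldsNo are defined elsewhere;
# TestEncoder is my custom subclass of OneHotEncoder (described below).
def make_trans(verbose=False):
    ct = ColumnTransformer(
        [
            ('num', StandardScaler(), num_cols),
            ('cat', TestEncoder(), cat_cols),
        ],
        verbose=verbose,
    )
    return ct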


def make_pipe(clf, verbose=False):
    ct = make_trans(verbose)
    pipe = Pipeline([("transformer", ct), ("classifier", clf)], verbose=verbose)

    return pipe

lr3 = LogisticRegression()
lr3p = make_pipe(lr3)
scores = cross_val_score(lr3p, df, target, cv=cvFoldsNo, error_score="raise")

But it gives me this error: "ValueError: Found unknown categories ['some_val'] in column 7 during transform"

I have written a custom transformer class on top of OneHotEncoder that just prints the shapes of its input parameters and calls the base class. The shape of the input dataset is (32561, 14). During the cross_val_score call I get this:

fit_transform (26048, 8)
fit (26048, 8)
transform (26048, 8)
transform (6513, 8)
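
For reference, a minimal sketch of what such a shape-printing encoder might look like. The actual TestEncoder is not shown in this question, so the class name ShapeLoggingEncoder and its exact structure are assumptions made for illustration only:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


class ShapeLoggingEncoder(OneHotEncoder):
    """Hypothetical stand-in for TestEncoder: prints input shapes, then defers to OneHotEncoder."""

    def fit(self, X, y=None):
        print("fit", np.shape(X))
        return super().fit(X, y)

    def transform(self, X):
        print("transform", np.shape(X))
        return super().transform(X)

    def fit_transform(self, X, y=None):
        print("fit_transform", np.shape(X))
        return super().fit_transform(X, y)
```

Because no __init__ is overridden, the constructor arguments of OneHotEncoder are inherited unchanged, which keeps the class compatible with the cloning that cross_val_score performs for each fold.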

As I understand it, this means that the OneHotEncoder is never fit on the whole dataset. The data is split in such a way that the part used for fitting, (26048, 8), does not contain the value 'some_val', while the part used in the last transform, (6513, 8), does.

What is the proper way to use this encoder with a pipeline and cross_val_score?
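
To make the situation concrete, here is a small sketch on made-up data (train_fold and valid_fold are hypothetical stand-ins for one cross-validation split, not the real dataset). It reproduces the error with the default settings and shows one documented option, handle_unknown="ignore", which may or may not be the right choice for a given model:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for one training fold and one validation fold (made-up data).
train_fold = np.array([["a"], ["b"], ["a"]])
valid_fold = np.array([["b"], ["some_val"]])  # "some_val" never appears in train_fold

# Default behaviour (handle_unknown="error"): transform raises on the unseen category.
strict = OneHotEncoder().fit(train_fold)
try:
    strict.transform(valid_fold)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# With handle_unknown="ignore", the unseen category is encoded as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train_fold)
encoded = lenient.transform(valid_fold).toarray()
print(encoded)
```

In the pipeline above this would presumably correspond to constructing the 'cat' step with handle_unknown="ignore", assuming TestEncoder passes its constructor arguments through to OneHotEncoder.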

ilya