I am doing the following:
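from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_trans(verbose=False):
    # Scale the numeric columns and one-hot encode the categorical ones.
    ct = ColumnTransformer(
        [
            ('num', StandardScaler(), num_cols),
            ('cat', TestEncoder(), cat_cols)
        ], verbose=verbose
    )
    return ct

def make_pipe(clf, verbose=False):
    # Chain the preprocessing step and the classifier into one estimator.
    ct = make_trans(verbose)
    pipe = Pipeline([("transformer", ct), ("classifier", clf)], verbose=verbose)
    return pipe

lr3 = LogisticRegression()
lr3p = make_pipe(lr3)
scores = cross_val_score(lr3p, df, target, cv=cvFoldsNo, error_score="raise")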
But it gives me this error: "ValueError: Found unknown categories ['some_val'] in column 7 during transform"
I wrote a custom transformer class (TestEncoder above) on top of OneHotEncoder; it only prints the shape of its input and then calls the base class. The input dataset has shape (32561, 14). During the cross_val_score call I get this output:
fit_transform (26048, 8)
fit (26048, 8)
transform (26048, 8)
transform (6513, 8)
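For reference, a wrapper of this kind can be sketched roughly as follows. This is a minimal reconstruction, assuming the class simply subclasses OneHotEncoder and overrides fit, fit_transform and transform; the method bodies are an assumption, not the exact original class.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


class TestEncoder(OneHotEncoder):
    """OneHotEncoder that prints the shape of its input before delegating."""

    def fit(self, X, y=None):
        print("fit", np.shape(X))
        return super().fit(X, y)

    def fit_transform(self, X, y=None):
        print("fit_transform", np.shape(X))
        return super().fit_transform(X, y)

    def transform(self, X):
        print("transform", np.shape(X))
        return super().transform(X)


# Tiny usage example on made-up data (not the real dataset).
demo_encoder = TestEncoder()
demo_out = demo_encoder.fit_transform([["a"], ["b"], ["a"]])
```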
As I understand it, this means that OneHotEncoder is never fitted on the whole dataset. The data is split in such a way that the part used for fitting, (26048, 8), does not contain the value 'some_val', while the part used in the last transform, (6513, 8), does.
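The situation described here can be reproduced on a toy column. This is only a sketch with made-up data (X_train and X_test are hypothetical stand-ins for the two parts of a fold), using a plain OneHotEncoder with its default settings:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column: 'some_val' appears only in the held-out rows.
X_train = np.array([["a"], ["b"], ["a"]], dtype=object)
X_test = np.array([["b"], ["some_val"]], dtype=object)

enc = OneHotEncoder()  # default is handle_unknown="error"
enc.fit(X_train)       # the encoder only learns the categories 'a' and 'b'

try:
    enc.transform(X_test)
    error_message = None
except ValueError as exc:
    # The encoder rejects a category it did not see during fit.
    error_message = str(exc)

print(error_message)
```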
What is the proper way to use this encoder with a Pipeline and cross_val_score?
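One option that seems relevant is the handle_unknown="ignore" parameter of OneHotEncoder, which encodes a category not seen during fit as an all-zero row instead of raising. Whether that is the right choice for this dataset is an open question; the sketch below only shows that the same pipeline shape runs through cross_val_score under that setting. The DataFrame toy, its columns "age" and "job", and the labels y are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric column, one categorical column.
rng = np.random.default_rng(0)
n = 40
jobs = ["a", "b"] * (n // 2)
jobs[0] = "some_val"  # rare category present in exactly one row
toy = pd.DataFrame({"age": rng.normal(40, 10, size=n), "job": jobs})
y = np.array([0, 1] * (n // 2))

ct = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        # Unseen categories are encoded as all zeros instead of raising.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
    ]
)
pipe = Pipeline([("transformer", ct), ("classifier", LogisticRegression())])

# In the fold that holds out row 0, the encoder is fitted without 'some_val'.
scores = cross_val_score(pipe, toy, y, cv=4, error_score="raise")
print(scores)
```

The encoder is still fitted only on the training part of each fold, which is how cross_val_score is meant to work; the setting only changes how unseen values are handled at transform time.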