I have a use case for FunctionTransformer where the training examples need to be sorted, together with their true labels, based on some criterion:
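from operator import itemgetter

from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def sort_examples(X, y=None):
    # Sort the rows of the sparse matrix X by their number of stored non-zeros,
    # keeping each row's original index so the labels can be reordered too.
    Xt, indices = zip(*map(itemgetter(1, 2),
                           sorted([(x.nnz, x, i) for i, x in enumerate(X)],
                                  key=itemgetter(0))))
    if y is None:
        return Xt
    # Apply the same permutation to the labels.
    yt = [y[idx] for idx in indices]
    return Xt, yt

classifier = Pipeline(steps=[
    ('sorter', FunctionTransformer(func=sort_examples,
                                   validate=False,
                                   accept_sparse=True,
                                   pass_y=True)),  # forward y to func (older scikit-learn releases only)
    ('classifier', DummyClassifier())])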
The problem appears when I embed the FunctionTransformer instance, which wraps my function with pass_y=True (since y needs to be transformed too), in a Pipeline. By design, the Pipeline calls <FunctionTransformer instance>.fit(X, y).transform(X), so y is never passed to transform and no transformed y is returned.
As a consequence, the training examples are transformed and sorted, but they are no longer aligned with their true labels.
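To make the misalignment concrete, here is a minimal sketch of the call pattern described above. It uses plain Python stand-ins rather than scikit-learn itself: `SortingTransformer` and `run_step` are hypothetical names invented for this illustration, and sorting strings by length stands in for sorting sparse rows by `nnz`.

```python
# Hypothetical stand-ins that mimic the call pattern; these are not scikit-learn classes.

class SortingTransformer:
    """Sorts examples by length and, when y is given, reorders y to match."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X, y=None):
        order = sorted(range(len(X)), key=lambda i: len(X[i]))
        Xt = [X[i] for i in order]
        if y is None:
            return Xt
        return Xt, [y[i] for i in order]


def run_step(transformer, X, y):
    # y reaches fit(), but transform() only ever sees X,
    # so the original, unsorted y is what the next step receives.
    Xt = transformer.fit(X, y).transform(X)
    return Xt, y


X = ["ccc", "a", "bb"]
y = ["label_ccc", "label_a", "label_bb"]
Xt, y_out = run_step(SortingTransformer(), X, y)
print(Xt)     # examples are now ordered by length
print(y_out)  # labels are still in the original order
```

After this step the first example is "a" but the first label still belongs to "ccc", which is the mismatch I am seeing.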
My current workaround is to patch FunctionTransformer with a fit_transform method. Instead of relying on the default call to FunctionTransformer.transform, which does not receive y, it calls transform inside the fit_transform body and passes y along, so that y is transformed as well.
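A rough sketch of the idea behind this workaround, again with a plain Python stand-in rather than my actual patch. The class name `SortWithLabels` and its `key` parameter are hypothetical, chosen only for this illustration; the point is that `fit_transform` hands `y` to `transform` explicitly.

```python
# Hypothetical sketch; the class and method bodies are assumptions, not scikit-learn code.

class SortWithLabels:
    def __init__(self, key):
        self.key = key  # sort criterion applied to each example

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X, y=None):
        order = sorted(range(len(X)), key=lambda i: self.key(X[i]))
        Xt = [X[i] for i in order]
        if y is None:
            return Xt
        return Xt, [y[i] for i in order]

    def fit_transform(self, X, y=None):
        # Pass y through to transform instead of using fit(X, y).transform(X).
        return self.fit(X, y).transform(X, y)


sorter = SortWithLabels(key=len)
Xt, yt = sorter.fit_transform(["ccc", "a", "bb"], [3, 1, 2])
print(Xt, yt)  # examples and labels are reordered together
```

The caller still has to unpack the returned tuple. As far as I understand, a stock Pipeline treats whatever fit_transform returns as the transformed X, so this alone does not hand the reordered labels to the next step.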
I'm not sure whether this use case is a legitimate one for what FunctionTransformer is designed for. I would be deeply grateful if any scikit-learn experts could offer suggestions or a better solution for getting training examples and their corresponding labels transformed together in an automated pipeline.