
I have a set of N data points X = {x1, ..., xn} and a set of N target values / classes Y = {y1, ..., yn}.

The feature vector for a given yi is constructed by taking into account a "window" (for lack of a better term) of data points; e.g. I might want to stack "the last 4 data points", i.e. xi-4, xi-3, xi-2, xi-1, for prediction of yi.

Obviously, for a window size of 4, such a feature vector cannot be constructed for the first four target values, and I would like to simply drop them. Likewise for the last data point xn.
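To make the windowing concrete, this is roughly the preprocessing I have in mind (`make_windows` is just an illustrative name, not an existing helper):

```python
import numpy as np

def make_windows(X, y, window=4):
    """Stack the previous `window` points as the feature vector for each target.

    The first `window` targets are dropped (no complete window exists for
    them), and the last data point never appears in any feature vector.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    # Row i holds x[i], ..., x[i+window-1], used to predict y[i+window].
    feats = np.stack([X[i:i + window] for i in range(len(X) - window)])
    targets = y[window:]
    return feats, targets

X = np.arange(10)
y = np.arange(10)
feats, targets = make_windows(X, y, window=4)
# feats[0] is [0, 1, 2, 3] and predicts targets[0] == 4
```

The question is how to get this step (including the trimming of y) inside a Pipeline so that `window` becomes a tunable hyperparameter.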

This would not be a problem, except I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y matrix.

Is there a way to do this that I am unaware of, or am I stuck doing this as preprocessing outside of the pipeline? (Which means I would not be able to use GridSearchCV to find the optimal window size and shift.)

I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think that what I want to do is not supported in scikit-learn, but I wanted to make sure.

Matt M.

3 Answers


You are correct: you cannot adjust your target within a sklearn Pipeline. That doesn't mean you cannot do a grid search, but it does mean you may have to go about it in a somewhat more manual fashion. I would recommend writing a function to do your transformations and filtering on y, and then manually looping through a tuning grid created via ParameterGrid. If this doesn't make sense to you, edit your post with the code you have for further assistance.
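A rough sketch of what that manual loop could look like (the windowing helper and the choice of Ridge are just placeholders for your own transformation and estimator):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, ParameterGrid

def make_windows(X, y, window):
    # Illustrative stand-in for your own transformation/filtering on y:
    # row i holds x[i], ..., x[i+window-1] and predicts y[i+window].
    feats = np.stack([X[i:i + window] for i in range(len(X) - window)])
    return feats, y[window:]

X = np.arange(100, dtype=float)
y = np.arange(100, dtype=float)

best_score, best_params = -np.inf, None
for params in ParameterGrid({"window": [2, 4, 8], "alpha": [0.1, 1.0]}):
    # Re-window the data for every candidate window size, so X and y
    # stay aligned -- the thing a Pipeline cannot do for you.
    Xw, yw = make_windows(X, y, params["window"])
    fold_scores = []
    for train, test in KFold(n_splits=3).split(Xw):
        model = Ridge(alpha=params["alpha"]).fit(Xw[train], yw[train])
        fold_scores.append(model.score(Xw[test], yw[test]))
    if np.mean(fold_scores) > best_score:
        best_score, best_params = np.mean(fold_scores), params
```

For time-ordered data you would likely want TimeSeriesSplit instead of KFold, but the structure of the loop is the same.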

David
  • Yeah, that's what I meant. I can't just dump my pipeline into a GridSearchCV, which I find the most convenient way of doing CV. I'm fairly certain I can get it to work manually. Thanks – Matt M. Jan 12 '16 at 10:00
  • Is it worth raising this as a feature request? Seems like it would be a common requirement (for problems with more than one output variable) – Bill Apr 29 '20 at 19:08

I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.

I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. Then, the transform method passes on a tuple (X, self.y_stored) which is expected by the next estimator. This means I have to write wrapper estimators and it's very ugly, but it works!

Something like this:


class MyWrapperEstimator(RealEstimator):
    def fit(self, X, y=None):
        # The upstream transformer hands over (X, y_stored) as a tuple.
        if isinstance(X, tuple):
            X, y = X
        return super().fit(X=X, y=y)
Petio Petrov
  • Could you please explain this a little better? I'm facing the same issue and this seems to do the job. So your transform method returns (X, self.y_stored) and a wrapper makes the connection work? Could you please provide some code? Thanks in advance. – Angelo May 07 '20 at 07:31
  • To be honest, don't even remember the context of the work I did here, but based on a quick scan of the question and my answer, I made some edits that will hopefully help. – Petio Petrov May 12 '20 at 11:19

For your specific example of stacking the last 4 data points, you might be able to use seglearn.

>>> import numpy as np
>>> import seglearn
>>> x = np.arange(10)[None,:]
>>> x
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> y = x
>>> new_x, new_y, _ = seglearn.transform.SegmentXY(width=4, overlap=0.75).fit_transform(x, y)
>>> new_x
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8],
       [6, 7, 8, 9]])
>>> new_y
array([3, 4, 5, 6, 7, 8, 9])

seglearn claims to be scikit-learn-compatible, so you should be able to fit SegmentXY in the beginning of a scikit-learn pipeline. However, I have not tried it in a pipeline myself.

Charles