
I have a set of N data points X = {x1, ..., xn} and a set of N target values / classes Y = {y1, ..., yn}.

The feature vector for a given yi is constructed by taking into account a "window" (for lack of a better term) of data points; e.g. I might want to stack "the last 4 data points", i.e. xi-4, xi-3, xi-2, xi-1, for prediction of yi.

Obviously, for a window size of 4, such a feature vector cannot be constructed for the first four target values, and I would like to simply drop them. Likewise for the last data point xn.
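To make the windowing concrete, this is roughly the preprocessing I have in mind (`make_windows` is just an illustrative name, not an existing helper):

```python
import numpy as np

def make_windows(X, y, window=4):
    """Stack the previous `window` points as the feature vector for each target.

    The first `window` targets are dropped (no complete window exists for
    them), and the last data point never appears in any feature vector.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    # Row i holds x[i], ..., x[i+window-1], used to predict y[i+window].
    feats = np.stack([X[i:i + window] for i in range(len(X) - window)])
    targets = y[window:]
    return feats, targets

X = np.arange(10)
y = np.arange(10)
feats, targets = make_windows(X, y, window=4)
# feats[0] is [0, 1, 2, 3] and predicts targets[0] == 4
```

The question is how to get this step (including the trimming of y) inside a Pipeline so that `window` becomes a tunable hyperparameter.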

This would not be a problem, except I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y matrix.

Is there a way to do this that I am unaware of, or am I stuck doing this as preprocessing outside of the pipeline? (Which means I would not be able to use GridSearchCV to find the optimal window size and shift.)

I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think that what I want to do is not supported in scikit-learn, but I wanted to make sure.

Matt M.

3 Answers


You are correct: you cannot adjust your target within a sklearn Pipeline. That doesn't mean you cannot do a grid search, but it does mean you may have to go about it in a somewhat more manual fashion. I would recommend writing a function to do your transformations and filtering on y, and then manually looping through a tuning grid created via ParameterGrid. If this doesn't make sense to you, edit your post with the code you have for further assistance.
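A rough sketch of what that manual loop could look like (the windowing helper and the choice of Ridge are just placeholders for your own transformation and estimator):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, ParameterGrid

def make_windows(X, y, window):
    # Illustrative stand-in for your own transformation/filtering on y:
    # row i holds x[i], ..., x[i+window-1] and predicts y[i+window].
    feats = np.stack([X[i:i + window] for i in range(len(X) - window)])
    return feats, y[window:]

X = np.arange(100, dtype=float)
y = np.arange(100, dtype=float)

best_score, best_params = -np.inf, None
for params in ParameterGrid({"window": [2, 4, 8], "alpha": [0.1, 1.0]}):
    # Re-window the data for every candidate window size, so X and y
    # stay aligned -- the thing a Pipeline cannot do for you.
    Xw, yw = make_windows(X, y, params["window"])
    fold_scores = []
    for train, test in KFold(n_splits=3).split(Xw):
        model = Ridge(alpha=params["alpha"]).fit(Xw[train], yw[train])
        fold_scores.append(model.score(Xw[test], yw[test]))
    if np.mean(fold_scores) > best_score:
        best_score, best_params = np.mean(fold_scores), params
```

For time-ordered data you would likely want TimeSeriesSplit instead of KFold, but the structure of the loop is the same.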

David
  • Yeah, that's what I meant. I can't just dump my pipeline into a GridSearchCV, which I find the most convenient way of doing CV. I'm fairly certain I can get it to work manually. Thanks – Matt M. Jan 12 '16 at 10:00
  • Is it worth raising this as a feature request? Seems like it would be a common requirement (for problems with more than one output variable) – Bill Apr 29 '20 at 19:08

I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.

I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. Then, the transform method passes on a tuple (X, self.y_stored) which is expected by the next estimator. This means I have to write wrapper estimators and it's very ugly, but it works!

Something like this:


class MyWrapperEstimator(RealEstimator):
    def fit(self, X, y=None):
        # The upstream transformer hands over (X, y_stored) as a tuple.
        if isinstance(X, tuple):
            X, y = X
        return super().fit(X=X, y=y)
Petio Petrov
  • Could you please explain this a little better? I'm facing the same issue and this seems to do the job. So your transform method returns (X, self.y_stored) and a wrapper makes the connection work? Could you please provide some code? Thanks in advance. – Angelo May 07 '20 at 07:31
  • To be honest, don't even remember the context of the work I did here, but based on a quick scan of the question and my answer, I made some edits that will hopefully help. – Petio Petrov May 12 '20 at 11:19

For your specific example of stacking the last 4 data points, you might be able to use seglearn.

>>> import numpy as np
>>> import seglearn
>>> x = np.arange(10)[None,:]
>>> x
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> y = x
>>> new_x, new_y, _ = seglearn.transform.SegmentXY(width=4, overlap=0.75).fit_transform(x, y)
>>> new_x
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8],
       [6, 7, 8, 9]])
>>> new_y
array([3, 4, 5, 6, 7, 8, 9])

seglearn claims to be scikit-learn-compatible, so you should be able to fit SegmentXY in the beginning of a scikit-learn pipeline. However, I have not tried it in a pipeline myself.

Charles