
Here is a very small sklearn snippet:

from sklearn import linear_model, decomposition
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

logistic = linear_model.LogisticRegression()

pipe = Pipeline(steps=[
    ('scaler_2', MinMaxScaler()),
    ('pca', decomposition.NMF(n_components=6)),
    ('logistic', logistic),
])

from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)

pipe.fit(Xtrain, ytrain)    
ypred = pipe.predict(Xtest)

I get this error:

    raise ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to NMF (input X)

According to this question: Scaling test data to 0 and 1 using MinMaxScaler

I know this is because the lowest value in my test data was lower than in the training data, on which the MinMaxScaler was fitted.
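To see this concretely, here is a minimal sketch (with made-up numbers) of a MinMaxScaler fitted on training data mapping a smaller, unseen test value below zero:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on training data spanning [1, 5].
scaler = MinMaxScaler().fit(np.array([[1.0], [5.0]]))

# A test value below the training minimum maps below 0:
# (0 - 1) / (5 - 1) = -0.25, which NMF then rejects.
print(scaler.transform(np.array([[0.0]])))  # [[-0.25]]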

But I am wondering: is this a bug? It seems that MinMaxScaler (and all scalers) should be applied right before I do the prediction, and should not depend on the previously fitted training data. Am I right?

Or how can I correctly use preprocessing scalers with a Pipeline?

Thanks.

Bear Huang

2 Answers


This is not a bug. The main reason you put the scaler inside the pipeline is to prevent information from your test set leaking into your model. When you fit the pipeline on your training data, the MinMaxScaler stores the min and max of the training data, and it uses those values to scale any other data it later sees at prediction time. As you highlighted, these are not necessarily the min and max of your test set! So you can end up with negative values in your scaled test set whenever the minimum of the test set is smaller than the minimum of the training set.

You need a scaler that does not produce negative values. For instance, you can use sklearn.preprocessing.StandardScaler. Make sure you set the parameter with_mean=False; that way it will not center the data before scaling, but only scale your data to unit variance.
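For example, the pipeline from the question could be rewritten as below (a sketch, assuming the raw data is non-negative as noted above; the step names are arbitrary):

from sklearn import linear_model, decomposition
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    # with_mean=False skips centering, so non-negative input stays non-negative
    ('scaler', StandardScaler(with_mean=False)),
    ('nmf', decomposition.NMF(n_components=6)),
    ('logistic', linear_model.LogisticRegression()),
])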

MhFarahani
  • I understood some of your point, but I still can't figure it out: if I use StandardScaler to solve this problem, the data is not exactly what I want. If I use StandardScaler instead, or add it after MinMaxScaler, will the performance of the final classifier be affected? Thanks. – Bear Huang Aug 29 '16 at 15:37
  • @BearHuang, do you have to use the `MinMaxScaler`? Do you have negative values in your data set? There is no guarantee that the min of your training set is smaller than any unseen data! If you use `StandardScaler(with_mean=False)`, it will divide your data by the standard deviation (`X / np.sqrt(var)`; see the sketch after these comments). The scaled data will not be between 0 and 1, but it is still better than unscaled data. The only issue is that all your data needs to be positive before scaling. – MhFarahani Aug 30 '16 at 17:17
  • If I have to make the data fall between 0 and 1, how can I use sklearn correctly? Thanks. – Bear Huang Aug 31 '16 at 02:03
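A quick sketch of what the comment above describes, with made-up values:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [3.0], [5.0]])
scaler = StandardScaler(with_mean=False).fit(X)

# With with_mean=False the transform is just X / std (no centering),
# so positive inputs stay positive.
print(np.allclose(scaler.transform(X), X / X.std()))  # True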

If your data is stationary and sampled properly, you can assume that your test set resembles your training set to a large extent.

Therefore, you can expect the min/max over the test set to be close to the min/max over the training set, apart from a few outliers.

To decrease the chance of producing negative values with MinMaxScaler on the test set, simply scale your data not to the (0, 1) range, but leave some "safety space" for your transformer, like this:

MinMaxScaler(feature_range=(1,2))
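
A sketch with made-up values showing the effect: a test point moderately below the training minimum now still maps to a positive number:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training data spans [1, 5]; map it to (1, 2) instead of the default (0, 1).
scaler = MinMaxScaler(feature_range=(1, 2)).fit(np.array([[1.0], [5.0]]))

# A test value moderately below the training minimum stays positive:
# 1 + (0 - 1) / (5 - 1) = 0.75
print(scaler.transform(np.array([[0.0]])))  # [[0.75]]

Note that this only shrinks the risk; a test value far enough below the training minimum can still come out negative.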
Anatoly Alekseev