Here is a very small sklearn snippet:
from sklearn import linear_model, decomposition
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is gone in newer versions

logistic = linear_model.LogisticRegression()
pipe = Pipeline(steps=[
    ('scaler_2', MinMaxScaler()),
    ('pca', decomposition.NMF(n_components=6)),
    ('logistic', logistic),
])

# X, y are my feature matrix and labels
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
pipe.fit(Xtrain, ytrain)
ypred = pipe.predict(Xtest)
When I run it, I get this error:
raise ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to NMF (input X)
According to this question: Scaling test data to 0 and 1 using MinMaxScaler
I understand that this happens because the lowest value in my test data is lower than the minimum of the training data on which the MinMaxScaler was fit, so the scaled test data contains negative values, which NMF rejects.
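To check my understanding, here is a minimal sketch with made-up toy numbers (not my real data) showing what I think is going on: the scaler learns the min/max from the training data only, so a test value below the training minimum is scaled to a negative number, which is then passed to NMF.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

Xtrain_toy = np.array([[1.0], [2.0], [3.0]])   # made-up training column
Xtest_toy = np.array([[0.5]])                  # below the training minimum of 1.0

scaler = MinMaxScaler().fit(Xtrain_toy)        # learns data_min_=1.0, data_max_=3.0
print(scaler.transform(Xtest_toy))             # [[-0.25]] -> negative input for NMF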
But I am wondering: is this a bug? It seems to me that MinMaxScaler (and scalers in general) should simply be applied to the data before I do the prediction and should not depend on the previously fitted training data. Am I right?
Or, if this is the intended behaviour, how can I correctly use preprocessing scalers with a Pipeline?
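In case it clarifies what I am asking: one workaround I can think of (just a sketch, and I am not sure it is the right approach) would be to clip the scaled values back into [0, 1] with a FunctionTransformer before they reach NMF, assuming that clamping out-of-range test values is acceptable for my problem.

import numpy as np
from sklearn import decomposition, linear_model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer

# Clamp scaled values into [0, 1] so NMF never sees negatives
# (assumes clipping out-of-range test values is acceptable)
pipe = Pipeline(steps=[
    ('scaler_2', MinMaxScaler()),
    ('clip', FunctionTransformer(lambda X: np.clip(X, 0, 1))),
    ('pca', decomposition.NMF(n_components=6)),
    ('logistic', linear_model.LogisticRegression()),
])

Is something like this reasonable, or is there a cleaner, standard way to combine scalers with NMF in a Pipeline?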
Thanks.