21

I am using scikit-learn to build a linear regression model (or any other model) with the following steps:

X_train and Y_train are the training data

  1. Standardize the training data

      X_train = preprocessing.scale(X_train)
    
  2. Fit the model

     model.fit(X_train, Y_train)
    

Once the model is fit with scaled data, how can I predict with new data (either one or more data points at a time) using the fit model?

What I am doing is:

  1. Scale the data

    NewData_Scaled = preprocessing.scale(NewData)
    
  2. Predict the data

    PredictedTarget = model.predict(NewData_Scaled)
    

I think I am missing a transformation function with preprocessing.scale that I could save with the trained model and then apply to the new, unseen data. Any help, please?

ilyas patanam
S.AMEEN

2 Answers

37

Take a look at the scikit-learn preprocessing docs.

You can use the StandardScaler class of the preprocessing module to remember the scaling of your training data so you can apply it to future values.

from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = StandardScaler().fit(X_train)

scaler has calculated the mean and scaling factor to standardize each feature.

>>> scaler.mean_
array([ 1. ...,  0. ...,  0.33...])
>>> scaler.scale_
array([ 0.81...,  0.81...,  1.24...])

To apply it to a dataset:

X_train_scaled = scaler.transform(X_train)
new_data = np.array([[-1.,  1.,  0.]])   # transform expects a 2-D array
new_data_scaled = scaler.transform(new_data)

>>> new_data_scaled
array([[-2.44...,  1.22..., -0.26...]])
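To tie this back to the question: fit the model on the scaled training data and reuse the same fitted scaler on anything you predict later. A minimal sketch continuing the snippet above (LinearRegression and the target values are illustrative, not part of the original post):

from sklearn.linear_model import LinearRegression

Y_train = np.array([1., 2., 3.])                    # illustrative targets, one per training row
model = LinearRegression().fit(X_train_scaled, Y_train)
predicted_target = model.predict(new_data_scaled)   # new data scaled with the SAME fitted scaler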
MertG
ilyas patanam
  • This is a useful answer -- I was wondering if the StandardScaler could also be used on the new unseen data. Great to see that's the case. – Monica Heddneck Aug 08 '17 at 10:10
  • I think that the argument for the `transform` function should be a 2-D array; a 1-D `new_data` would raise an error. – dim Apr 03 '18 at 13:04
  • 6
    but this would only work in the same session, right? is there any way to save the scaler for a later session, like you can save the model/weights of a trained neural network? – J.D Aug 21 '19 at 14:39
  • @J.Dahlgren did you ever manage to save the scaler? I am running into a similar issue. – Regressor Jul 25 '20 at 21:32
  • 1
    @Regressor If it's still relevant for you, here is a solution using the joblib package to export/import the standard scaler object: https://stackoverflow.com/a/53153373/11537601 – Peter Schindler Sep 23 '20 at 01:22
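For completeness, a minimal sketch of the joblib approach mentioned in the last comment (the file name is illustrative):

import joblib

joblib.dump(scaler, 'scaler.joblib')    # persist the fitted scaler to disk
scaler = joblib.load('scaler.joblib')   # restore it in a later session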
6

The above answer is fine when you use the training and test data in a single run. But what if you want to test or infer after training, in a later session?

This should help:

from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
# On new data: only one sample, but the feature count is still four
sc.transform(np.array([[6.5, 1.5, 2.5, 6.5]]))  # compare with the output of the independent block below



std = np.sqrt(sc.var_)    # identical to sc.scale_
np.save('std.npy', std)
np.save('mean.npy', sc.mean_)

This block is independent and can run in a later session:

s = np.load('std.npy')
m = np.load('mean.npy')
(np.array([[6.5, 1.5, 2.5, 6.5]]) - m) / s   # z = (x - u) / s ---> the main formula
# will have the same output as above
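If the scaler from the first block is still in scope, a quick consistency check (a sketch continuing the snippets above) confirms that the manual formula matches the fitted scaler:

x_new = np.array([[6.5, 1.5, 2.5, 6.5]])
print(np.allclose((x_new - m) / s, sc.transform(x_new)))   # True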
Prajot Kuvalekar