
I am learning how to prepare data, build estimators, and check accuracy using a train/test split.

My question is how to prepare the test dataset correctly.

I split my data into a training set and a test set, and as "Hands-On Machine Learning with Scikit-Learn" teaches, I set up a pipeline for my data preparation:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

After training my estimator, I want to use it on the test data to validate my accuracy. However, if I pass my test feature data through the pipeline I defined, isn't it calculating a new median from only the test dataset, and fitting the StandardScaler on the test dataset too, producing different values from those arrived at on the training dataset?

I presume that, for consistency, I want to re-use the parameters learned during training — that is what the estimator was fitted on. For example, if the test set were just a single row (or, in production, a single input I want a prediction for), a median wouldn't even be computable if that single input has a NaN!
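To make the worry concrete, here is a minimal sketch (with made-up numbers) showing that re-fitting on the test set produces entirely different statistics than those learned from the training set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[20.0], [40.0]])

# Fitting on the training set memorises the training mean/std.
train_scaler = StandardScaler().fit(train)
print(train_scaler.mean_)  # [10.]

# Re-fitting on the test set (what I fear my pipeline is doing)
# yields different statistics entirely.
test_scaler = StandardScaler().fit(test)
print(test_scaler.mean_)   # [30.]
```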

What step am I missing?

talkingtoaj
  • The pipeline above will fit the scaling parameters on the training set and use those to transform both the training set and test set, which is the correct way of doing it. – user2653663 Aug 21 '19 at 13:18

1 Answer


You must keep in mind what is happening:

Imagine you have the following dataset (input features):

from sklearn.preprocessing import StandardScaler

data = [[0, 1], [1, 0], [1, 0], [1, 1]]
scaler = StandardScaler()
scaler.fit(data)
print(scaler.mean_)
[0.75 0.5 ]
print(scaler.transform(data))
[[-1.73205081  1.        ]
 [ 0.57735027 -1.        ]
 [ 0.57735027 -1.        ]
 [ 0.57735027  1.        ]]

But now, if you fit on only part of the data and transform other data (what you fear is happening with your test set):

data = [[0, 1], [1, 0]]
data2 = [[1,0], [1,1]]
scaler = StandardScaler()
scaler.fit(data)
print(scaler.mean_)
[0.5 0.5]
print(scaler.transform(data2))
[[ 1. -1.]
 [ 1.  1.]]

But, as the name "test data" suggests, keep the test set completely untouched by fitting: fit the scaler (or the whole pipeline) on the training data only, then call transform, not fit_transform, on the test set, so the statistics learned during training are re-used.
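That pattern can be sketched with the pipeline from the question (the data here is made up for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
X_test = np.array([[1.0, np.nan]])

# fit_transform on the training set learns the medians and scaling parameters.
X_train_prepared = num_pipeline.fit_transform(X_train)

# transform (NOT fit_transform) on the test set re-uses those statistics:
# the NaN is filled with the TRAINING median (0.5), then scaled with the
# training mean/std — even for a single-row input.
X_test_prepared = num_pipeline.transform(X_test)
print(X_test_prepared)
```

This also answers the single-row concern from the question: because the pipeline is already fitted, transforming one row never needs to compute a median from it.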

https://stats.stackexchange.com/questions/267012/difference-between-preprocessing-train-and-test-set-before-and-after-splitting

PV8
  • You definitely shouldn't use any preprocessing on your test data, since that defeats the point of a separate test set. The problem you highlight above is both a problem with how the data is split and the amount of data available. – user2653663 Aug 21 '19 at 13:17