I've been facing with an issue for the past couple of hours. In theory, when we split data for training and testing, we should standardize the data for training independently, so as not to introduce bias, and then after having trained the model do we standardize the test set using the same "parameter" values as for the training set.
So far I've only managed to do it without the pipeline, looking like this:
val training = splitData(0)
val test = splitData(1)
val assemblerTraining = new VectorAssembler()
.setInputCols(training.columns)
.setOutputCol("features")
val standardScaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("normFeatures")
.setWithStd(true)
.setWithMean(true)
val scalerModel = standardScaler.fit(training)
val scaledTrainingData = scalerModel.transform(training)
val scaledTestData = scalerModel.transform(test)
How would I go about implementing this with pipelines? My issue is that if I create a pipeline like so:
val pipelineTraining = new Pipeline()
.setStages(
Array(
assemblerTraining,
standardScaler,
lr
)
)
where lr is a LinearRegression, then there is no way to actually access the scaling model from inside the pipeline.
I've also thought of using an intermediary pipeline to do the scaling like so:
val pipelineScalingModel = new Pipeline()
.setStages(Array(assemblerTraining, standardScaler))
.fit(training)
val pipelineTraining = new Pipeline()
.setStages(Array(pipelineScalingModel,lr))
val scaledTestData = pipelineScalingModel.transform(test)
But I don't know if this is the right way of going about it.
Any suggestions would be greatly appreciated.