
I've been facing an issue for the past couple of hours. In theory, when we split data into training and test sets, we should compute the standardization parameters from the training data alone, so as not to leak information from the test set, and only after fitting the model do we standardize the test set using the same "parameter" values (mean and standard deviation) as for the training set.
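To make the principle concrete, here is a minimal stand-in sketch in plain Scala (no Spark; the object and function names are mine): the mean and standard deviation are fitted on the training split only, and the test split is transformed with those same training-set values.

```scala
object ZScoreDemo {
  // Fit the standardization parameters on the training data only.
  // Uses the unbiased sample standard deviation (divide by n - 1).
  def fitParams(train: Seq[Double]): (Double, Double) = {
    val mean = train.sum / train.size
    val variance = train.map(x => math.pow(x - mean, 2)).sum / (train.size - 1)
    (mean, math.sqrt(variance))
  }

  // Apply the training-set parameters to any split, train or test.
  def transform(data: Seq[Double], mean: Double, std: Double): Seq[Double] =
    data.map(x => (x - mean) / std)

  def main(args: Array[String]): Unit = {
    val train = Seq(1.0, 2.0, 3.0, 4.0)
    val test = Seq(2.0, 5.0)
    val (m, s) = fitParams(train)
    // Note: the test set is scaled with the training mean/std, not its own.
    println(transform(test, m, s).map(v => f"$v%.3f").mkString(","))
  }
}
```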

So far I've only managed to do it without the pipeline, looking like this:

val training = splitData(0)
val test = splitData(1)

val assemblerTraining = new VectorAssembler()
  .setInputCols(training.columns)
  .setOutputCol("features")

val standardScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setWithStd(true)
  .setWithMean(true)

// The assembler has to run first so the "features" column exists.
val assembledTraining = assemblerTraining.transform(training)
val assembledTest = assemblerTraining.transform(test)

// Fit the scaler on the training split only...
val scalerModel = standardScaler.fit(assembledTraining)
val scaledTrainingData = scalerModel.transform(assembledTraining)
// ...and reuse its training-set parameters on the test split.
val scaledTestData = scalerModel.transform(assembledTest)

How would I go about implementing this with pipelines? My issue is that if I create a pipeline like so:

val pipelineTraining = new Pipeline()
  .setStages(Array(assemblerTraining, standardScaler, lr))

where lr is a LinearRegression, then I can't see a way to access the fitted scaling model from inside the pipeline.
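For what it's worth, the fitted scaler is reachable after fitting the whole pipeline: `Pipeline.fit` returns a `PipelineModel`, and its `stages` array holds the fitted transformers in declaration order, so the `StandardScalerModel` can be pulled out by index. A sketch, assuming the three-stage pipeline above (requires a running Spark session):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.StandardScalerModel

// Fit the whole pipeline on the training split only.
val pipelineModel: PipelineModel = pipelineTraining.fit(training)

// stages holds the *fitted* transformers in the order they were declared:
// assembler (index 0), scaler (index 1), regression model (index 2).
val scalerModel = pipelineModel.stages(1).asInstanceOf[StandardScalerModel]
println(scalerModel.mean) // training-set means used for centering
println(scalerModel.std)  // training-set standard deviations

// Transforming the test set reuses those training-set parameters.
val predictions = pipelineModel.transform(test)
```

Because `transform` on the fitted `PipelineModel` only applies the already-fitted stages, the test set is never used to compute scaling parameters.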

I've also thought of using an intermediary pipeline to do the scaling like so:

val pipelineScalingModel = new Pipeline()
 .setStages(Array(assemblerTraining, standardScaler))
 .fit(training)

val pipelineTraining = new Pipeline()
  .setStages(Array(pipelineScalingModel, lr))

val scaledTestData = pipelineScalingModel.transform(test)

But I don't know if this is the right way of going about it.

Any suggestions would be greatly appreciated.

Aron Latis
1 Answer


In case anybody else runs into this issue, this is how I proceeded:

I realized I was not allowed to modify the [forbiddenColumnName] variable. I therefore gave up on using pipelines for that phase, wrote my own standardizing function, and called it for each individual feature, like so:

def standardizeColumn(dfTrain: DataFrame, dfTest: DataFrame, columnName: String): Array[DataFrame] = {
  // Mean and standard deviation are computed from the training split only.
  val withMeanStd = dfTrain.select(mean(col(columnName)), stddev(col(columnName))).collect
  val trainMean = withMeanStd(0).getDouble(0)
  val trainStd = withMeanStd(0).getDouble(1)
  val auxDFTrain = dfTrain.withColumn(columnName, (col(columnName) - trainMean) / trainStd)
  // The test split is centered with the training-set mean and scaled with
  // the training-set standard deviation.
  val auxDFTest = dfTest.withColumn(columnName, (col(columnName) - trainMean) / trainStd)
  Array(auxDFTrain, auxDFTest)
}
// training and test must be declared as vars for this reassignment to compile.
for (columnName <- training.columns) {
  if ((columnName != [forbiddenColumnName]) && (columnExists(training, columnName))) {
    val auxResult = standardizeColumn(training, test, columnName)
    training = auxResult(0)
    test = auxResult(1)
  }
}

[MENTION] My number of variables is very low (~15), so this is not a very lengthy process. I seriously doubt this would be the right way to go about things on much bigger datasets.
