I have a DataFrame of numerical features that I need to standardize. To do so, I am using MinMaxScaler in Python to perform the following operation on all columns of the DataFrame:

X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

Now I would like to do the same in Scala. One way is to use MinMaxScaler in Scala, but it generates an array of features and stores it as a single new column. How can I use MinMaxScaler and still end up with multiple columns of scaled features?

bottaio
user3284804

1 Answer

That's true: MinMaxScaler operates on a Vector column. But you can easily turn the result back into single-value columns and get what you want. Work on one column at a time: scale each one separately, then collect the results into a scaled DataFrame. Here's how to approach that:

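import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

val columns = df.columns

// for each column, wrap it in a one-element vector and apply MinMaxScaler to it
val steps: Array[PipelineStage] = columns.flatMap { column => Array[PipelineStage](
    new VectorAssembler().setInputCols(Array(column)).setOutputCol(s"${column}_feature"),
    new MinMaxScaler().setInputCol(s"${column}_feature").setOutputCol(s"${column}_scaled")
)}

// fit all scalers and apply the transformation
val pipeline = new Pipeline().setStages(steps)
val scaledDf = pipeline.fit(df).transform(df)

// helper UDF: extract the single value from a one-element vector
val headValue = udf((vec: Vector) => vec(0))

// select the scaled values under the original column names
scaledDf
    .select(columns.map(column => headValue(col(s"${column}_scaled")).alias(column)): _*)
    .show()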
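
To sanity-check the output on a small sample, it can help to have a plain-Scala reference for what each scaled column should contain. The sketch below is not part of Spark: `MinMaxReference`, `scaleColumn` and `scaleRows` are hypothetical names used only for illustration. It applies the formula from the question column by column, and it assumes non-empty, rectangular input. For a constant column it returns 0.5, following the convention described in Spark's MinMaxScaler documentation for the default [0, 1] range.

```scala
object MinMaxReference {
  // Scale a single column to [0, 1] using (x - min) / (max - min).
  // Assumption: a constant column (max == min) maps to 0.5, the midpoint
  // of the default target range.
  def scaleColumn(values: Seq[Double]): Seq[Double] = {
    require(values.nonEmpty, "column must not be empty")
    val lo = values.min
    val hi = values.max
    val range = hi - lo
    if (range == 0.0) values.map(_ => 0.5)
    else values.map(v => (v - lo) / range)
  }

  // Scale every column of a row-oriented table independently.
  // Assumption: all rows have the same length.
  def scaleRows(rows: Seq[Seq[Double]]): Seq[Seq[Double]] =
    if (rows.isEmpty) rows
    else rows.transpose.map(scaleColumn).transpose
}
```

For example, `MinMaxReference.scaleRows(Seq(Seq(1.0, 10.0), Seq(3.0, 30.0)))` scales each of the two columns on its own, so the first row becomes all zeros and the second all ones. Comparing a few collected rows of `scaledDf` against this reference is a quick way to confirm the pipeline did what you expected.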
bottaio