
In my Spark jobs, I have to apply transformations to multiple columns for two use cases:

  • Casting columns

In my personal use case, I apply it to a DataFrame of 150 columns:

  def castColumns(inputDf: DataFrame, columnsDefs: Array[(String, DataType)]): DataFrame = {
    columnsDefs.foldLeft(inputDf) {
      case (acc, (name, dataType)) => acc.withColumn(name, inputDf(name).cast(dataType))
    }
  }
  • Transformation

In my personal use case, I use it to perform a calculation on n columns to create n new columns (one input column for one output column, n times):

    ListOfCol.foldLeft(dataFrame) {
      (tmpDf, m) => tmpDf.withColumn(addSuffixToCol(m), UDF(m))
    }

As you can see, I use the foldLeft method together with withColumn. But I recently found out in the documentation that calling withColumn multiple times is discouraged:

this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.

I also read that foldLeft slows down a Spark application because a full plan analysis is performed on every iteration. I think this is true, because since I added foldLeft to my code, Spark takes more time to start a job than before.

Is there a good practice for applying transformations to multiple columns?

Spark version: 2.2, language: Scala

Marwan02

1 Answer


In the case of casting you can achieve what you're looking for with something like the following:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{BooleanType, LongType, StringType}

val df: DataFrame = ???
val cols = Array(("a", StringType), ("b", BooleanType), ("c", LongType))
  .map { case (name, dataType) => col(name).cast(dataType) }
val renamed = df.select(cols: _*)

It uses the method select(cols: Column*): DataFrame (Spark 2.2 docs), which takes a collection of Columns. The map over the (name, type) pairs creates the column expressions.
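The same single-select idea can also be expressed with SQL cast strings and selectExpr instead of Column objects; a minimal sketch, where the (name, SQL type) pairs are made-up stand-ins for your columnsDefs:

```scala
// Hypothetical (column name, SQL type name) pairs standing in for columnsDefs.
val castDefs = Seq(("a", "STRING"), ("b", "BOOLEAN"), ("c", "LONG"))

// Build one expression per column, aliased back to the original name
// so the resulting schema keeps the same column names.
val castExprs = castDefs.map { case (name, sqlType) =>
  s"CAST($name AS $sqlType) AS $name"
}

// In Spark this becomes a single projection:
// val casted = df.selectExpr(castExprs: _*)
```

Like the Column-based version, this generates one projection instead of 150 nested ones.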

In the case of the transformations, it's not entirely clear to me what you're doing, but similar logic can be applied. I've made some best guesses regarding type signatures from your example:

def addSuffixToCol(c: Column): String = ???
def UDF(c: Column): Column = ???
val ListOfCol: List[Column] = ???
val dataFrame: DataFrame = ???

dataFrame.select(ListOfCol.map(c => UDF(c).as(addSuffixToCol(c))): _*)

As above, we apply the transformations to the columns in ListOfCol, which are then used to select from dataFrame.

If you want to keep other columns as well, add them to the select statement, e.g.:

dataFrame.select((Seq(col("foo"), col("bar")) ++ ListOfCol.map(c => UDF(c).as(addSuffixToCol(c)))): _*)
Jarrod Baker
  • What if I don't want to cast all columns of my DataFrame? In your example, columns that are not cast are not in the DF after the cast – Marwan02 Nov 09 '21 at 10:59
  • @Marwan02: I've updated the answer to include that scenario. In short you are just dealing with a select statement, so you can select columns as you normally would. – Jarrod Baker Nov 09 '21 at 11:34
  • Is there no way to do it without specifying all columns like so: col("foo"), col("bar")? I have around 400 columns to add – Marwan02 Nov 09 '21 at 12:40
  • I tried `dataFrame.select(dataFrame.columns.diff(ListOfCol).map(c => col(c)) +: ListOfCol.map(c => UDF(c).as(addSuffixToCol(c))):_*) ` But i get overloaded method value select with alternatives – Marwan02 Nov 09 '21 at 12:51
  • You're almost there: you just need to give the compiler a little more help. Make sure that the two lists are joined in the `select` before you unpack them with as varargs. `dataFrame.select( (dataFrame.columns.diff(ListOfCol).map(col) :+ ListOfCol.map(c => UDF(c).as(addSuffixToCol(c)))):_*)` – Jarrod Baker Nov 09 '21 at 13:28
  • I still have `overloaded method value select with alternatives: [U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] (col: String,cols: String*)org.apache.spark.sql.DataFrame (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame ` – Marwan02 Nov 09 '21 at 14:03
  • The error is telling you that the compiler can't figure out the type of the input. Try and break things down to figure out why the error is occurring: `val myCols: List[Column] = dataFrame.columns.diff(ListOfCol).map(col) :+ ListOfCol.map(c => UDF(c).as(addSuffixToCol(c))); dataFrame.select(myCols:_*)`. If `myCols` has the right type, the next statement should work. – Jarrod Baker Nov 09 '21 at 15:18
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239069/discussion-between-marwan02-and-jarrod-baker). – Marwan02 Nov 10 '21 at 09:01
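For the record, the list-building shape the comments converge on can be sketched without Spark. The key point is that the two lists must be concatenated with `++` before the varargs expansion; `:+` appends the whole second list as a single element, which is what produces the "overloaded method value select" error. The column names and the "_computed" suffix below are hypothetical stand-ins:

```scala
// All columns of the (hypothetical) DataFrame, and the subset to transform.
val allCols       = Seq("foo", "bar", "price", "qty")
val transformCols = Seq("price", "qty")

// Columns to pass through unchanged, and output names for the
// transformed ones (the suffix stands in for addSuffixToCol).
val passThrough = allCols.diff(transformCols)
val transformed = transformCols.map(_ + "_computed")

// `++` concatenates the two sequences element-wise; with `:+` the
// second list would become one nested element and `: _*` would not
// type-check against select(cols: Column*).
val selectList = passThrough ++ transformed

// In Spark the equivalent single projection would be:
// dataFrame.select(
//   (passThrough.map(col) ++
//     transformCols.map(c => UDF(col(c)).as(c + "_computed"))): _*
// )
```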