In my Spark jobs, I have to apply transformations to multiple columns, for two use cases:
- Casting columns
In my personal use case, I use it on a DataFrame with 150 columns.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType

def castColumns(inputDf: DataFrame, columnsDefs: Array[(String, DataType)]): DataFrame = {
  // Fold over the column definitions, casting each column in place
  columnsDefs.foldLeft(inputDf) { case (acc, (colName, dataType)) =>
    acc.withColumn(colName, acc(colName).cast(dataType))
  }
}
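For illustration, a call looks like this (the column names and types here are made up, just to show the shape of the input):

import org.apache.spark.sql.types.{DoubleType, IntegerType}

// Hypothetical columns: cast "age" to integer and "price" to double
val casted = castColumns(inputDf, Array(("age", IntegerType), ("price", DoubleType)))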
- Transforming columns
In my personal use case, I use it to perform a calculation on n columns to create n new columns (one input column for one output column, n times).
import org.apache.spark.sql.functions.col

ListOfCol.foldLeft(dataFrame) { (tmpDf, colName) =>
  // Build the output column name with addSuffixToCol and apply my UDF to the input column
  tmpDf.withColumn(addSuffixToCol(colName), UDF(col(colName)))
}
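Here, addSuffixToCol and UDF are simplified stand-ins for my real code, roughly like this (my real UDF is more involved):

import org.apache.spark.sql.functions.udf

// Stand-in implementations so the snippet above is self-contained
def addSuffixToCol(colName: String): String = colName + "_out"
val UDF = udf((value: Double) => value * 2)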
As you can see, I use the foldLeft method together with withColumn. But I recently found in the documentation that calling withColumn multiple times is discouraged:
this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.
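If I understand that correctly, my casting example would become a single select that builds all the output columns at once, something like this sketch (untested on my real job):

// One select instead of 150 withColumn calls: build every output column up front,
// keeping columns that are not listed in columnsDefs unchanged
val castMap = columnsDefs.toMap
val castedDf = inputDf.select(
  inputDf.columns.map { colName =>
    castMap.get(colName)
      .map(dataType => inputDf(colName).cast(dataType).as(colName))
      .getOrElse(inputDf(colName))
  }: _*
)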
I also read that foldLeft slows down a Spark application because a full plan analysis is performed on every iteration. I think this is true because, since I added foldLeft to my code, Spark takes longer to start a job than before.
Is there a good practice for applying transformations to multiple columns?
Spark version: 2.2
Language: Scala