Does Spark make one pass or multiple passes through the data when multiple withColumn
calls are chained?
For example:
val dfnew = df.withColumn("newCol1", f1(col("a")))
.withColumn("newCol2", f2(col("b")))
.withColumn("newCol3", f3(col("c")))
where

- df is my input DataFrame, containing at least the columns a, b, c
- dfnew is the output DataFrame, with the three new columns newCol1, newCol2, newCol3
- f1, f2, f3 are some user-defined functions or Spark operations on columns, like cast, etc.

In my project I can have even 30 independent withColumn calls chained with foldLeft.
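To make the setup concrete, the foldLeft chaining mentioned above can be sketched like this (the column expressions and the small stand-in DataFrame are hypothetical, just to keep the sketch self-contained):

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("foldLeftSketch").getOrCreate()
import spark.implicits._

// Small stand-in for the real input DataFrame with columns a, b, c.
val df = Seq((1, 2, 3)).toDF("a", "b", "c")

// Hypothetical independent (new column name, expression) pairs;
// the real project has ~30 of these.
val newCols: Seq[(String, Column)] = Seq(
  ("newCol1", col("a").cast("double")),
  ("newCol2", col("b") + 1),
  ("newCol3", col("c").cast("string"))
)

// Chain the withColumn calls with foldLeft, as described above.
val dfnew = newCols.foldLeft(df) { case (acc, (name, expr)) =>
  acc.withColumn(name, expr)
}
```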
Important

I am assuming here that f2 does not depend on the result of f1, and that f3 does not depend on the results of f1 and f2. The functions could be performed in any order. There is no shuffle in any of the functions.
My observations

- All the functions are in the same stage.
- Adding a new withColumn does not increase the execution time in a way that would suggest additional passes through the data.
- I have tested, for example, a single SQLTransformer with a select statement containing all the functions vs. multiple separate SQLTransformers, one per function, and the execution times were similar.
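One more thing I can check alongside these timings (a sketch, with illustrative column expressions) is the query plan via explain(true): Catalyst's CollapseProject optimizer rule merges adjacent projections, so independent chained withColumn calls normally show up as a single Project node:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("planCheck").getOrCreate()

// Non-trivial source so the optimizer keeps an explicit Project node.
val df = spark.range(5).selectExpr("id AS a", "id * 2 AS b", "id * 3 AS c")

val dfnew = df
  .withColumn("newCol1", col("a").cast("double"))
  .withColumn("newCol2", col("b") + 1)
  .withColumn("newCol3", col("c").cast("string"))

// Prints the parsed, analyzed, optimized and physical plans; the three
// withColumn calls appear merged into one projection over the scan.
dfnew.explain(true)
```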
Questions

- Will Spark make one pass or three passes through the data, one for each withColumn?
- Does it depend on the type of the functions f1, f2, f3? UDFs vs. built-in Spark column operations?
- If the functions f1, f2, f3 are inside the same stage, does that mean they are in the same data pass?
- Does the number of passes depend on shuffles within the functions? What if there is no shuffle?
- If I chain the withColumn calls with foldLeft, will that change the number of passes?
- I could do something similar with three SQLTransformers, or with just one SQLTransformer with all three transformations in the same select_statement. How many passes through the data would each of those do?
- Basically, does it matter? Will the execution time be similar for 1 and 3 passes?