I recently started working with PySpark (before that I worked with Pandas). I want to understand how Spark executes and optimizes transformations on a DataFrame.
Can I apply transformations one by one, reassigning the same DataFrame variable?
# Creating a PySpark DataFrame
from datetime import datetime, date
import pandas as pd
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
Like this:
Way of transformations #1:
df1 = df
df1 = df1.withColumn("d", lit("new value"))
df1 = df1.withColumn("b", col("b") + 2)
df1 = df1.select("a","b","d")
Or should I chain all the transformations into a single assignment?
Like this:
Way of transformations #2:
df2 = (
    df.withColumn("d", lit("new value"))
      .withColumn("b", col("b") + 2)
      .select("a", "b", "d")
)
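To see what Spark actually plans to execute for each way, I was thinking of comparing the query plans with explain() (as far as I understand, it prints the physical plan, and explain(True) also shows the logical plans). A minimal sketch:

# If the Catalyst optimizer collapses the chained withColumn/select calls,
# the plans for Way #1 and Way #2 should come out identical.
df1.explain()
df2.explain()

# Extended output also shows the parsed, analyzed and optimized logical plans
df2.explain(True)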
Way #1 is clearer for me to read; it is the same logic I used with Pandas.
But as I understand it, the RDD underlying a Spark DataFrame is immutable.
Does that mean Spark will create a new RDD each time I reassign the variable?
And following that logic, should I use Way #2 to save memory?
Or should I cache the DataFrames (see the sketch at the end)? Or does Spark optimize these steps by itself?
It would also be great to understand how Koalas behaves in this case.
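For reference, here is a minimal sketch of what I mean by caching (just the standard cache()/count()/unpersist() calls; I am not sure it is even needed here):

# Persist the intermediate result only if it will be reused by several actions
df1.cache()
df1.count()      # an action; materializes the cached data
# ... reuse df1 in further queries ...
df1.unpersist()  # release the cached data when it is no longer needed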