I recently started working with PySpark (before that I worked with Pandas). I want to understand how Spark executes and optimizes transformations on a DataFrame.
Can I apply transformations one by one, reassigning the same DataFrame variable?
# Creating a PySpark DataFrame
from datetime import datetime, date
import pandas as pd
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
Like this:
Way of transformations #1:
df1 = df
df1 = df1.withColumn("d", lit("new value"))
df1 = df1.withColumn("b", col("b") + 2)
df1 = df1.select("a","b","d")
Or should I chain all the transformations into a single assignment?
Like this:
Way of transformations #2:
df2 = (
    df.withColumn("d", lit("new value"))
      .withColumn("b", col("b") + 2)
      .select("a", "b", "d")
)
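To see what Spark actually plans to execute for each way, I was thinking of comparing the query plans with explain() (as far as I understand, it prints the physical plan, and explain(True) also shows the logical plans). A minimal sketch:

# If the Catalyst optimizer collapses the chained withColumn/select calls,
# the plans for Way #1 and Way #2 should come out identical.
df1.explain()
df2.explain()

# Extended output also shows the parsed, analyzed and optimized logical plans
df2.explain(True)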
Way #1 is clearer for me to read; it is the same logic I used with Pandas.
But as I understand it, the RDD underlying a Spark DataFrame is immutable.
Does that mean Spark will create a new RDD each time I reassign the variable?
And following that logic, should I use Way #2 to save memory?
Or should I cache the DataFrames (see the sketch at the end)? Or does Spark optimize these steps by itself?
It would also be great to understand how Koalas behaves in this case.
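For reference, here is a minimal sketch of what I mean by caching (just the standard cache()/count()/unpersist() calls; I am not sure it is even needed here):

# Persist the intermediate result only if it will be reused by several actions
df1.cache()
df1.count()      # an action; materializes the cached data
# ... reuse df1 in further queries ...
df1.unpersist()  # release the cached data when it is no longer needed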