I have a raw Spark DataFrame DF. Let's assume a simple scenario in which I want to preprocess and transform it in a few ways, and then finally draw two plots.
My pseudocode would look like this:
DF = spark.read.csv('foo.csv')
DF = preprocess(DF)
result_1 = some_aggregations(DF).toPandas()
result_2 = some_different_aggregations(DF).toPandas()
Now, if I understood correctly (e.g. from the accepted answer here), preprocess(DF) is not actually executed at the DF = preprocess(DF) line: transformations are lazy and only run when an action such as toPandas() is triggered. The disadvantage is that preprocess(DF) then runs twice, once for result_1 and once for result_2, since each call to toPandas() re-evaluates the full lineage starting from the CSV.
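To check this, here is a minimal runnable sketch of what I mean (the column names key/value, the filter inside preprocess, and the two aggregations are just placeholders for my real logic); calling explain() on the two aggregated DataFrames shows the preprocessing steps appearing in both physical plans:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Placeholder preprocessing: drop nulls and add a derived column.
def preprocess(df):
    return (df.filter(F.col("value").isNotNull())
              .withColumn("value_sq", F.col("value") * F.col("value")))

DF = spark.read.csv("foo.csv", header=True, inferSchema=True)
DF = preprocess(DF)  # nothing is computed yet -- transformations are lazy

agg_1 = DF.groupBy("key").agg(F.sum("value").alias("total"))
agg_2 = DF.groupBy("key").agg(F.avg("value_sq").alias("mean_sq"))

# The Filter/Project from preprocess() (and the CSV scan) appear in
# BOTH plans, so each toPandas() re-reads foo.csv and re-applies preprocess().
agg_1.explain()
agg_2.explain()

result_1 = agg_1.toPandas()
result_2 = agg_2.toPandas()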
Is this redundant/multiple computation a generally accepted disadvantage of the Spark way of working, or am I processing my data inefficiently?