
I have a raw Spark DataFrame DF. Let's assume a simple scenario in which I want to preprocess and transform it in a few ways, and then draw two plots from the results.

My pseudocode would look like this:

DF = spark.read.csv('foo.csv')
DF = preprocess(DF)

result_1 = some_aggregations(DF).toPandas()
result_2 = some_different_aggregations(DF).toPandas()

Now, if I understood correctly, e.g. from the accepted answer here, then:

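  • preprocess(DF) is not actually run in line 2, because Spark evaluates transformations lazily and only executes them when an action such as toPandas() is called.
  • The disadvantage is that preprocess(DF) is run twice: once for result_1, and once for result_2.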

Is this disadvantage (redundant/multiple computation) a generally accepted disadvantage of the Spark way? Or am I processing my data inefficiently?
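
To make the concern concrete, here is a minimal sketch of the behaviour described above, written as a toy model in plain Python rather than Spark itself. The LazyFrame class, its methods, and the sample data are hypothetical stand-ins invented for illustration; only the names preprocess, result_1 and result_2 come from the pseudocode. It counts how often preprocess runs with and without an explicit cache step.

```python
# Toy model of lazy evaluation in plain Python. This is NOT the Spark API:
# LazyFrame, its methods and the sample rows are hypothetical stand-ins.

class LazyFrame:
    """Wraps a computation that only runs when collect() is called."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument callable producing the data
        self._use_cache = False
        self._cached = None

    def transform(self, fn):
        # Lazy: build a new frame that will apply fn later, run nothing now.
        return LazyFrame(lambda: fn(self.collect()))

    def cache(self):
        # Ask this frame to keep its result after the first computation.
        self._use_cache = True
        return self

    def collect(self):
        # The "action": this is the only place where work actually happens.
        if not self._use_cache:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return self._cached


calls = {"preprocess": 0}

def preprocess(rows):
    calls["preprocess"] += 1       # count how often the step really executes
    return [r * 2 for r in rows]

def run_pipeline(use_cache):
    calls["preprocess"] = 0
    df = LazyFrame(lambda: [1, 2, 3]).transform(preprocess)
    if use_cache:
        df.cache()
    result_1 = df.transform(sum).collect()   # first action
    result_2 = df.transform(max).collect()   # second action
    return result_1, result_2, calls["preprocess"]

print(run_pipeline(use_cache=False))
print(run_pipeline(use_cache=True))
```

In this toy model, preprocess executes once per action unless the intermediate frame is cached. In PySpark the analogous step would presumably be calling DF.cache() (or DF.persist()) on the preprocessed DataFrame before the two aggregations; whether that pays off in practice likely depends on the size of the data and the available memory.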

Alexander Engelhardt