I have a raw Spark DataFrame DF. Let's assume a simple scenario in which I want to preprocess and transform it in a few ways, and then finally draw two plots.
My pseudocode would look like this:
DF = spark.read.csv('foo.csv')
DF = preprocess(DF)
result_1 = some_aggregations(DF).toPandas()
result_2 = some_different_aggregations(DF).toPandas()
Now, if I understood correctly (e.g. from the accepted answer here), preprocess(DF) is not actually executed at the DF = preprocess(DF) line: transformations are lazy and only run when an action such as toPandas() is triggered. The disadvantage is that preprocess(DF) then runs twice, once for result_1 and once for result_2, since each call to toPandas() re-evaluates the full lineage starting from the CSV.
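To check this, here is a minimal runnable sketch of what I mean (the column names key/value, the filter inside preprocess, and the two aggregations are just placeholders for my real logic); calling explain() on the two aggregated DataFrames shows the preprocessing steps appearing in both physical plans:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Placeholder preprocessing: drop nulls and add a derived column.
def preprocess(df):
    return (df.filter(F.col("value").isNotNull())
              .withColumn("value_sq", F.col("value") * F.col("value")))

DF = spark.read.csv("foo.csv", header=True, inferSchema=True)
DF = preprocess(DF)  # nothing is computed yet -- transformations are lazy

agg_1 = DF.groupBy("key").agg(F.sum("value").alias("total"))
agg_2 = DF.groupBy("key").agg(F.avg("value_sq").alias("mean_sq"))

# The Filter/Project from preprocess() (and the CSV scan) appear in
# BOTH plans, so each toPandas() re-reads foo.csv and re-applies preprocess().
agg_1.explain()
agg_2.explain()

result_1 = agg_1.toPandas()
result_2 = agg_2.toPandas()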
Is this redundant/multiple computation a generally accepted disadvantage of the Spark way of working, or am I processing my data inefficiently?