
I have a metadata-driven transform engine built on Spark. I perform a set of transformations on multiple DataFrames stored in memory in a Scala Map[String, DataFrame]. I encounter a case where I generate a DataFrame using 84 transforms (including withColumn, join, union, etc.). After these, the output DataFrame is used as the input to another set of transformations.

If I write the intermediate result after the first 84 transformations and then load the DataFrame back into the Map from the output path, the next set of transformations works fine. If I do not do this, it takes 30 minutes just to evaluate. A sketch of that workaround is shown below.
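A minimal sketch of that workaround, assuming dfMap is a mutable Map[String, DataFrame], target is the key of the intermediate result, and the Parquet path is purely illustrative:

import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.mutable

val spark = SparkSession.builder().getOrCreate()

// Hypothetical output path for the intermediate result
val intermediatePath = "/tmp/intermediate/" + target

// Materialize the result of the first 84 transformations to storage...
dfMap(target).write.mode("overwrite").parquet(intermediatePath)

// ...then replace the in-memory entry with the freshly loaded copy, so the
// next set of transformations starts from a short lineage instead of
// re-evaluating all 84 transforms.
dfMap(target) = spark.read.parquet(intermediatePath)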

My approach: I tried persisting the DataFrame using:

// Cache the intermediate DataFrame held in the Map
dfMap(target).cache()

But this approach did not help.

1 Answer


So of these 84 transformations, how many are aggregations based on the same key? For example, if you are calculating min, max, etc. on a particular column value such as user_id, then it makes sense to store your original DataFrame after bucketing it by user_id. Likewise for joins: if you are joining on the same key, you can partition by that key. If you don't bucket, then each transformation triggers a Spark shuffle. A sketch follows this paragraph.
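A minimal sketch of bucketing by the aggregation/join key, assuming a source DataFrame with a user_id column; the paths, table names, bucket count, and the amount column are illustrative, not from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{min, max}

val spark = SparkSession.builder().getOrCreate()

// Write the original DataFrame once, bucketed (and sorted) by the key used
// in the later aggregations and joins.
spark.read.parquet("/path/to/events")        // hypothetical input path
  .write
  .bucketBy(16, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")

val events = spark.table("events_bucketed")

// Aggregations keyed on user_id can now reuse the bucketing instead of
// shuffling on every transformation.
val stats = events.groupBy("user_id").agg(min("amount"), max("amount"))

// Joins on the same key benefit in the same way; alternatively, repartition
// both sides by the key before joining.
val users = spark.table("users")             // hypothetical second table
val joined = events.join(users, Seq("user_id"))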

This answer should help - Spark Data set transformation to array

Raghu