I have a metadata-driven transform engine built on Spark. It applies a set of transformations to multiple DataFrames held in memory in a Scala Map[String, DataFrame]. In one case I generate a DataFrame through 84 chained transforms (withColumn, join, union, etc.), and the resulting DataFrame is then used as the input to another set of transformations.
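To make the setup concrete, here is a minimal sketch of how the engine holds and chains DataFrames (the names dfMap, the map keys, the paths, and the two sample transforms are illustrative, not my actual metadata configuration):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("transform-engine").getOrCreate()

// All working DataFrames live in an in-memory map keyed by name.
var dfMap: Map[String, DataFrame] = Map(
  "orders"    -> spark.read.parquet("/data/orders"),      // hypothetical paths
  "customers" -> spark.read.parquet("/data/customers")
)

// Each metadata-driven step reads its inputs from the map and writes
// the result back under a target key; ~84 such steps chain together.
val enriched = dfMap("orders")
  .withColumn("order_year", year(col("order_date")))
  .join(dfMap("customers"), Seq("customer_id"), "left")
dfMap += ("enriched_orders" -> enriched)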
If I write the intermediate result to disk after the first 84 transformations and then load that output back into the Map, the next set of transformations runs fine. If I do not do this, evaluation alone takes 30 minutes.
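The workaround that works looks roughly like this (continuing from the sketch above; the path is a placeholder):

// Write the intermediate result and reload it from the output path,
// so the reloaded DataFrame's plan no longer carries the 84 transforms.
val intermediatePath = "/tmp/intermediate/enriched_orders"  // hypothetical
dfMap("enriched_orders").write.mode("overwrite").parquet(intermediatePath)
dfMap += ("enriched_orders" -> spark.read.parquet(intermediatePath))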
My approach: I tried persisting the DataFrame using:
dfMap(target).cache()
But this approach did not help.
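For reference, my understanding is that cache() is lazy and only marks the DataFrame for caching; something like the sketch below would be needed to actually materialize it (the count() action here is illustrative, it is not part of my engine):

dfMap(target).cache()
dfMap(target).count()  // hypothetical eager action to force the cached data to materialize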