
I am working on an attribution report, and I cache the DataFrame because it is used frequently in later stages of the code. Once I am done with it, should I call unpersist() or unpersist(true)? I understand the basic difference is that they are asynchronous and synchronous, respectively. But does one have more latency than the other? Are there any other implications?

val dfForWeb = loadData(aggregationType, readConfigForWeb).cache()
//some logical code blocks
..
..
..
dfForWeb.unpersist() //This works fine

//Tried using the below and got the same result:

//dfForWeb.unpersist(true) --This also works fine

The actual code is as follows:

val dfForWeb = loadData(aggregationType, readConfigForWeb).cache()
val dfForMobile = loadData(aggregationType, readConfigForMobile).cache()
if (condition) {
  for (item <- GeoAggregationList) {
    processData(dfForWeb) //This dataframe is used for a lot of computations later
  }
} else {
  processData(dfForWeb) //This dataframe is used for a lot of computations later
}
dfForWeb.unpersist()
dfForMobile.unpersist()

I am trying to be cautious because this application needs to scale, and I am unsure whether choosing unpersist() over unpersist(true) will make a significant difference in terms of latency or data loss once the real data is processed. Please advise.

  • 3
    The main difference is that `unpersist(true)` will **block** your computation pipeline at the moment it reaches that instruction until it has finished removing the contents of the RDD/DF/DS, while `unpersist(false)` or just `unpersist()` will simply put a mark on the RDD/DF/DS that tells Spark it can safely delete it whenever it needs to. Note that if Spark needs memory for a computation, it may even delete cached data that is not marked for deletion. I think it is better to just leave it to Spark; it might not even need to delete it, in which case you would lose time by forcing the deletion. – Luis Miguel Mejía Suárez Dec 29 '18 at 03:02
  • 1
    Oh, great. Since it was only a small piece of test data, I was not able to see the latency. Now that it is deployed, unpersist() does help me. Thank you @LuisMiguelMejíaSuárez – Lakshmi Narasimhan M C Jan 02 '19 at 01:46
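
For anyone who wants to measure the difference on their own data, here is a minimal sketch that times both calls and checks that unpersisting does not lose data (an uncached DataFrame is simply recomputed from its lineage). It is not taken from the code above: the local master, the `spark.range` stand-in for `loadData(...)`, and the `timed` helper are assumptions made for illustration. Default arguments have differed between Spark versions and between the RDD and Dataset APIs, so passing `blocking` explicitly may be the safer habit.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Hypothetical helper: runs a block and returns (result, elapsed milliseconds).
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("unpersist-sketch")
      .getOrCreate()

    // Stand-in for loadData(...); any DataFrame would do here.
    val df = spark.range(0L, 1000000L).toDF("id").cache()
    val rowsWhileCached = df.count() // an action, so the cache is materialized

    // Non-blocking: returns without waiting for the cached blocks to be removed.
    val (_, nonBlockingMs) = timed(df.unpersist(blocking = false))
    println(s"unpersist(blocking = false) returned after $nonBlockingMs ms")

    // Cache again so the second measurement has blocks to remove.
    df.cache()
    df.count()

    // Blocking: returns only after the cached blocks have been removed.
    val (_, blockingMs) = timed(df.unpersist(blocking = true))
    println(s"unpersist(blocking = true) returned after $blockingMs ms")

    // The DataFrame is no longer registered as cached...
    assert(df.storageLevel == StorageLevel.NONE)
    // ...but nothing is lost: the same rows are recomputed from the source.
    assert(df.count() == rowsWhileCached)

    spark.stop()
  }
}
```

On a small local dataset both timings will probably be negligible, which matches what was observed with the test data; the gap should only become visible with large cached data spread across many executors.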
