
Is it mandatory to use df.unpersist() after df.cache() to release the cached memory? If I keep my DataFrame in the cache without unpersisting it, the code runs very quickly. However, it takes considerably longer when I call df.unpersist().

Markus

1 Answer


It is not mandatory, but if you have a long job ahead of you and you want to release resources that you no longer need, it is highly recommended that you do so. Spark will manage these for you anyway on an LRU basis; quoting from the docs:

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.
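
For illustration, here is a minimal sketch of the typical lifecycle in Scala (the SparkSession setup and the example DataFrame are assumptions made for the sketch, not part of the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("cache-example").getOrCreate()

// Hypothetical, expensive-to-recompute DataFrame
val df = spark.range(1000000).toDF("id").filter("id % 7 == 0")

df.cache()                    // mark the DataFrame for caching; it is materialized on the first action
df.count()                    // first action: computes the partitions and stores them in the cache
df.agg("id" -> "max").show()  // later actions are served from the cached partitions

df.unpersist()                // release the cached partitions once they are no longer needed
spark.stop()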

The unpersist method does this by default, but note that you can explicitly unpersist asynchronously by calling it with a blocking = false parameter.

df.unpersist(false) // unpersists the DataFrame without blocking

The unpersist method is documented here for Spark 2.3.0.
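
As a small sketch, the flag can also be passed by name, which makes the intent explicit (both calls are available on Dataset/DataFrame in this Spark version):

df.unpersist(blocking = false) // asynchronous: returns immediately, blocks are removed in the background
df.unpersist(blocking = true)  // synchronous: waits until all cached blocks have actually been removed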

stefanobaghino
  • What is the difference between `df.unpersist()` and `df.unpersist(false)` ? – thentangler Jan 25 '21 at 02:28
  • It means to not wait for all blocks to be unpersisted before returning. – stefanobaghino Jan 26 '21 at 07:03
  • What do you mean by blocking? – Louis Yang Feb 12 '21 at 04:26
  • "Blocking" is the term usually applied to any form of call that returns control to the caller only after it has performed the action it is supposed to perform. In contrast, a call "does not block" if it starts the action and returns control to the caller while the action keeps running in the background. Such methods usually either accept a callback to be run when the action completes or return some form of handle that represents the "promise" that the action will be performed (sometimes called a "future"). – stefanobaghino Feb 12 '21 at 07:05
  • lovely explanation – Topde Sep 07 '21 at 12:39
  • Even after writing a PySpark DataFrame to a file, do we still need to unpersist? – haneulkim Jun 09 '23 at 09:26