
We can persist an RDD into memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that if I call the unpersist function myself, I get slower performance.
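
For reference, the pattern I mean is roughly this minimal sketch (the local-mode setup, names and numbers are just for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-example").setMaster("local[*]"))

    // Persist in memory, spilling to disk if it does not fit.
    val numbers = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)

    // Reuse the persisted RDD more than once.
    val total = numbers.sum()
    val evens = numbers.filter(_ % 2 == 0).count()
    println(s"sum=$total evens=$evens")

    // Is this call needed, or will Spark eventually clean it up on its own?
    numbers.unpersist()

    sc.stop()
  }
}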

MetallicPriest
  • If you cache an RDD, you'll have to unpersist it yourself! – eliasah Sep 17 '15 at 18:08
  • @eliasah What happens if the memory is full? Doesn't Spark unpersist the RDDs in LRU fashion? – None Sep 17 '15 at 18:47
  • Nope, it doesn't. Spark isn't a cache system. You might consider using an external cache, or you can persist both on disk and in RAM. Nevertheless, if there is no space on the disk, you'll get a "no space left on device" error. – eliasah Sep 17 '15 at 18:51
  • @eliasah: Interesting, my understanding is exactly the opposite of yours. 1) The RDD will be unpersisted when GC'd. 2) Memory pressure will also push the RDD out of the cache. 3) A big part of Spark is a cache system. I hope you can post your references. I posted an answer regarding the unpersist behavior, so you can also correct me there if I'm wrong. Thanks! – Daniel Darabos Sep 17 '15 at 21:39
  • Yet you can use it in an LRU fashion. What you are saying is also interesting. The issue with both our points of view is the definition of the cache scope. So Spark actually uses its own cache system, but can we really say that it is a cache system? What do you think? – eliasah Sep 17 '15 at 21:42
  • Haha, you're right, it's certainly not advertised as a "cache system". Also I'm not sure if it does LRU or FIFO or what. By the way, I skimmed past your mention of _disk_ earlier. There is a good point there: disk space on the executors (used by RDDs persisted to disk and shuffle files) is cleaned up in response to GC on the driver. There is a danger of the executors filling up the disk before a GC would be triggered on the driver. We call `System.gc()` at certain points to try to avoid this. – Daniel Darabos Sep 17 '15 at 22:07
  • I think that with our comments here we can write a perfect, detailed answer. Would you like to add the points we discussed to your answer for the benefit of the community? :-) – eliasah Sep 17 '15 at 22:10

2 Answers


Yes, Apache Spark will unpersist the RDD when the RDD object is garbage collected.

In RDD.persist you can see:

sc.cleaner.foreach(_.registerRDDForCleanup(this))

This puts a WeakReference to the RDD in a ReferenceQueue leading to ContextCleaner.doCleanupRDD when the RDD is garbage collected. And there:

sc.unpersistRDD(rddId, blocking)

For more context see ContextCleaner in general and the commit that added it.
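
To make the WeakReference/ReferenceQueue mechanism concrete, here is a minimal, Spark-free sketch of the pattern ContextCleaner relies on. The names, the one-second timeout and the explicit `System.gc()` call are only for illustration; Spark's actual cleaner blocks on the queue from a background thread and then calls `sc.unpersistRDD`:

import java.lang.ref.{ReferenceQueue, WeakReference}

object CleanupSketch {
  def main(args: Array[String]): Unit = {
    val queue = new ReferenceQueue[AnyRef]()

    // The weak reference does not keep the "RDD" alive; the JVM enqueues the
    // reference on `queue` once the object has been garbage collected.
    var rdd: AnyRef = new Object
    val ref = new WeakReference[AnyRef](rdd, queue)

    rdd = null   // drop the last strong reference
    System.gc()  // suggest a collection (not guaranteed to run)

    // A cleaner thread would block here and run the cleanup for whatever
    // reference it dequeues, e.g. unpersisting the RDD with that id.
    val collected = queue.remove(1000) // wait up to one second
    if (collected eq ref) println("RDD object collected -> trigger unpersist")
    else println("not collected yet; cleanup would happen on a later GC")
  }
}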

A few things to be aware of when relying on garbage collection for unpersisting RDDs:

  • The RDDs use resources on the executors, while the garbage collection happens on the driver. The RDD will not be automatically unpersisted until there is enough memory pressure on the driver, no matter how full the disk/memory of the executors gets. (If that is a problem for you, see the explicit-unpersist sketch after this list.)
  • You cannot unpersist part of an RDD (some partitions/records). If you build one persisted RDD from another, both will have to fit entirely on the executors at the same time.
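
If you cannot rely on driver-side GC, for example because the executors' storage fills up long before the driver feels any memory pressure, the straightforward workaround is to unpersist explicitly as soon as a cached RDD is no longer needed. A hedged sketch, where `sc`, `input`, `parse` and `toStats` are placeholders for your own context and functions:

import org.apache.spark.storage.StorageLevel

val parsed = input.map(parse).persist(StorageLevel.MEMORY_AND_DISK)

val stats = parsed.map(toStats).persist(StorageLevel.MEMORY_AND_DISK)
stats.count() // materialize `stats` while `parsed` is still cached

// Once `stats` is materialized, `parsed` is no longer needed on the executors.
// Freeing it explicitly avoids waiting for a GC on the driver; blocking = true
// waits until the blocks are actually removed.
parsed.unpersist(blocking = true)
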
Daniel Darabos
  • If you are actually defining a WeakReference on the RDD within your code, how can we say that Spark does it when the RDD is garbage collected? To me, we are asking Spark to do it for us when needed. Nevertheless, I'm voting up the answer for its quality, even if I don't totally agree with the "Yes". – eliasah Sep 17 '15 at 21:45
  • I don't understand your comment and I believe you don't understand my post either :). _"Within your code"_: all the code I linked is inside Spark. Spark does this automatically. If you persist or cache an RDD, it will be unpersisted when the RDD is GC'd. – Daniel Darabos Sep 17 '15 at 21:47
  • But what if you persist it on disk? We both agree that Spark can do that. – eliasah Sep 17 '15 at 21:49
  • What I meant by "within the code" is equivalent to "inside Spark" :) I think what creates confusion is that some may expect Spark to do a complete clean-up of persisted data, but I don't believe that's the case. We have to look more into the details of the garbage collector. – eliasah Sep 17 '15 at 21:52
  • As you can see, Spark calls `sc.unpersistRDD`. If the RDD was persisted to disk, it will be deleted from disk. Simple as that. You shouldn't believe me; read the code. One bit of useful information: a WeakReference with a ReferenceQueue is Java magic that does not block the garbage collection of an object, but generates an "event" when the object is collected. This is how the `unpersist` is triggered on GC. – Daniel Darabos Sep 17 '15 at 21:56
  • I totally believe you, and that's what I meant! :) But I think the asker was asking whether Spark does it on its own, as a smart system. – eliasah Sep 17 '15 at 21:56
  • @DanielDarabos Thanks for the answer. If the RDD is initially computed on 100 executors and we cache it, does that mean those 100 executors are occupied and can no longer be used by any other Spark app? And if we want to allow another app to run tasks on them, do we have to call .unpersist? Why would holding (i.e. caching) some memory resource block other apps from leveraging CPU compute resources? Isn't this an unwise strategy? Thanks – jack Dec 22 '20 at 16:05
  • No, the executors are not blocked by the RDD being cached. Caching only occupies memory and/or disk; it doesn't block computation. – Daniel Darabos Dec 22 '20 at 17:09

As pointed out by @Daniel, Spark will remove partitions from the cache. This will happen once there is no more memory available, and will be done using a least-recently-used algorithm. It is not a smart system, as pointed out by @eliasah.

If you are not caching too many objects, you don't have to worry about it. If you do cache too many objects, JVM garbage-collection times will become excessive, so it is a good idea to unpersist them explicitly in that case.
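
If you want to check what is currently cached before deciding what to unpersist, SparkContext.getPersistentRDDs returns the RDDs that are still marked as persisted. A small sketch, assuming `sc` is your SparkContext:

// List what is currently cached.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"cached RDD $id: storage level ${rdd.getStorageLevel}")
}

// Unpersist everything that is still cached (use with care).
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))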

Jorge