26

If I have an RDD that I no longer need, how do I delete it from memory? Would the following be enough to get this done:

del thisRDD

Thanks!

Ego
  • 585
  • 1
  • 8
  • 18
  • For me the following line of code did the trick: `for (id, rdd) in sc._jsc.getPersistentRDDs().items(): rdd.unpersist()` – drkostas Jun 13 '18 at 13:54
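 
A hedged expansion of that comment's one-liner, for readability (it assumes an active SparkContext named sc; sc._jsc is PySpark's internal handle to the Java SparkContext, so this relies on a private API):

# Unpersist every RDD currently persisted in this SparkContext.
# Assumes an active SparkContext `sc`; sc._jsc is a private attribute,
# so this may change between Spark versions.
for (rdd_id, rdd) in sc._jsc.getPersistentRDDs().items():
    rdd.unpersist()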

4 Answers

16

No, del thisRDD is not enough; it would only delete the Python reference to the RDD. You should call thisRDD.unpersist() to remove the cached data.

For your information, Spark uses a lazy computation model, which means that when you run this code:

>>> thisRDD = sc.parallelize(xrange(10),2).cache()

you won't actually have any data cached yet; the RDD is only marked as 'to be cached' in its execution plan. You can check this as follows:

>>> print thisRDD.toDebugString()
(2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
 |  ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]

But after you call an action on this RDD at least once, it becomes cached:

>>> thisRDD.count()
10
>>> print thisRDD.toDebugString()
(2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
 |       CachedPartitions: 2; MemorySize: 174.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
 |  ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]

You can easily check the persisted data and the level of persistence in the Spark UI at http://<driver_node>:4040/storage. There you will see that del thisRDD does not change the persistence of this RDD, whereas thisRDD.unpersist() does unpersist it. You can still use thisRDD in your code afterwards, but it will no longer be kept in memory and will be recomputed each time it is queried.
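
As a quick sketch of that last point (assuming the thisRDD defined above; the comments describe the expected behaviour rather than exact REPL output):

thisRDD.unpersist()   # drops the cached partitions; the RDD object stays valid
thisRDD.count()       # still returns 10, but is now recomputed from the lineage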

0x0FFF
  • 4,948
  • 3
  • 20
  • 26
15

Short answer: The following code should do the trick:

import gc
del thisRDD
gc.collect()

Explanation:

Even if you are using PySpark, your RDD's data is managed on the Java side, so first let's ask the same question, but for Java instead of Python:

If I'm using Java, and I simply release all references to my RDD, is that sufficient to automatically unpersist it?

For Java, the answer is YES, the RDD will be automatically unpersisted when it is garbage collected, according to this answer. (Apparently that functionality was added to Spark in this PR.)

OK, what happens in Python? If I remove all references to my RDD in Python, does that cause it to be removed on the Java side?

PySpark uses Py4J to send objects from Python to Java and vice-versa. According to the Py4J Memory Model Docs:

Once the object is garbage collected on the Python VM (reference count == 0), the reference is removed on the Java VM

But take note: Removing the Python references to your RDD won't cause it to be immediately deleted. You have to wait for the Python garbage collector to clean up the references. You can read the Py4J explanation for details, where they recommend the following:

A call to gc.collect() also usually works.

OK, now back to your original question:

Would the following be enough to get this done:

del thisRDD

Almost. You should remove the last reference to it (i.e. del thisRDD), and then, if you really need the RDD to be unpersisted immediately**, call gc.collect().

**Well, technically, this will immediately delete the reference on the Java side, but there will be a slight delay until Java's garbage collector actually executes the RDD's finalizer and thereby unpersists the data.
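
Putting the pieces together, a minimal sketch of the whole sequence (it assumes an active SparkContext sc; the getPersistentRDDs() check goes through the private sc._jsc handle, and the Java-side cleanup still happens asynchronously, as noted above):

import gc

thisRDD = sc.parallelize(range(100), 2).cache()
thisRDD.count()                              # materializes the cache

for (rdd_id, _) in sc._jsc.getPersistentRDDs().items():
    print(rdd_id)                            # the cached RDD's id should appear here

del thisRDD                                  # drop the last Python reference
gc.collect()                                 # let Py4J release the Java-side reference

# The cached data actually disappears once the JVM garbage collector
# cleans up the now-unreferenced RDD.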

Community
  • 1
  • 1
Stuart Berg
  • 17,026
  • 12
  • 67
  • 99
6

Short answer: it depends.

According to the PySpark v1.3.0 source code, del thisRDD should be enough for a PipelinedRDD, which is an RDD generated by a Python mapper/reducer:

class PipelinedRDD(RDD):
    # ...
    def __del__(self):
        if self._broadcast:
            self._broadcast.unpersist()
            self._broadcast = None

The RDD class, on the other hand, doesn't have a __del__ method (though it probably should), so you have to call the unpersist method yourself.

Edit: the __del__ method was deleted in this commit.
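
A small sketch of the manual route for plain RDDs (is_cached and getStorageLevel() are part of PySpark's RDD class; this assumes an active SparkContext sc):

rdd = sc.parallelize(range(10)).cache()
rdd.count()                     # materialize the cache

print(rdd.is_cached)            # True
print(rdd.getStorageLevel())    # shows the current storage level

rdd.unpersist()                 # no __del__ hook does this for us
print(rdd.is_cached)            # False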

nonsleepr
  • 801
  • 9
  • 12
  • Could you provide a branch-specific URL? Based on the date, it suggests you linked to 1.1 or 1.2, but it doesn't look like `__del__` existed then, and it certainly doesn't now. – zero323 Apr 01 '16 at 06:04
  • It was v1.3.0; I updated the link. [This](https://github.com/apache/spark/commit/f11288d5272bc18585b8cad4ee3bd59eade7c296?diff=split) commit killed it. – nonsleepr Apr 01 '16 at 17:28
  • Thanks! I see how I missed it - it existed only in 1.2. – zero323 Apr 01 '16 at 17:48
  • If I understand correctly, it doesn't matter that `__del__` is not implemented on the Python side. The RDD will be unpersisted on the Java side when the last reference to it disappears. If all references on the Python side have been deleted, then Py4J ensures that the reference on the Java side disappears, too, and thus the Java RDD finalizer is executed. (I added an answer that explains my reasoning, but it could use a review.) – Stuart Berg Oct 10 '16 at 21:40
3

Just FYI, I would recommend gc.collect() after del (if the RDD takes up a lot of memory).

Tshilidzi Mudau
  • 7,373
  • 6
  • 36
  • 49
joshsuihn
  • 770
  • 1
  • 10
  • 25