
I am confused about the scoping of RDDs in Spark.

According to this thread:

"Whether an RDD is cached or not is part of the mutable state of the RDD object. If you call rdd.cache it will be marked for caching from then on. It does not matter what scope you access it from."

So, if I define a function that creates a new RDD inside it, for example (Python code):

# there is an rdd called "otherRdd" outside the function

def myFun(args):
    ...
    newRdd = otherRdd.map(some_function)
    newRdd.persist()
    ...

Will newRdd live in the global namespace, or is it only visible inside the scope of myFun?

If it is only visible inside myFun, will Spark automatically unpersist newRdd after myFun finishes execution?

panc

1 Answer


Yes, when an RDD is garbage collected, it is automatically unpersisted. So outside of myFun, newRdd will be unpersisted (assuming you do not return it or an RDD derived from it); you can also check this answer.
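Here is a minimal sketch of the two cases, assuming a local SparkContext named sc; my_fun_local and my_fun_returned are hypothetical names, and the cleanup after garbage collection happens asynchronously via Spark's ContextCleaner, so it is not immediate:

from pyspark import SparkContext

sc = SparkContext("local[*]", "persist-scope-sketch")
otherRdd = sc.parallelize(range(100))

def my_fun_local():
    # newRdd is persisted, but the only reference is local to this function.
    # After it returns and the Python object is garbage collected, Spark's
    # ContextCleaner will eventually unpersist the cached data.
    newRdd = otherRdd.map(lambda x: x * 2)
    newRdd.persist()
    return newRdd.count()

def my_fun_returned():
    # Returning the RDD keeps a live reference, so it stays persisted
    # until the caller drops it or calls unpersist() explicitly.
    newRdd = otherRdd.map(lambda x: x * 2)
    newRdd.persist()
    return newRdd

total = my_fun_local()    # cache is released only after garbage collection
kept = my_fun_returned()  # still cached, visible in the Spark UI storage tab
kept.unpersist()          # explicit cleanup is deterministic

If you need the cached data released right away, call unpersist() yourself rather than relying on garbage collection.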

geoalgo