I am confused about RDD scoping in Spark.
According to this thread:
Whether an RDD is cached or not is part of the mutable state of the RDD object. If you call rdd.cache it will be marked for caching from then on. It does not matter what scope you access it from.
So, suppose I define a function that creates a new RDD inside it, for example (Python code):
# there is an RDD called "otherRdd" outside the function
def myFun(args):
    ...
    newRdd = otherRdd.map(some_function)
    newRdd.persist()
    ...
Will the newRdd live in the global namespace, or is it only visible inside the environment of myFun?
If it is only visible inside the environment of myFun, will Spark automatically unpersist the newRdd after myFun finishes execution?
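For context, here is a minimal, self-contained sketch of what I am experimenting with (the SparkContext setup, the lambda, and the is_cached / getStorageLevel() checks are my own additions for illustration, not from the thread quoted above):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-scope-test")

# RDD defined outside the function, as in the snippet above
otherRdd = sc.parallelize(range(10))

def myFun():
    # newRdd is only a local name inside myFun, but persist() records
    # the caching request on the RDD object itself, not on the name
    newRdd = otherRdd.map(lambda x: x * 2)
    newRdd.persist()
    return newRdd  # returned so it can be inspected outside myFun

result = myFun()

# both checks report the persistence state of the RDD object,
# regardless of the scope in which persist() was called
print(result.is_cached)
print(result.getStorageLevel())

When I run this, is_cached reports True on the object returned from the function, which is what prompted my questions about what happens to the cached data once the function-local name goes away.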