
I was wondering what the scope of a cached RDD is. For example:

// Cache an RDD.
rdd.cache
// Pass the RDD to a method of another class.
otherClass.calculate(rdd) // This method performs various actions.
// Pass the RDD to a method of the same class.
calculate(rdd)            // This method also performs some actions.
// Perform an action in the same method where the RDD was cached.
rdd.count

In the example above, will the RDD be materialized only once, so that it does not have to be recomputed? What is the scope of caching?

And should I always unpersist the RDD after I have used it, if I don't need it anymore?

Daniel Darabos
Al Jenssen
  • 1. It will be computed once. 2. It takes resources, so it makes sense to unpersist. – zero323 Oct 02 '15 at 16:03
  • Thank you very much. Just to be clear: if I had not used the cache method, would it be recomputed every time I ran an action? – Al Jenssen Oct 02 '15 at 16:11

1 Answer


Whether an RDD is cached or not is part of the mutable state of the RDD object. If you call rdd.cache it will be marked for caching from then on. It does not matter what scope you access it from.
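To make this concrete, here is a minimal sketch, assuming a local Spark 2.x or later setup. The object name CacheScopeSketch, the helper evaluationsWithCache and the body of calculate are hypothetical and only stand in for the methods in the question. An accumulator counts how often the source elements are evaluated, so you can see whether the second action recomputes them.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CacheScopeSketch {
  // Hypothetical stand-in for otherClass.calculate: it just runs one action.
  def calculate(rdd: RDD[Int]): Long = rdd.map(_ * 2).count()

  // Returns how many source elements were evaluated across two actions.
  def evaluationsWithCache(sc: SparkContext): Long = {
    val evaluations = sc.longAccumulator("evaluations")
    val rdd = sc.parallelize(1 to 100, 4).map { x => evaluations.add(1L); x }

    rdd.cache()      // only marks the RDD for caching; nothing is computed yet
    calculate(rdd)   // first action, in another scope: computes and stores the partitions
    rdd.count()      // second action: should read the cached partitions if they fit in memory

    evaluations.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-scope-sketch")
    val sc = new SparkContext(conf)
    try {
      println(s"element evaluations: ${evaluationsWithCache(sc)}")
    } finally {
      sc.stop()
    }
  }
}
```

If you comment out the rdd.cache() line, the count should roughly double, because each action then recomputes the lineage. Note that accumulator updates made inside transformations can be applied more than once when tasks are retried, so treat the number as a diagnostic rather than an exact guarantee.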

As to whether you should unpersist the RDD: the RDD will be unpersisted automatically once the RDD object is garbage collected. It is for you to decide whether this is soon enough. The cache takes up space on the executors, while the automatic cleanup happens in response to memory pressure on the driver, so a full executor cache does not by itself trigger it.
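If you would rather not wait for garbage collection, you can release the cache yourself. The following is a sketch only: UnpersistSketch and withCached are hypothetical names, not part of Spark. It caches an RDD for the duration of a block and then calls unpersist explicitly, even if the block throws.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnpersistSketch {
  // Hypothetical helper: cache the RDD while `body` runs, then free the executor storage.
  def withCached[T, A](rdd: RDD[T])(body: RDD[T] => A): A = {
    rdd.cache()
    try body(rdd)
    finally rdd.unpersist(blocking = true) // wait until the cached blocks are removed
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("unpersist-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, 4)
      // Both actions run against the cached RDD; the cache is dropped afterwards.
      val (total, evens) = withCached(rdd) { r =>
        (r.count(), r.filter(_ % 2 == 0).count())
      }
      println(s"total=$total evens=$evens level=${rdd.getStorageLevel.description}")
    } finally {
      sc.stop()
    }
  }
}
```

The blocking flag is passed explicitly because its default value differs between Spark versions.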

Daniel Darabos