
I am using the command below to get the list of registered temp tables: `sqlContext.sql("show tables").collect().foreach(println)`

Is there any similar command to get the list of available RDDs?

Here is my requirement (using Scala):

1. Create some RDDs on the fly
2. Identify the list of available RDDs
3. Remove/delete/clear the unwanted RDDs and move forward
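The steps above might look something like the sketch below. It assumes `sc` is the active `SparkContext` and `RDD1` is an existing pair RDD (as in the snippet further down); note that `SparkContext.getPersistentRDDs` is a developer API that only lists RDDs which were explicitly persisted, not every RDD ever defined:

```scala
import org.apache.spark.storage.StorageLevel

// (1) create an RDD on the fly, name it, and cache it
val tempRDD1 = RDD1.reduceByKey((acc, value) => acc + value)
tempRDD1.setName("tempRDD1")
tempRDD1.persist(StorageLevel.MEMORY_ONLY)
tempRDD1.count()  // an action, so the cache is actually populated

// (2) list the RDDs Spark is currently persisting
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"$id -> ${rdd.name}: ${rdd.getStorageLevel}")
}

// (3) release the ones no longer needed
sc.getPersistentRDDs
  .filter { case (_, rdd) => rdd.name == "tempRDD1" }
  .foreach { case (_, rdd) => rdd.unpersist(blocking = true) }
```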

How to delete an RDD in PySpark for the purpose of releasing resources?

An additional note: I went through this link, but it doesn't answer all my questions. I also tried the code below, but I don't see any difference before and after `unpersist`, so I'm not sure how to confirm that my RDD's memory has been released:

```scala
val tempRDD1 = RDD1.reduceByKey((acc, value) => acc + value)
tempRDD1.collect.foreach(println)
tempRDD1.unpersist()
tempRDD1.collect.foreach(println)
```
    Possible duplicate of [Spark list all cached RDD names](http://stackoverflow.com/questions/38508577/spark-list-all-cached-rdd-names) – zero323 May 01 '17 at 20:40
  • The link you mentioned has some of the points I am looking for, but none of the answers help me. One answer says "We noticed that actually it isn't persisted", which doesn't work; the other answer says "it is not yet implemented in python", but I am looking for Scala. – saranvisa May 01 '17 at 20:54

1 Answer


An RDD's data is not stored until it is 1. persisted (cached) and 2. an action forces the preceding transformations to run. If either of these does not happen, no data is stored. Any RDD that appears to be "created" is just an execution plan for producing the data if it is needed later. This model is called lazy evaluation.

In your example, no RDD is ever cached, so no data is ever stored in memory, and the `unpersist` call has no effect.
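To make `unpersist` observable, the RDD has to be cached first. A sketch (assuming `sc` and `RDD1` from the question): after a `persist` plus an action, the RDD shows up in `sc.getPersistentRDDs` and reports a real storage level; after `unpersist(blocking = true)` the cached blocks are freed and it disappears again.

```scala
val tempRDD1 = RDD1.reduceByKey((acc, value) => acc + value)
tempRDD1.persist()                  // mark for caching (no data stored yet)
tempRDD1.collect()                  // action: materializes and caches the data

println(tempRDD1.getStorageLevel)   // a real level, e.g. memory, deserialized
println(sc.getPersistentRDDs.size)  // tempRDD1 is now listed here

tempRDD1.unpersist(blocking = true) // actually frees the cached blocks
println(tempRDD1.getStorageLevel)   // back to no storage
println(sc.getPersistentRDDs.size)  // tempRDD1 is gone
```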

  • In fact, I didn't share my full code, but if you look more closely you can see that the code has multiple RDDs; tempRDD1 is created from RDD1, and an action was already applied on RDD1 – saranvisa May 01 '17 at 21:07
  • Were any of those RDDs explicitly cached? – David May 02 '17 at 12:53
  • Not cached; they are temporary RDDs and not required after a few steps – saranvisa May 02 '17 at 23:31
  • If they aren't cached, then no data is stored for them. `tempRDD1` in your example doesn't need to be "released from memory" because it was never stored in memory. If you run `tempRDD1.collect.foreach(println)` 10 times in a row, it will redo all the calculations (creating RDD1, the reduceByKey, and the collect) 10 times. No caching = no data stored in memory = no need to find and unpersist all the temp RDDs – David May 03 '17 at 15:00