
I'm using Spark 1.2.0 and haven't configured SPARK_LOCAL_DIRS explicitly, so I'm assuming that persisted RDDs would go to /tmp. I'm trying to persist an RDD using the following code:

    import org.apache.spark.storage.StorageLevel

    val inputRDD = sc.parallelize(List(1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 19, 22, 21, 25, 34, 56, 4, 32, 56, 70))
    val result = inputRDD.map(x => x * x)
    println("Result count is: " + result.count())
    result.persist(StorageLevel.DISK_ONLY)
    println(result.collect().mkString(",,"))
    println("Result count is: " + result.count())

I force a count() on my RDD before and after persist just to be sure, but I still don't see any new files or directories in /tmp. The only directory that changes when I run my code is hsperfdata...., which I know is for JVM perf data.
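
For reference, here is a quick diagnostic I can run (a sketch; spark.local.dir is the conf-level counterpart of SPARK_LOCAL_DIRS, and /tmp is its documented default):

    // Diagnostic sketch: print the directory Spark uses for shuffle and
    // disk-persisted data; falls back to /tmp when nothing is configured.
    println(sc.getConf.get("spark.local.dir", "/tmp"))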

Where are my persisted RDDs going?

Jimit Raithatha
  • What's your cluster configuration? – eliasah Oct 18 '15 at 12:52
  • I haven't configured a cluster per se. Using IntelliJ for Scala and have just linked Spark libraries to my project. I'm still learning so haven't gotten around to configuring the spark-env.sh file yet. – Jimit Raithatha Oct 18 '15 at 17:55
  • Start reading the official documentation! I believe that you have some basic concept comprehension missing. – eliasah Oct 18 '15 at 18:31

1 Answer


From the Scaladoc of RDD.persist():

Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. Local checkpointing is an exception.

So you've called result.count() on the line above result.persist(); by then Spark has already computed the RDD with the default storage level. Remove that first count op and try again.
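
A minimal sketch of the corrected ordering, reusing the question's RDD (the key change is setting the storage level before the first action):

    import org.apache.spark.storage.StorageLevel

    val inputRDD = sc.parallelize(List(1, 2, 3, 3, 4, 5, 6, 7, 8, 9))
    val result = inputRDD.map(x => x * x)
    // Set the storage level BEFORE any action runs, so the first
    // count() materializes the blocks to disk under spark.local.dir.
    result.persist(StorageLevel.DISK_ONLY)
    println("Result count is: " + result.count())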

mehmetminanc
  • I found the problem. Since I was using an IDE, my SparkContext was getting destroyed at the end of the program, cleaning up all its data with it. After I tried persisting on the command line (keeping the context alive), I could see the RDD. – Jimit Raithatha Oct 24 '15 at 20:57
  • I don't think persisted RDDs are expected to last beyond your program's run; in a REPL that makes sense, but when running Scala in an IDE it makes sense that they're gone when the program is done. Check the logs and you will probably see it cleaning up at the end. You need to export to a text file (or HDFS, etc.); see the sketch below. – JimLohse Feb 24 '16 at 20:29
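
A minimal sketch of that last suggestion, assuming a local output path (the path is illustrative): write the results to durable storage rather than relying on persist() to outlive the SparkContext:

    // Save the RDD to durable storage so the data survives the end of
    // the program; the output path below is illustrative.
    result.saveAsTextFile("/tmp/squared-output")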