
I am running a Spark application that uses StorageLevel.OFF_HEAP to persist an RDD (my Tachyon and Spark are both in local mode).

like this:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("FILE_PATH/test-lines-1")
val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).persist(StorageLevel.OFF_HEAP)
val counts = words.reduceByKey(_ + _)
counts.collect.foreach(println)
...
sc.stop

When the persist is done, I can see my OFF_HEAP files at localhost:19999 (Tachyon's web UI); this is what I expected.

But after the Spark application ends (sc.stop, while Tachyon is still running), my blocks (the OFF_HEAP RDD) are removed, and I can no longer find my files at localhost:19999. This is not what I want. I thought these files belonged to Tachyon (not Spark) after the persist() call, so they should not be removed.

So, who deleted my files, and when? Is this the normal behavior?

dtolnay
zeromem

1 Answer


You are looking for

  saveAs[Text|Parquet|NewHadoopAPI]File()

These are the real "persistence" methods you need.
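A minimal sketch of the asker's job rewritten to write its result durably instead of (or in addition to) caching it. It assumes the same `sc` and input path from the question; the output directory "FILE_PATH/counts-out" is a hypothetical name:

```scala
// Sketch, assuming a running SparkContext `sc` and the question's input file.
val lines  = sc.textFile("FILE_PATH/test-lines-1")
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// saveAsTextFile writes one part-* file per partition to the target directory.
// Unlike persist(), this output survives sc.stop.
counts.saveAsTextFile("FILE_PATH/counts-out")
```

After sc.stop, the part-* files remain on disk (or in Tachyon, if you pass a tachyon:// URI), whereas cached blocks are released with the application.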

Instead,

  persist()

is used for intermediate storage of RDDs: when the Spark process ends, they are removed. From the source code comments:

  • Set this RDD's storage level to persist its values across operations after the first time it is computed.

The important phrase is "across operations": the cached data is kept only as part of processing within the application, not afterwards.
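To illustrate what "across operations" buys you: within a single application, persist() lets several actions reuse an RDD without recomputing it. A sketch assuming a running SparkContext `sc`; the two actions are illustrative:

```scala
import org.apache.spark.storage.StorageLevel

// Sketch, assuming a running SparkContext `sc`.
val words = sc.textFile("FILE_PATH/test-lines-1")
  .flatMap(_.split(" "))
  .persist(StorageLevel.OFF_HEAP) // cached for reuse *within* this application

val distinctWords = words.distinct().count() // first action: computes and caches `words`
val totalWords    = words.count()            // second action: reads the cached blocks

words.unpersist() // blocks are dropped here at the latest, or when the application ends
```

The cache is an optimization scoped to the application's lifetime, which is why the blocks disappear after sc.stop even though Tachyon itself keeps running.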

WestCoastProjects