
I have a simple Spark program in which I read a file using SparkContext.textFile() and then perform some operations on that data, and I am using spark-jobserver to get the output. In the code I cache the data, but after the job ends and I submit the same Spark job again, it does not reuse the file that is already in the cache. So the file is loaded every time, which takes more time.

Sample code:

val sc = new SparkContext("local", "test")
val data = sc.textFile("path/to/file.txt").cache() // mark the RDD to be cached
val lines = data.count()                           // action that materializes (and caches) the RDD
println(lines)

Since I am reading the same file, the second run should take the data from the cache, but it does not.

Is there any way to share the cached data among multiple Spark jobs?


1 Answer


Yes - by calling persist/cache on the RDD you get, and then submitting additional jobs on the same SparkContext.
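
A minimal sketch of the idea, assuming a single long-running context (the file path and the second action are just placeholders):

import org.apache.spark.SparkContext

object CacheReuseSketch {
  def main(args: Array[String]): Unit = {
    // One long-running context; cached data lives only as long as this context does.
    val sc = new SparkContext("local[*]", "test")

    val data = sc.textFile("path/to/file.txt").cache()

    // First action: reads the file from disk and populates the cache.
    println(data.count())

    // Later jobs submitted on the same context reuse the cached partitions
    // instead of re-reading the file.
    println(data.filter(_.nonEmpty).count())
  }
}

If you create a new SparkContext per run, the cache is gone with the old context, which is why each run re-reads the file.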

  • But if I run the job again, the SparkContext will be different, so the data cached in the previous job will not be available. – Gourav Jul 27 '15 at 10:29
  • Right, you have to keep the context running - for example, you can use spark-jobserver https://github.com/spark-jobserver/spark-jobserver to have long-running contexts across multiple jobs. – Arnon Rotem-Gal-Oz Jul 27 '15 at 10:37
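
For the jobserver approach mentioned in the comment above, here is a rough sketch of what such a job could look like, assuming the classic spark.jobserver API with the NamedRddSupport trait; the exact trait and method names ("CachedCountJob", the "file-data" RDD name, the file path) are illustrative and may differ between jobserver versions:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object CachedCountJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // getOrElseCreate returns the RDD cached under this name by an earlier job
    // on the same (long-running) context, or builds and caches it on first use.
    val data = namedRdds.getOrElseCreate("file-data",
      sc.textFile("path/to/file.txt").cache())
    data.count()
  }
}

Submitting this job twice against the same pre-created context should load the file only on the first run and hit the cached RDD on the second.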