
I have a simple Spark program in which I read a file using SparkContext.textFile() and then perform some operations on that data, and I am using spark-jobserver to get the output. In the code I cache the data, but after the job ends and I submit the same Spark job again, it does not reuse the file that is already in the cache. So the file is loaded every time, which takes more time.

Sample code:

val sc = new SparkContext("local", "test")
val data = sc.textFile("path/to/file.txt").cache() // mark the RDD to be cached
val lines = data.count()                           // action that materializes (and caches) the RDD
println(lines)

Since I am reading the same file, the second run should take the data from the cache, but it does not.

Is there any way to share the cached data among multiple Spark jobs?


1 Answer


Yes - by calling persist/cache on the RDD you get, and then submitting additional jobs on the same SparkContext.
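
A minimal sketch of the idea, assuming a single long-running context (the file path and the second action are just placeholders):

import org.apache.spark.SparkContext

object CacheReuseSketch {
  def main(args: Array[String]): Unit = {
    // One long-running context; cached data lives only as long as this context does.
    val sc = new SparkContext("local[*]", "test")

    val data = sc.textFile("path/to/file.txt").cache()

    // First action: reads the file from disk and populates the cache.
    println(data.count())

    // Later jobs submitted on the same context reuse the cached partitions
    // instead of re-reading the file.
    println(data.filter(_.nonEmpty).count())
  }
}

If you create a new SparkContext per run, the cache is gone with the old context, which is why each run re-reads the file.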

  • But if I run the job again, the SparkContext will be different, so the data cached in the previous job will not be available. – Gourav Jul 27 '15 at 10:29
  • Right, you have to keep the context running - for example, you can use spark-jobserver https://github.com/spark-jobserver/spark-jobserver to have long-running contexts across multiple jobs. – Arnon Rotem-Gal-Oz Jul 27 '15 at 10:37
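
For the jobserver approach mentioned in the comment above, here is a rough sketch of what such a job could look like, assuming the classic spark.jobserver API with the NamedRddSupport trait; the exact trait and method names ("CachedCountJob", the "file-data" RDD name, the file path) are illustrative and may differ between jobserver versions:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object CachedCountJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // getOrElseCreate returns the RDD cached under this name by an earlier job
    // on the same (long-running) context, or builds and caches it on first use.
    val data = namedRdds.getOrElseCreate("file-data",
      sc.textFile("path/to/file.txt").cache())
    data.count()
  }
}

Submitting this job twice against the same pre-created context should load the file only on the first run and hit the cached RDD on the second.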