How to cache a Spark streaming Dataset

Question

I have a Spark streaming Dataset<Row> that streams a directory of CSV files. I have two questions:

How do I cache the streaming Dataset?
How do I submit my Spark streaming job to YARN so that it runs indefinitely, until the user stops it manually?

Please read [Under what circumstances may I add “urgent” or other similar phrases to my question, in order to obtain faster answers?](//meta.stackoverflow.com/q/326569) - the summary is that this is not an ideal way to address volunteers, and is probably counterproductive to obtaining answers. Please refrain from adding this to your questions. — halfer, Nov 28 '18 at 20:51

Harjeet Kumar · Answer 1 · 2018-11-30T10:45:47.460


You can cache your streaming data using the cache or persist function, as follows:

 dstream.persist()

Do this only if you use the stream more than once. For the reduceByWindow and reduceByKeyAndWindow operations, this is done automatically.

To keep your streaming job running, you need to create a Spark StreamingContext, start it, and then wait for it to terminate:

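import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1)) // 1-second batch interval
// your logic goes here
ssc.start()
ssc.awaitTermination() // block until the job is stopped or fails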
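Putting the two pieces together, here is a minimal sketch of a DStream job that persists a stream because it is used twice, then runs until it is stopped. The object name, the input directory, the 10-second batch interval, and the parseLine helper are all assumptions made up for illustration, not something from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CachedDStreamApp {
  // Hypothetical helper: split a CSV line naively on commas, keeping trailing empty fields.
  def parseLine(line: String): Array[String] = line.split(",", -1)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CachedDStreamApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical input directory; new files that appear here are picked up each batch.
    val lines = ssc.textFileStream("/path/to/csv/dir")
    val fields = lines.map(parseLine)

    // Persist because the same stream feeds two separate outputs below.
    fields.persist()

    // Output 1: number of rows in each batch.
    fields.count().print()
    // Output 2: widest row (in columns) in each batch.
    fields.map(_.length).reduce((a, b) => math.max(a, b)).print()

    ssc.start()
    ssc.awaitTermination() // keeps the driver alive until the job is stopped or fails
  }
}
```

Without the persist() call, each of the two outputs would likely recompute the parsed stream from the input files for every batch.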

If your job is killed after running for a few hours (and your cluster is Kerberized), check whether the Kerberos tickets are expiring. This can cause long-running jobs to fail.

Edit: Note: if you are talking specifically about Structured Streaming, cache on streaming Datasets is not supported. See the post Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?
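One possible workaround, assuming Spark 2.4 or later where foreachBatch is available, is to cache each micro-batch instead of the streaming Dataset itself, since every micro-batch is handed over as an ordinary DataFrame. The sketch below is only an illustration: the schema, column names, paths, and object name are hypothetical and need to be replaced with your own.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}

object CsvStreamWithBatchCache {
  // Hypothetical schema; file-based streaming sources need one supplied up front.
  val schema: StructType = new StructType()
    .add("id", StringType)
    .add("value", StringType)

  // Cache the micro-batch, use it twice, then release it.
  // Declared as a typed function value so the Scala overload of foreachBatch is chosen.
  val handleBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
    batch.persist()
    val rows = batch.count()
    val distinctIds = batch.select("id").distinct().count()
    println(s"batch $batchId: rows=$rows, distinct ids=$distinctIds")
    batch.unpersist()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CsvStreamWithBatchCache").getOrCreate()

    // Hypothetical input directory of CSV files.
    val stream = spark.readStream
      .schema(schema)
      .option("header", "true")
      .csv("/path/to/csv/dir")

    val query = stream.writeStream
      .foreachBatch(handleBatch)
      .option("checkpointLocation", "/path/to/checkpoint") // hypothetical path
      .start()

    query.awaitTermination() // runs until the query is stopped or fails
  }
}
```

Calling cache() or persist() directly on `stream` would still fail as described in the linked post; only the per-batch DataFrame is cached here.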

edited Nov 30 '18 at 10:45

answered Nov 28 '18 at 07:04

Harjeet Kumar


In my Spark streaming job, I have a streaming dataset of CSV files, of type Dataset (not DStream). Maybe we can persist a DStream, but I am unable to cache() the streaming dataset, i.e. the Dataset. So how do I cache a streaming Dataset? – Mahadevappa M Utagi Nov 29 '18 at 10:08
The cache operation is not yet supported in Structured Streaming. You should have a look at this post, which discusses the same thing: https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries – Harjeet Kumar Nov 30 '18 at 10:44
