
I have a Spark Streaming process which reads data from Kafka into a DStream.

In my pipeline I do the following twice (one after another):

DStream.foreachRDD(transformations on the RDD and insert into a destination).

(Each time I apply different processing and insert the data into a different destination; roughly as in the sketch below.)
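Roughly, the setup looks like this (a minimal sketch assuming the Spark 1.x direct Kafka API; the broker, topic, and the count() stand-ins are just placeholders for my real transformations and destinations):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("two-destination-pipeline")
val ssc  = new StreamingContext(conf, Seconds(10))

// Placeholder broker list and topic name.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics      = Set("events")

val kafkaDStream =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

// Path 1: one set of transformations, then an action standing in for the insert into destination A.
kafkaDStream.foreachRDD { rdd =>
  rdd.map(_._2).count()              // real code inserts into destination A here
}

// Path 2: a different set of transformations, standing in for destination B.
kafkaDStream.foreachRDD { rdd =>
  rdd.filter(_._2.nonEmpty).count()  // real code inserts into destination B here
}

ssc.start()
ssc.awaitTermination()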

I was wondering how DStream.cache, called right after I read the data from Kafka, would work. Is it possible to do it?

As it stands, does the process actually read the data from Kafka twice?

Please keep in mind that it is not possible to merge the two foreachRDDs into one (the two paths are quite different, and there are stateful transformations that need to be applied on the DStream...).

Thanks for your help

Srdjan Nikitovic
  • DStream.cache will work. It caches the stream the first time it sees an action, and for any subsequent action on the DStream it uses the cache. – Knight71 Jun 08 '16 at 04:37
  • @Knight71 Do I also need to call DStream.unpersist(true), the same as when caching RDDs, at the end when the DStream is no longer necessary? – Srdjan Nikitovic Jun 08 '16 at 08:22
  • DStream data will be cleared automatically after all operations; this is decided by Spark Streaming based on the transformations. – Knight71 Jun 08 '16 at 08:44
  • @Knight71, thank you for your answer. If I do not use DStream.cache, does it mean that Spark will read the data from Kafka twice (based on the use case I specified in the question)? – Srdjan Nikitovic Jun 08 '16 at 08:55
  • Another useful link: http://stackoverflow.com/questions/30253897/does-caching-in-spark-streaming-increase-performance – Stanislav May 08 '17 at 20:14

1 Answer


There are two options:

  • Use DStream.cache() to mark the underlying RDDs as cached. Spark Streaming will take care of unpersisting the RDDs after a timeout, controlled by the spark.cleaner.ttl configuration (a minimal sketch of this option appears at the end of this answer).

  • Use an additional foreachRDD to apply the side-effecting cache() and unpersist(false) operations to the RDDs in the DStream. For example:

val kafkaDStream = ???
val targetDStream = kafkaDStream
                       .transformation(...)
                       .transformation(...)
                       ...
// Right before the lineage fork, mark the underlying RDDs as cached:
targetDStream.foreachRDD{rdd => rdd.cache()}
targetDStream.foreachRDD{do stuff 1}
targetDStream.foreachRDD{do stuff 2}
targetDStream.foreachRDD{rdd => rdd.unpersist(false)}

Note that you could incorporate the cache as the first statement of do stuff 1 if that's an option.

I prefer this option because it gives me fine-grained control over the cache lifecycle and lets me clean things up as soon as they are no longer needed, instead of depending on a TTL.
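For completeness, a minimal sketch of the first option, assuming the same kafkaDStream and two outputs as in the question (the count() calls are stand-ins for the real processing):

// Mark every RDD produced by this DStream as cached.
kafkaDStream.cache()

// Both outputs now reuse the cached data instead of re-reading it from Kafka.
kafkaDStream.foreachRDD { rdd => rdd.count() }   // stand-in for "do stuff 1"
kafkaDStream.foreachRDD { rdd => rdd.count() }   // stand-in for "do stuff 2"

// No explicit unpersist is needed: Spark Streaming clears the cached RDDs itself
// (governed by spark.cleaner.ttl on older releases).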

maasg