
I have a Spark Streaming process which reads data from Kafka into a DStream.

In my pipeline I do the following twice (one after another):

DStream.foreachRDD(transformations on the RDD and insert into a destination).

(Each time I apply different processing and insert the data into a different destination; roughly as in the sketch below.)
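Roughly, the setup looks like this (a minimal sketch assuming the Spark 1.x direct Kafka API; the broker, topic, and the count() stand-ins are just placeholders for my real transformations and destinations):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("two-destination-pipeline")
val ssc  = new StreamingContext(conf, Seconds(10))

// Placeholder broker list and topic name.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics      = Set("events")

val kafkaDStream =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

// Path 1: one set of transformations, then an action standing in for the insert into destination A.
kafkaDStream.foreachRDD { rdd =>
  rdd.map(_._2).count()              // real code inserts into destination A here
}

// Path 2: a different set of transformations, standing in for destination B.
kafkaDStream.foreachRDD { rdd =>
  rdd.filter(_._2.nonEmpty).count()  // real code inserts into destination B here
}

ssc.start()
ssc.awaitTermination()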

I was wondering how DStream.cache, called right after I read the data from Kafka, would work. Is it possible to do it?

As it stands, does the process actually read the data from Kafka twice?

Please keep in mind that it is not possible to merge the two foreachRDDs into one (the two paths are quite different, and there are stateful transformations that need to be applied on the DStream...).

Thanks for your help

Srdjan Nikitovic
  • DStream.cache will work. It caches the stream the first time it sees an action, and for any subsequent action on the DStream it uses the cache. – Knight71 Jun 08 '16 at 04:37
  • @Knight71 Do I also need to call DStream.unpersist(true), the same as when caching RDDs, at the end when the DStream is no longer necessary? – Srdjan Nikitovic Jun 08 '16 at 08:22
  • DStream data will be cleared automatically after all operations; this is decided by Spark Streaming based on the transformations. – Knight71 Jun 08 '16 at 08:44
  • @Knight71, thank you for your answer. If I do not use DStream.cache, does it mean that Spark will read the data from Kafka twice (based on the use case I specified in the question)? – Srdjan Nikitovic Jun 08 '16 at 08:55
  • Another useful link: http://stackoverflow.com/questions/30253897/does-caching-in-spark-streaming-increase-performance – Stanislav May 08 '17 at 20:14

1 Answer


There are two options:

  • Use DStream.cache() to mark the underlying RDDs as cached. Spark Streaming will take care of unpersisting the RDDs after a timeout, controlled by the spark.cleaner.ttl configuration (a minimal sketch of this option appears at the end of this answer).

  • Use an additional foreachRDD to apply the side-effecting cache() and unpersist(false) operations to the RDDs in the DStream. For example:

val kafkaDStream = ???
val targetDStream = kafkaDStream
                       .transformation(...)
                       .transformation(...)
                       ...
// Right before the lineage fork, mark the underlying RDDs as cached:
targetDStream.foreachRDD{rdd => rdd.cache()}
targetDStream.foreachRDD{do stuff 1}
targetDStream.foreachRDD{do stuff 2}
targetDStream.foreachRDD{rdd => rdd.unpersist(false)}

Note that you could incorporate the cache as the first statement of do stuff 1 if that's an option.

I prefer this option because it gives me fine-grained control over the cache lifecycle and lets me clean things up as soon as they are no longer needed, instead of depending on a TTL.
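For completeness, a minimal sketch of the first option, assuming the same kafkaDStream and two outputs as in the question (the count() calls are stand-ins for the real processing):

// Mark every RDD produced by this DStream as cached.
kafkaDStream.cache()

// Both outputs now reuse the cached data instead of re-reading it from Kafka.
kafkaDStream.foreachRDD { rdd => rdd.count() }   // stand-in for "do stuff 1"
kafkaDStream.foreachRDD { rdd => rdd.count() }   // stand-in for "do stuff 2"

// No explicit unpersist is needed: Spark Streaming clears the cached RDDs itself
// (governed by spark.cleaner.ttl on older releases).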

maasg