JavaRDD<String> history_ = sc.emptyRDD();

java.util.Queue<JavaRDD<String>> queue = new LinkedList<>();
queue.add(history_);
JavaDStream<String> history_dstream = ssc.queueStream(queue);

JavaPairDStream<String, ArrayList<String>> history = history_dstream.mapToPair(r ->
    new Tuple2<>(null, null));

 JavaPairInputDStream<String, GenericData.Record> stream_1 =
    KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class, StringDecoder.class,
        GenericDataRecordDecoder.class, props, topicsSet_1);


JavaPairInputDStream<String, GenericData.Record> stream_2 =
    KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class, StringDecoder.class,
        GenericDataRecordDecoder.class, props, topicsSet_2);

Then I do some transformations and create two DStreams, Data_1 and Data_2, of type

JavaPairDStream<String, ArrayList<String>>

and do the join as below. I then filter out the records for which there was no joining key and save them in history, to be used in the next batch by unioning it with Data_1:

Data_1 = Data_1.union(history);

JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> joined =
    Data_1.leftOuterJoin(Data_2).cache();


JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> notNULL_join = joined.filter(r -> r._2._2().isPresent());
JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> dstream_filtered = joined.filter(r -> !r._2._2().isPresent());

history = dstream_filtered.mapToPair(r -> {
  return new Tuple2<>(r._1, r._2._1);
}).persist(StorageLevel.MEMORY_AND_DISK());

I do get history after the previous step (verified by saving it to HDFS), but this history is still empty in the next batch when the union runs.

  • If I understand this correctly, what you would like to achieve is to keep a history of records of elements of `Data1` not yet found in `Data2` until they are found. Is that right? Any other additional requirement? What's the usecase? – maasg Jun 08 '17 at 12:52
  • Yes, that's it. Now what do you suggest? – JSR29 Jun 08 '17 at 12:53
  • No additional requirements; the use case is to find click counts per user by joining pageview and click events. – JSR29 Jun 08 '17 at 13:24

1 Answer


It's conceptually not possible to "remember" a DStream. DStreams are time-bound and on each clock-tick (called "batch interval") the DStream represents the observed data in the stream during that period of time.

Hence, we cannot have an "old" DStream saved to join with a "new" DStream. All DStreams live in the "now".

The underlying data structure of a DStream is the RDD: at each batch interval, the DStream holds one RDD with the data for that interval. RDDs represent a distributed collection of data. RDDs are immutable, and they remain available for as long as we hold a reference to them.
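
For example, a minimal sketch (assuming a Scala `DStream` named `dstream`) that makes this one-RDD-per-interval relationship visible:

// Each invocation receives the single RDD backing the DStream
// for the current batch interval, together with the batch time.
dstream.foreachRDD { (rdd, time) =>
  println(s"batch at $time contains ${rdd.count()} records")
}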

We can combine RDDs and DStreams to create the "history roll over" that's required here.

This looks quite similar to the approach in the question, but it keeps the history as a plain RDD instead of wrapping it in a DStream.

Here's a high-level view of the suggested changes:

var history: RDD[(String, List[String])] = sc.emptyRDD()

val dstream1 = ...
val dstream2 = ...

val historyDStream = dstream1.transform(rdd => rdd.union(history))
val joined = historyDStream.leftOuterJoin(dstream2)

... do stuff with joined as above, obtain dstreamFiltered ...

dstreamFiltered.foreachRDD { rdd =>
  val formatted = rdd.map { case (k, (v1, v2)) => (k, v1) } // get rid of the join info
  history.unpersist(false)                      // unpersist the 'old' history RDD
  history = formatted                           // assign the new history
  history.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
  history.count()                               // action to materialize this transformation
}

This is only a starting point. There are additional considerations regarding checkpointing: without it, the lineage of the history RDD grows unbounded until a StackOverflowError eventually occurs. This blog covers the technique in detail: http://www.spark.tc/stateful-spark-streaming-using-transform/
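
As a minimal sketch of that checkpointing (the every-10-batches interval and the `batches` counter are illustrative assumptions; it also assumes a checkpoint directory was set via `ssc.checkpoint(...)`):

var batches = 0L
dstreamFiltered.foreachRDD { rdd =>
  val formatted = rdd.map { case (k, (v1, v2)) => (k, v1) }
  history.unpersist(false)
  history = formatted
  history.persist(StorageLevel.MEMORY_AND_DISK)
  batches += 1
  if (batches % 10 == 0) history.checkpoint() // truncate the lineage periodically
  history.count() // the action materializes (and, when scheduled, checkpoints) the RDD
}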

I also recommend using Scala instead of Java; the Java syntax is too verbose for Spark Streaming.

maasg
  • @JSR29 it should be `formatted`. Fixed it in the answer. – maasg Jun 08 '17 at 14:02
  • regarding how history builds up, it's intrinsically composed of unions of `RDD`s as a consequence of this line: `dstream1.transform(rdd => rdd.union(history))`. That's why I mention that checkpointing is important. Otherwise, the lineage of that RDD will grow unbounded over time. – maasg Jun 08 '17 at 14:30
  • regarding java, you can probably get around the `final` requirement by using an array as a mutable intermediate. In general, I find Java a bad choice for use with Spark Streaming. The syntactical constructions are awful. – maasg Jun 08 '17 at 14:34
  • Can you say something about using `mapWithState`? – JSR29 Jun 08 '17 at 14:34
  • You could probably approach the click count usecase with `mapWithState`, but not in the approach of this question. Give it a try and ask new questions if those arise. StructuredStreaming is another option in the Spark family. – maasg Jun 08 '17 at 15:07
  • Can I use a window of 24 hrs? – JSR29 Jun 09 '17 at 04:58
  • It's your choice to use java, not mine. ;-) Sorry, I won't go through that pain. Not even the programming guide does: http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#design-patterns-for-using-foreachrdd – maasg Jun 09 '17 at 07:50
  • Thanks @maasg, this approach worked; I have to verify the results. – JSR29 Jun 09 '17 at 11:24
  • Cool. As you productize the job, don't forget to add the `checkpoint` function as explained in that blog. Otherwise, it will crash after some amount of runtime (hours? days?). – maasg Jun 09 '17 at 11:29
  • Can you explain the building of the history? – JSR29 Jun 11 '17 at 08:51
  • @JSR29 Yes, I could, but I don't think it would fit into a comment. Care to open a new question? – maasg Jun 11 '17 at 10:12
  • here is the link of the question https://stackoverflow.com/questions/44482883/how-history-rdds-are-preserved-for-further-use-in-the-given-code – JSR29 Jun 11 '17 at 10:41
  • "RDD: Each batch interval, our DStream will have 1 RDD of the data for that interval" . is that a assumption you are making – JSR29 Jun 14 '17 at 11:41
  • Why didn't persist and cache work on the filtered DStream? remember didn't work either. – JSR29 Jun 22 '17 at 11:59