Join 2 unbounded Pcollections on key

Question

I am trying to join two unbounded PCollections, which I read from two different Kafka topics, on a key.

According to the docs and various blogs, a join is only possible with windowing: the window collects the messages that arrive on both streams within it and joins them. That is not what I need.

In my case, messages arrive at a very low frequency on one stream and at a high frequency on the other. If a key has not yet arrived on both streams, I want to hold off on the join until it has, and do the join once it arrives. Is this possible with the current Beam paradigm?

score 2 · Answer 1 · answered Apr 16 '19 at 18:44

In short, the best solution is to use a stateful DoFn in Beam. State is kept per key (and per window, which is the global window in your case). You can save the events from one stream in state and, once events with the same key appear on the other stream, join them with the events in state. Here is a reference [1].

However, the short answer does not use the true power of the Beam model. The Beam model provides ways to balance latency, cost, and accuracy, and it offers a simple API that hides the complexity of stream processing.

Why do I say that? Let's go back to the short answer's solution, the stateful DoFn. With that approach, you lack ways to address the following questions:

What if you have buffered 1M events for one key and no event has appeared yet on the other stream? Do you need to empty the state? What if the event appears right after you have emptied it?
If one event eventually appears and completes the join, is the cost of buffering 1M events acceptable for joining a single event from the other stream?
How do you handle late data on both streams? Say you have joined <1, a> from the left stream with <1, b> from the right stream. Later, another <1, c> arrives on the left stream. How do you know that you only need to emit <1, <c, b>>, assuming results are output incrementally? If you start buffering the already-joined events to compute the delta, it becomes too complicated for a programmer.
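To make the third question concrete, here is a minimal plain-Python sketch of an incremental join. It is not Beam code: the function name `incremental_join` and the `(key, side, value)` event layout are assumptions made for illustration only. It emits just the new pairs for each arriving event, and it can only do that by keeping every event from both sides in a per-key buffer that is never evicted.

```python
from collections import defaultdict


def incremental_join(events):
    """Yield only the new (key, (left, right)) pairs caused by each event.

    `events` is an iterable of (key, side, value) tuples, where side is
    "left" or "right". This layout is hypothetical, not a Beam type.
    """
    # Per-key buffers for both sides. Nothing is ever evicted here, which
    # is exactly the unbounded-state problem raised in the questions above.
    buffers = defaultdict(lambda: {"left": [], "right": []})
    for key, side, value in events:
        state = buffers[key]
        other = "right" if side == "left" else "left"
        # Pair the new event with everything already seen on the other side.
        for match in state[other]:
            if side == "left":
                yield (key, (value, match))
            else:
                yield (key, (match, value))
        state[side].append(value)


events = [(1, "left", "a"), (1, "right", "b"), (1, "left", "c")]
result = list(incremental_join(events))
```

With the three events from the example, `result` holds the pair for a and b followed by the pair for c and b, and nothing is emitted twice. The price is that both buffers grow for as long as the key stays alive.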

Beam's windowing, triggers, refinement of output data, watermark, and lateness SLA control are designed to hide this complexity from you:

watermark: tells you when windows are complete, meaning that no more events are expected (any further events are treated as late data).
lateness SLA control: controls how long you cache data for the join.
refinement of output data: updates the output correctly when late events that are still allowed arrive.

Although the Beam model is well designed, its implementation is missing critical features needed to support the join you described:

windowing is not flexible enough to support your case, where the streams have hugely different frequencies (so fixed and sliding windows do not fit). You also don't know the arrival rate of the streams (so session windows do not really fit either, as you have to specify a gap between session windows).
retraction is missing, so you cannot refine your output once late events arrive.

To conclude, the Beam model is designed to handle the complexity of stream processing, which perfectly fits your need. But the implementation is not yet good enough for you to use it for your join use case.

[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html

True, I agree with you on that. The Beam model has spread its wings, but some of the implementation still needs to be done. That is why I was unsure whether it can be done at all. Thanks Rui for the detailed info on it. — capt2101akash, Apr 18 '19 at 07:10

score 0 · Answer 2 · answered Apr 16 '19 at 18:42

This isn't well supported by the Beam model today, but there are a few ways you can do it. These examples assume each key appears exactly once on each stream; if that isn't the case, you'll need to adjust them.

One option is to use the Global Window and Stateful DoFn instead of a Join. The Global Window effectively turns windowing off. A stateful DoFn lets you store data about the key you are processing in a "state cell" for later use. When you receive a record, you would check the state cell for a value. If you find one, do the join, emit the value, and clear the state. If there isn't anything, store the current value.
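The state-cell logic described above can be sketched in plain Python. This is only a model of the per-key decision, not Beam's state API: the name `one_shot_join`, the `LEFT`/`RIGHT` tags, and the `(key, side, value)` event layout are assumptions for illustration. A dict stands in for the per-key state cell.

```python
# Hypothetical tags identifying which stream an event came from.
LEFT, RIGHT = "left", "right"


def one_shot_join(events):
    """Join each key once, as soon as it has been seen on both streams.

    `events` is an iterable of (key, side, value) tuples. Yields
    (key, (left_value, right_value)).
    """
    state = {}  # stands in for the per-key state cell: key -> (side, value)
    for key, side, value in events:
        pending = state.get(key)
        if pending is None or pending[0] == side:
            # Nothing stored from the other stream yet: store the current
            # value. Overwriting a same-side value is an assumption made
            # here; the answer assumes each key appears once per stream.
            state[key] = (side, value)
            continue
        # The other side is already stored: join, emit, clear the state.
        del state[key]
        other_value = pending[1]
        if side == LEFT:
            yield key, (value, other_value)
        else:
            yield key, (other_value, value)


events = [("k1", LEFT, "a"), ("k2", RIGHT, "x"),
          ("k1", RIGHT, "b"), ("k2", LEFT, "y")]
joined = list(one_shot_join(events))
```

Each key is emitted exactly once, in the order in which its second half arrives, regardless of which stream delivered first. In a real pipeline the dict would be replaced by Beam's per-key state, and a key that never shows up on the other stream would stay stored until something clears it.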

Another option is to use Session Windows and Join. The session window "GapDuration" is effectively a timeout on a given key. This works as long as you have a time bound within which you will see the key on both streams. You'll also want to set up an element count trigger, "AfterPane.elementCountAtLeast(2)", so you don't have to wait for the full timeout after seeing the second piece of data.
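The timeout behaviour can be illustrated with a small plain-Python model. It is a rough approximation of how a session gap acts on a key, not Beam's windowing or trigger API: the function `session_join`, the `(timestamp, key, side, value)` layout, and the handling of unmatched values are all assumptions for illustration. Emitting as soon as the second element arrives plays the role of the element count trigger.

```python
def session_join(events, gap):
    """Model a per-key join with a session-style timeout.

    `events` is an iterable of (timestamp, key, side, value) tuples sorted
    by timestamp; `gap` is the timeout in the same time unit. Returns
    (joined, unmatched).
    """
    pending = {}  # key -> (timestamp, side, value) waiting for its partner
    joined, unmatched = [], []
    for ts, key, side, value in events:
        held = pending.pop(key, None)
        if held is not None:
            held_ts, held_side, held_value = held
            if ts - held_ts < gap and held_side != side:
                # Partner arrived inside the gap: emit right away instead
                # of waiting for the session to time out.
                if held_side == "left":
                    joined.append((key, (held_value, value)))
                else:
                    joined.append((key, (value, held_value)))
                continue
            # Timed out (or superseded by the same side): give up on it.
            unmatched.append((key, held_side, held_value))
        pending[key] = (ts, side, value)
    # Whatever is still waiting at the end never found a partner.
    unmatched.extend((k, s, v) for k, (_, s, v) in pending.items())
    return joined, unmatched


events = [(0, "k1", "left", "a"), (5, "k1", "right", "b"),
          (10, "k2", "left", "c"), (100, "k2", "right", "d")]
joined, unmatched = session_join(events, gap=60)
```

Here k1 is joined because both halves arrive 5 time units apart, while the two halves of k2 are 90 units apart, exceed the gap of 60, and both end up unmatched. That is the limitation noted above: the approach only works when there is a known time bound within which the key appears on both streams.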

I would say windowed joins will not satisfy my need, as the join is only valid within that particular window. My use case is global: I need to update the key's value whenever I see a new value for that key. With session windows, the key-value pair would only be updated for that particular session. Regarding stateful DoFns, I agree with the explanation above: it's a way to do it, but it is not yet implemented well enough for us to go with it. — capt2101akash, Apr 18 '19 at 07:15
