In short, the best solution is to use a stateful DoFn in Beam. State is scoped per key and per window (the global window in your case). You can buffer the events from one stream in state and, once events with the same key arrive from the other stream, join them with the buffered events. Here is a reference [1].
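To make the buffering idea concrete, here is a minimal sketch in plain Python. It is not Beam API: the `BufferedJoin` class and its `process` method are hypothetical names that only imitate what per-key state in a stateful DoFn would hold, assuming an inner join where both sides are buffered indefinitely.

```python
from collections import defaultdict


class BufferedJoin:
    """Plain-Python stand-in for per-key state in a stateful DoFn (not Beam API)."""

    def __init__(self):
        # key -> buffered values, one buffer per input stream
        self.left = defaultdict(list)
        self.right = defaultdict(list)

    def process(self, side, key, value):
        """Buffer the event and return the joined pairs it completes."""
        if side == "left":
            self.left[key].append(value)
            return [(key, (value, r)) for r in self.right[key]]
        if side == "right":
            self.right[key].append(value)
            return [(key, (l, value)) for l in self.left[key]]
        raise ValueError(f"unknown side: {side!r}")


join = BufferedJoin()
out_a = join.process("left", 1, "a")   # nothing buffered on the right yet
out_b = join.process("right", 1, "b")  # joins with the buffered "a"
out_c = join.process("left", 1, "c")   # joins with the buffered "b"
```

Note that nothing in this sketch ever clears the buffers, so the state grows without bound.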
However, the short answer does not use the true power of the Beam model. The Beam model provides ways to balance latency, cost, and accuracy, and it offers a simple API that hides the complexity of stream processing.
Why do I say that? Let's go back to the short answer's solution, the stateful DoFn. With the stateful DoFn approach, you have no good way to address the following questions:
- What if you have buffered 1M events for one key and no event has appeared from the other stream yet? Do you need to clear the state? What if the event appears right after you cleared the state?
- If a single event eventually arrives to complete the JOIN, is the cost of buffering 1M events acceptable for joining one event from the other stream?
- How do you handle late data on both streams? Say you have joined <1, a> from the left stream with <1, b> from the right stream. Later another <1, c> arrives from the left stream. How do you know that you only need to emit <1, <c, b>>, assuming results are output incrementally? If you start buffering the already joined events to compute the delta, it becomes too complicated for a programmer.
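The first question can be illustrated with a small plain-Python sketch (again not Beam API). The function `join_with_cap` and its `max_buffered` parameter are hypothetical; the eviction policy, clearing a key's buffer once it exceeds the cap, is just one assumption about how you might bound the state by hand.

```python
from collections import defaultdict


def join_with_cap(events, max_buffered):
    """Join two streams per key, clearing a buffer once it exceeds max_buffered.

    events: iterable of (side, key, value), where side is "left" or "right".
    Returns (joined, dropped): the joined pairs and the values lost to eviction.
    """
    buffers = {"left": defaultdict(list), "right": defaultdict(list)}
    joined, dropped = [], []
    for side, key, value in events:
        other = "right" if side == "left" else "left"
        # Join the new event with everything buffered on the other side.
        for match in buffers[other][key]:
            pair = (value, match) if side == "left" else (match, value)
            joined.append((key, pair))
        buffers[side][key].append(value)
        if len(buffers[side][key]) > max_buffered:
            # Hypothetical eviction policy: drop the whole buffer for this key.
            dropped.extend(buffers[side][key])
            buffers[side][key].clear()
    return joined, dropped


events = [("left", 1, "a"), ("left", 1, "b"), ("left", 1, "c"), ("right", 1, "x")]
joined, dropped = join_with_cap(events, max_buffered=2)
```

With a cap of 2, the right-side event arrives just after the left buffer was cleared, so `joined` stays empty and all three left events end up in `dropped`. Raising the cap fixes this input but brings back the buffering cost.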
Beam's windowing, triggers, refinement of output data, watermarks, and lateness SLA control are designed to hide this complexity from you:
- watermark: tells you when a window is complete, so that no more events are expected (and any further events are treated as late data)
- lateness SLA control: controls how long data is cached for the join.
- refinement of output data: updates the output correctly if new events arrive within the allowed lateness.
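As a rough illustration of how the watermark and the lateness bound interact, here is a simplified plain-Python sketch. The function `classify` and its parameters are hypothetical, times are plain numbers in seconds, and real runners handle window boundaries and watermark estimation in more detail than this.

```python
def classify(event_time, window_end, watermark, allowed_lateness):
    """Classify an arriving event for a window that ends at window_end."""
    if event_time >= window_end:
        raise ValueError("event does not belong to this window")
    if watermark < window_end:
        return "on_time"  # the window is not yet considered complete
    if watermark < window_end + allowed_lateness:
        return "late"     # still accepted; earlier output may need refinement
    return "dropped"      # past the lateness bound; window state is released


status_early = classify(event_time=5, window_end=10, watermark=8, allowed_lateness=5)
status_late = classify(event_time=5, window_end=10, watermark=12, allowed_lateness=5)
status_gone = classify(event_time=5, window_end=10, watermark=20, allowed_lateness=5)
```

The allowed lateness is what bounds how long buffered data has to be kept, which is exactly the knob the hand-written stateful approach lacks.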
Although the Beam model is well designed, its implementation is missing critical features needed to support the join you described:
- windowing is not flexible enough to support your case, where the streams have hugely different frequencies (so fixed and sliding windows do not fit). You also don't know the arrival rates of the streams (so session windows do not really fit, since you have to specify a gap between sessions).
- retraction is missing, so you cannot refine your output once late events arrive.
To conclude, the Beam model is designed to handle the complexity of stream processing, which fits your need perfectly. But the implementation is not yet good enough for you to use it for your join use case.
[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html