I have two streams, a 'left' stream and a 'right' stream, and I would like to do a leftOuter join on them. I want to collect the events on the 'left' stream that could not be joined with the 'right' stream. Both streams have a watermark delay of 20 minutes.
The issue is this: as long as data keeps arriving on the 'right' stream within the watermark, the unjoined events show up. But if, say, both streams have been idle for a day and I then generate 'left' events without generating any 'right' events, the 'left' events never show up as unjoined data. They appear to be dropped.
I am expecting the unjoined events to show up once the watermark has passed.
The code is as follows:
from pyspark.sql.functions import col, expr

# Watermark each stream on its own event-time column
leftdf = left.withWatermark("left_event_time", "20 minutes")
rightdf = right.withWatermark("right_event_time", "20 minutes")
joineddf = leftdf.join(
    rightdf,
    expr("""
        left_id = right_id AND
        left_event_time >= right_event_time - interval 20 minutes AND
        left_event_time <= right_event_time + interval 20 minutes
    """),
    "leftOuter"
)
# Left rows that found a match on the right
successdf = joineddf.filter(col("right_field").isNotNull())
# Left rows with no match (right-side columns are NULL)
unjoined = joineddf.filter(col("right_field").isNull())
I am expecting to get unjoined events even if rightdf is empty.
As an experiment, I changed the watermark to 10 seconds and generated events only on the 'left' stream. Even after waiting about 7 minutes, the unjoined events did not show up. But as soon as I generated an event on the 'right' stream, the unjoined events that had been generated earlier appeared.
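The observation above is consistent with the global watermark being held back by the idle stream. The following is only a toy model in plain Python, not Spark code. It assumes the default multiple-watermark policy (`spark.sql.streaming.multipleWatermarkPolicy` set to `min`), under which the global watermark is the minimum of the per-stream watermarks. The helper names `stream_watermark` and `global_watermark` and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=20)  # same delay as in the question


def stream_watermark(max_event_time, delay=DELAY):
    """Watermark of one stream: max event time seen so far minus the delay."""
    return max_event_time - delay


def global_watermark(per_stream_watermarks):
    """Assumed 'min' policy: the slowest stream sets the global watermark."""
    return min(per_stream_watermarks)


day1 = datetime(2024, 1, 1, 12, 0)  # hypothetical timestamps
day2 = day1 + timedelta(days=1)

# Day 1: both streams have seen events up to 12:00.
left_wm = stream_watermark(day1)
right_wm = stream_watermark(day1)

# Day 2: only 'left' receives events; the 'right' watermark stays where it was.
left_wm = stream_watermark(day2)
stalled = global_watermark([left_wm, right_wm])

# One 'right' event on day 2 moves its watermark, and the global one follows.
right_wm = stream_watermark(day2)
advanced = global_watermark([left_wm, right_wm])

print("only left active :", stalled)
print("after right event:", advanced)
```

In this model the global watermark stays at the day-1 value while only 'left' is active, so the new 'left' rows would remain buffered rather than be emitted with NULLs. A single 'right' event is enough to move it forward, which matches the 10-second experiment. Whether this is the whole explanation for the real job is an assumption.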