
A left outer join on two streams is not emitting the null outputs; it just keeps waiting for a matching record to arrive on the other stream. I am using socket streams to test this. In our case, we want to emit records with null values when they either don't match on id or don't fall within the time-range condition.

The watermarks and the join interval are set up as follows:

import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.streaming.OutputMode

// Left side: 10-second watermark on its event-time column.
val ds1Map = ds1
  .selectExpr("Id AS ds1_Id", "ds1_timestamp")
  .withWatermark("ds1_timestamp", "10 seconds")

// Right side: 20-second watermark.
val ds2Map = ds2
  .selectExpr("Id AS ds2_Id", "ds2_timestamp")
  .withWatermark("ds2_timestamp", "20 seconds")

// Left outer join on id, with ds2 rows allowed up to 1 minute after ds1 rows.
val output = ds1Map.join(
  ds2Map,
  expr("""ds1_Id = ds2_Id AND
          ds2_timestamp >= ds1_timestamp AND
          ds2_timestamp <= ds1_timestamp + interval 1 minutes"""),
  "leftOuter")

val query = output.select("*")
  .writeStream
  .outputMode(OutputMode.Append)
  .format("console")
  .option("checkpointLocation", "./spark-checkpoints/")
  .start()

query.awaitTermination()
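For reproduction: ds1 and ds2 come from socket streams used for testing, roughly like the sketch below (host, ports, and the "id,timestamp" line format are just illustrative):

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("stream-stream-left-outer-join")
  .getOrCreate()
import spark.implicits._

// Each socket line is expected as "<id>,<yyyy-MM-dd HH:mm:ss>".
def socketSource(port: Int, tsCol: String) =
  spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", port)
    .load()
    .as[String]
    .map { line =>
      val Array(id, ts) = line.split(",")
      (id, Timestamp.valueOf(ts))
    }
    .toDF("Id", tsCol)

val ds1 = socketSource(9998, "ds1_timestamp")
val ds2 = socketSource(9999, "ds2_timestamp")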

Thank you.

user3191291
2 Answers


This may be due to one of the caveats of the micro-batch engine implementation, as noted in the Structured Streaming programming guide here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking

In the current implementation in the micro-batch engine, watermarks are advanced at the end of a micro-batch, and the next micro-batch uses the updated watermark to clean up state and output outer results. Since we trigger a micro-batch only when there is new data to be processed, the generation of the outer result may get delayed if there is no new data being received in the stream. In short, if any of the two input streams being joined does not receive data for a while, the outer (both cases, left or right) output may get delayed.

This was the case for me: the null rows were not flushed out until a further batch was triggered some time later.
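To make the mechanics concrete with the question's settings (10-second watermark, 1-minute join window; the timings below are made up):

// Batch N:   ds1 receives (Id = 1, ds1_timestamp = 10:00:00), no match on ds2.
//            The row is buffered in join state; nothing is emitted, and the
//            watermark only advances at the END of this micro-batch.
// Batch N+1: is triggered only when NEW data arrives. If rows with event
//            times around 10:02:00 later arrive on both streams, the
//            watermark moves past 10:00:00 + 1 minute, and the buffered row
//            is finally emitted as (1, 10:00:00, null, null).
// If the streams go quiet instead, no further micro-batch runs, the
// watermark stays put, and the null row is never flushed.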

mjuarez
Bah91
  • Hi, I'm trying a Spark 2.3.0 stream-to-stream join with a 15-minute watermark. Are you getting any OOMs? Does it use a lot of memory (mine is using 150 GB for each join)? I'm joining two Kafka topics and producing the result to another Kafka topic. – Arnon Rodman Jul 04 '18 at 08:44
  • Hi Jack, please see my response below. – Arnon Rodman Jul 24 '19 at 08:25

Hi Jack, and thanks for the response. The question/issue was a year and a half ago, and it took some time to recover what I did last year :). I ran a stream-to-stream join on two topics, one with more than 10K msg/sec, on a Spark cluster with 4.67 TB total memory and 1,614 total vCores.

The implementation was a plain Structured Streaming stream-to-stream join, as in the official Spark documentation:

// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """)
)
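For reference, the two inputs in that snippet already carry event-time watermarks. They are defined along these lines in the same guide (the impressions and clicks source DataFrames are assumed):

// Both sides need a watermark so the engine knows when join state
// can be dropped.
val impressionsWithWatermark = impressions
  .selectExpr("adId AS impressionAdId", "impressionTime")
  .withWatermark("impressionTime", "2 hours")

val clicksWithWatermark = clicks
  .selectExpr("adId AS clickAdId", "clickTime")
  .withWatermark("clickTime", "3 hours")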

It ran for a few hours until it hit an OOM. After investigating, I found the issue with Spark's state cleanup in HDFSBackedStateStoreProvider and the open Spark JIRA:

https://issues.apache.org/jira/browse/SPARK-23682

Memory issue with spark structured streaming
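A mitigation that is sometimes suggested (it reduces, but does not fix, the memory held by HDFSBackedStateStoreProvider) is to retain fewer versions of the state; these are standard Spark SQL settings, shown here with illustrative values:

// Fewer retained state versions => less executor memory held by the
// HDFSBackedStateStoreProvider (a mitigation only; see SPARK-23682).
spark.conf.set("spark.sql.streaming.minBatchesToRetain", 2)
// Spark 2.4+: cap how many state versions are kept in memory (vs. on disk).
spark.conf.set("spark.sql.streaming.maxBatchesToRetainInMemory", 1)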

In the end, that is why I moved back to Spark Streaming 2.1.1 and implemented the stream-to-stream join with mapWithState.
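Roughly, the shape of that mapWithState join (a sketch from memory, not the exact production code; the stream names, String payloads, and 60-second timeout are placeholders):

import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Two DStreams keyed by id (e.g. read from the Kafka topics); placeholders.
val ds1Keyed: DStream[(String, String)] = ???
val ds2Keyed: DStream[(String, String)] = ???

// Tag each side so a single state function can handle both streams.
val tagged = ds1Keyed.mapValues(v => Left(v): Either[String, String])
  .union(ds2Keyed.mapValues(v => Right(v): Either[String, String]))

// State per id: the buffered left and/or right value.
type Buf = (Option[String], Option[String])

val spec = StateSpec.function(
  (id: String, in: Option[Either[String, String]], state: State[Buf]) => {
    if (state.isTimingOut()) {
      // No match arrived within the timeout: emit the left-outer null row.
      val (left, _) = state.get()
      left.map(v => (id, v, null: String))
    } else {
      val (l, r) = state.getOption().getOrElse((None, None))
      in.get match {
        case Left(v) if r.isDefined  => state.remove(); Some((id, v, r.get))
        case Left(v)                 => state.update((Some(v), r)); None
        case Right(v) if l.isDefined => state.remove(); Some((id, l.get, v))
        case Right(v)                => state.update((l, Some(v))); None
      }
    }
  }
).timeout(Seconds(60)) // plays the role of the watermark/window bound

// Matched pairs, plus (id, left, null) rows for timed-out unmatched lefts.
// Note: mapWithState requires checkpointing on the StreamingContext.
val joined = tagged.mapWithState(spec).flatMap(_.toSeq)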

Thx

Arnon Rodman