Outer join two Datasets (not DataFrames) in Spark Structured Streaming

Question

I have some code that joins two streaming DataFrames and outputs to console.

val dataFrame1 =
  df1Input.withWatermark("timestamp", "40 seconds").as("A")

val dataFrame2 =
  df2Input.withWatermark("timestamp", "40 seconds").as("B")

val finalDF: DataFrame = dataFrame1.join(dataFrame2,
      expr(
        "A.id = B.id" +
          " AND " +
          "B.timestamp >= A.timestamp " +
          " AND " +
          "B.timestamp <= A.timestamp + interval 1 hour")
      , joinType = "leftOuter")
finalDF.writeStream.format("console").start().awaitTermination()

What I now want is to refactor this part to use Datasets, so I can have some compile-time checking.

So what I tried was pretty straightforward:

val finalDS: Dataset[(A,B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
      expr(
        "A.id = B.id" +
          " AND " +
          "B.timestamp >= A.timestamp " +
          " AND " +
          "B.timestamp <= A.timestamp + interval 1 hour")
      , joinType = "leftOuter")
finalDS.writeStream.format("console").start().awaitTermination()

However, this gives the following error:

org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;

As you can see, the join code hasn't changed, so there is a watermark on both sides and a range condition. The only change was to use the Dataset API instead of DataFrame.

Also, it works fine when I use an inner join:

val finalDS: Dataset[(A,B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
      expr(
        "A.id = B.id" +
          " AND " +
          "B.timestamp >= A.timestamp " +
          " AND " +
          "B.timestamp <= A.timestamp + interval 1 hour")
      )
finalDS.writeStream.format("console").start().awaitTermination()

Does anyone know how this can happen?

score 3 · Accepted Answer · answered Jul 09 '18 at 11:23

When you use the joinWith method instead of join, you rely on a different implementation, and it seems that this implementation does not support leftOuter joins for streaming Datasets.

See the "Outer Joins with Watermarking" section of the official documentation: it uses join, not joinWith. Note that the result type of join is DataFrame, which means that you will most likely have to map the fields manually:

val finalDS = dataFrame1.as[A].join(dataFrame2.as[B],
    expr(
      "A.id = B.id" +
        " AND " +
        "B.timestamp >= A.timestamp " +
        " AND " +
        "B.timestamp <= A.timestamp + interval 1 hour"),
    joinType = "leftOuter").select(/* useful fields */).as[C]

Thanks. Yeah, indeed I ended up doing this for the time being. Can I edit your answer to also add my solution (in which C is Tuple2[A,B], like the return type of joinWith). I'm aware in the docs that joinWith is not used, but maybe there is a way this can be achieved? I'll leave this open for the time being. — Shikkou, Jul 10 '18 at 08:01
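The manual mapping discussed above, with C being Tuple2[A, B], might look roughly like the sketch below. It is not the poster's actual solution. The case classes A and B are hypothetical (the question does not show their fields), and the built-in "rate" source stands in for df1Input and df2Input. Each side of the joined row is packed into a struct column named _1 or _2 so that the tuple encoder can pick it up. The when(...) guard assumes that B.id is never null in a real match, so an unmatched row yields a null B, as joinWith would for an outer join.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, expr, struct, when}

// Hypothetical schemas: the question does not show the real A and B.
case class A(id: Long, timestamp: Timestamp)
case class B(id: Long, timestamp: Timestamp)

object OuterJoinToTupleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("outer-join-to-tuple-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in input: the "rate" source emits (timestamp, value) rows.
    // The watermark is set on each stream before the join.
    def input(alias: String): DataFrame =
      spark.readStream.format("rate").option("rowsPerSecond", "1").load()
        .selectExpr("value AS id", "timestamp")
        .withWatermark("timestamp", "40 seconds")
        .as(alias)

    // Untyped left outer join, same condition as in the question.
    val joined: DataFrame = input("A").join(
      input("B"),
      expr("A.id = B.id AND B.timestamp >= A.timestamp " +
        "AND B.timestamp <= A.timestamp + interval 1 hour"),
      "leftOuter")

    // Pack each side into a struct named after the tuple fields.
    // Assumption: B.id is non-null whenever a match exists, so a null
    // B.id marks an unmatched row and the whole B struct becomes null.
    val finalDS: Dataset[(A, B)] = joined.select(
      struct(col("A.*")).as("_1"),
      when(col("B.id").isNotNull, struct(col("B.*"))).as("_2")
    ).as[(A, B)]

    finalDS.writeStream.format("console").start().awaitTermination()
  }
}
```

Because the second element can be null for unmatched rows, downstream code should wrap it with Option(...) before dereferencing it.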

score 0 · Answer 2 · answered Nov 19 '21 at 09:05

If you are here to understand why this exception

org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;

still appears even though you have added a watermark to the join and Spark 3 already supports stream-stream joins: you have probably added the watermark AFTER the join, but Spark wants you to add it BEFORE the join, on each stream!
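As a rough illustration of that ordering, the sketch below applies withWatermark to each input before calling join. The object and method names are made up for the example. The column names, aliases, and intervals are taken from the question, and df1Input and df2Input are assumed to be streaming DataFrames with id and timestamp columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

object WatermarkPlacementSketch {
  // Hypothetical helper: both arguments are assumed to be streaming
  // DataFrames that each have "id" and "timestamp" columns.
  def leftOuterJoin(df1Input: DataFrame, df2Input: DataFrame): DataFrame = {
    // Watermark each stream first ...
    val left = df1Input.withWatermark("timestamp", "40 seconds").as("A")
    val right = df2Input.withWatermark("timestamp", "40 seconds").as("B")

    // ... and only then join them with a time-range condition.
    left.join(
      right,
      expr("A.id = B.id AND B.timestamp >= A.timestamp " +
        "AND B.timestamp <= A.timestamp + interval 1 hour"),
      "leftOuter")

    // By contrast, calling withWatermark on the result of
    // df1Input.as("A").join(df2Input.as("B"), ...) is the ordering the
    // answer above warns against.
  }
}
```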
