
I'm trying to implement a toy stream-stream join with Spark 2.3.0.

The join works fine when the condition matches, but when it does not match, the left stream's row is lost, even though I use a left outer join (joinType = "leftOuter").

Thanks in advance

Here are my source code and data. Basically, I'm creating two sockets: port 9999 is the right stream source and port 9998 is the left stream source.

import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.expr

object StreamStream {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("StreamStream")
      .master("local")
      .getOrCreate()

    import spark.implicits._

    spark.sparkContext.setLogLevel("ERROR")

    // Right stream: socket on port 9999, one "id,timestamp" record per line.
    val s9999: DataFrame = spark
      .readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Parse each line and set a 30-second watermark on the right event time.
    val s9999Dataset: Dataset[S9999] = s9999
      .map(line => {
        val strings = line.get(0).toString.split(",")
        val id = strings(0).toInt
        val time = Timestamp.valueOf(strings(1))
        S9999(id, time)
      })
      .withWatermark("timestamp99", "30 seconds")

    // Left stream: socket on port 9998, same record format, no watermark.
    val s9998Dataset: Dataset[S9998] = spark
      .readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9998)
      .load()
      .map(line => {
        val strings = line.get(0).toString.split(",")
        val id = strings(0).toInt
        val time = Timestamp.valueOf(strings(1))
        S9998(id, time)
      })

    // Left outer join on id, with the left event time required to fall
    // within 6 seconds after the right event time.
    val resultDataset = s9998Dataset
      .join(s9999Dataset,
        joinExprs = expr(
          """
                id99 = id98 AND
                timestamp98 >= timestamp99 AND
                timestamp98 <= timestamp99 + interval 6 seconds
        """),
        joinType = "leftOuter")

    val streamingQuery = resultDataset
      .writeStream
      .outputMode("append")
      .format("console")
      .start()

    streamingQuery.awaitTermination()
  }

  case class S9999(id99: Int, timestamp99: Timestamp)

  case class S9998(id98: Int, timestamp98: Timestamp)
}

Sample Data:

Left stream (port 9998):

1,2011-10-02 18:50:20.123
2,2011-10-02 18:50:25.123
3,2011-10-02 18:50:30.123
4,2011-10-02 18:50:35.123
5,2011-10-02 18:50:40.123
6,2011-10-02 18:50:45.123
7,2011-10-02 18:50:50.123
8,2011-10-02 18:50:55.123
9,2011-10-02 18:51:00.123
10,2011-10-02 18:51:05.123
11,2011-10-02 18:51:10.123
12,2011-10-02 18:51:15.123
13,2011-10-02 18:51:20.123
14,2011-10-02 18:51:25.123
15,2011-10-02 18:51:30.123

Right stream (port 9999):

1,2011-10-02 18:50:20.123
3,2011-10-02 18:50:30.123
7,2011-10-02 18:50:50.123
8,2011-10-02 18:50:55.123
9,2011-10-02 18:51:00.123
13,2011-10-02 18:51:20.123
14,2011-10-02 18:51:25.123
15,2011-10-02 18:51:30.123
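
For reference, here is a minimal plain-Scala sketch (no Spark) of what a batch left outer join would return for the sample data above, assuming the same join condition as in the streaming query. The object name `ExpectedLeftOuter` and its helpers are hypothetical and exist only for this illustration. On this data, the left rows with ids 2, 4, 5, 6, 10, 11 and 12 have no match and are the ones I expect to see with nulls on the right side.

```scala
import java.sql.Timestamp

// Hypothetical helper object, not part of the streaming job.
object ExpectedLeftOuter {
  case class Event(id: Int, time: Timestamp)

  // Left sample: ids 1..15, five seconds apart, starting at 18:50:20.123.
  private val base: Long = Timestamp.valueOf("2011-10-02 18:50:20.123").getTime
  val left: Seq[Event] =
    (1 to 15).map(i => Event(i, new Timestamp(base + (i - 1) * 5000L)))

  // Right sample: a subset of the ids, with the same timestamps as on the left.
  private val rightIds = Set(1, 3, 7, 8, 9, 13, 14, 15)
  val right: Seq[Event] = left.filter(e => rightIds(e.id))

  // Same condition as the streaming join: equal ids, and the left time
  // within [right time, right time + 6 seconds].
  def matches(l: Event, r: Event): Boolean =
    l.id == r.id &&
      l.time.getTime >= r.time.getTime &&
      l.time.getTime <= r.time.getTime + 6000L

  // Batch left outer join: every left row appears at least once,
  // paired with None when no right row satisfies the condition.
  def leftOuter(ls: Seq[Event], rs: Seq[Event]): Seq[(Event, Option[Event])] =
    ls.flatMap { l =>
      val hits = rs.filter(r => matches(l, r))
      if (hits.isEmpty) Seq((l, Option.empty[Event]))
      else hits.map(r => (l, Option(r)))
    }

  // Ids of left rows that should be emitted with nulls on the right side.
  def unmatchedIds: Seq[Int] =
    leftOuter(left, right).collect { case (l, None) => l.id }

  def main(args: Array[String]): Unit =
    println(unmatchedIds.mkString(", "))
}
```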
Xu Yan
  • https://stackoverflow.com/questions/49663122/left-outer-join-not-emitting-null-values-when-joining-two-streams-in-spark-struc ? https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#outer-joins-with-watermarking ? – philipxy Feb 27 '21 at 21:36
  • @philipxy have a look at my sample data for the right stream: the watermark is set to 30s, but the data I posted here spans about 1 minute. It also does not work if I add a later timestamp, e.g. 2011-10-02 18:53:30.123. "In short, if any of the two input streams being joined does not receive data for a while, the outer (both cases, left or right) output may get delayed.", as Bah91 suggested, but when will the null value be emitted? – Xu Yan Feb 28 '21 at 02:32
  • LEFT JOIN ON returns INNER JOIN ON rows UNION ALL unmatched left table rows extended by NULLs. Whether a given left table row is unmatched may not be known until after getting all right table rows. No null-extended row can be emitted until some stream is known to be finished. Shouldn't you watermark 99? Your loop is driven by left/99 rows so shouldn't you join until after all right/98 rows? I don't see that you do, but all I know about this is streaming, joining, this code & the left join manual section. (It's not clear from your post whether you also have problems with INNER JOIN.) – philipxy Feb 28 '21 at 04:28
  • @philipxy a watermark on the right stream is mandatory; 98 is the left stream and 99 is the right one. The inner join works fine. – Xu Yan Feb 28 '21 at 06:25

1 Answer


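After spending 6 hours on this question, I found that the watermark on the left side, which the documentation describes as optional, is actually mandatory. In my code, that means s9998Dataset also needs a withWatermark call on timestamp98, just as s9999Dataset has one on timestamp99.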
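
A minimal sketch of that change, under a few assumptions: the 30-second delay on the left stream simply mirrors the right stream (the threshold is my choice for illustration, not a required value), and the object name `LeftWatermarkSketch` plus the helpers `parse` and `socketLines` are hypothetical names introduced here to keep the example short. Everything else follows the code in the question.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical object name; the join itself is the one from the question.
object LeftWatermarkSketch {
  case class S9999(id99: Int, timestamp99: Timestamp)
  case class S9998(id98: Int, timestamp98: Timestamp)

  // Hypothetical helper: splits "id,yyyy-MM-dd HH:mm:ss.SSS" into its fields.
  def parse(line: String): (Int, Timestamp) = {
    val Array(id, time) = line.split(",", 2)
    (id.trim.toInt, Timestamp.valueOf(time.trim))
  }

  // Hypothetical helper: reads text lines from a local socket source.
  def socketLines(spark: SparkSession, port: Int): Dataset[String] = {
    import spark.implicits._
    spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", port)
      .load()
      .as[String]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StreamStream")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    // Right stream: watermark unchanged from the question.
    val right: Dataset[S9999] = socketLines(spark, 9999)
      .map { l => val (id, t) = parse(l); S9999(id, t) }
      .withWatermark("timestamp99", "30 seconds")

    // Left stream: the added watermark. "30 seconds" is an assumed delay.
    val left: Dataset[S9998] = socketLines(spark, 9998)
      .map { l => val (id, t) = parse(l); S9998(id, t) }
      .withWatermark("timestamp98", "30 seconds")

    val joined = left.join(
      right,
      expr(
        """
          id99 = id98 AND
          timestamp98 >= timestamp99 AND
          timestamp98 <= timestamp99 + interval 6 seconds
        """),
      "leftOuter")

    joined.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

Even with both watermarks set, the null-extended rows appear with a delay: as the documentation passage quoted in the comments says, the outer output may be delayed if either stream receives no data for a while, so more data may have to arrive on the sockets before the unmatched rows show up.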

Xu Yan
  • This would be a much more useful answer if you explained it more clearly and connected it to your code. – philipxy Feb 28 '21 at 07:37