
The programming guide says that Structured Streaming guarantees end-to-end exactly-once semantics when using appropriate sources/sinks.

However, I don't understand how this works when the job crashes and a watermark is applied.

Below is an example of how I currently imagine it working; please correct me on any points I'm misunderstanding. Thanks in advance!

Example:

Spark job: count the number of events in each 1-hour window, with a 1-hour watermark.

Messages:

  • A - timestamp 10am
  • B - timestamp 10:10am
  • C - timestamp 10:20am
  • X - timestamp 12pm
  • Y - timestamp 12:50pm
  • Z - timestamp 8pm

We start the job, read A, B, C from the Source and the job crashes at 10:30am before we've written them out to our Sink.

At 6pm the job comes back up and knows to re-process A, B, C using the saved checkpoint/WAL. The final count is 3 for the 10-11am window.

Next, it reads the new messages X, Y, Z from Kafka in parallel, since they belong to different partitions. Z is processed first, so the max event timestamp gets set to 8pm. When the job reads X and Y, they are now behind the watermark (8pm - 1 hour = 7pm), so they are discarded as old data. The final count is 1 for the 8-9pm window, and the job does not report anything for the 12-1pm window. We've lost the data for X and Y.

---End example---

Is this scenario accurate? If so, a 1-hour watermark may be sufficient to handle late/out-of-order data when it flows normally from Kafka to Spark, but not when the Spark job goes down or the Kafka connection is lost for a long period of time. Would the only option to avoid data loss be to use a watermark longer than you ever expect the job to be down for?
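For reference, here is roughly the kind of query I have in mind. This is just a sketch; the Kafka servers, topic name, and value schema are made-up placeholders, and I'm assuming the spark-sql-kafka source with JSON-encoded values.

import org.apache.spark.sql.functions.{from_json, window}
import org.apache.spark.sql.types._
import spark.implicits._

// Placeholder schema for the message value: an id plus the event-time timestamp
val schema = StructType(StructField("id", StringType) ::
                        StructField("timestamp", TimestampType) ::
                        Nil)

val events = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "events")                       // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), schema).as("event"))
  .select("event.*")

// Count events per 1-hour event-time window, tolerating up to 1 hour of lateness
val eventCounts = events
  .withWatermark("timestamp", "1 hour")
  .groupBy(window($"timestamp", "1 hour"))
  .count()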

Ray J
  • In my understanding, Spark would sort incoming data by watermarked field, so Z would be seen last. – nonsleepr Aug 09 '17 at 03:56
  • Is there a reference for that? From what I understand, Spark will read data from different Kafka partitions in parallel; it's only within a single partition that data will be processed serially. – Ray J Aug 09 '17 at 20:59
  • My assumption above is wrong: I was misled by the execution plan of my job (more complex than this example), which did sorting. – nonsleepr Aug 09 '17 at 21:55

2 Answers


The watermark is a fixed value during a minibatch. In your example, since X, Y and Z are processed in the same minibatch, the watermark used for these records would be 9:20am. After that minibatch completes, the watermark would be updated to 7pm.

Below is a quote from the design doc for SPARK-18124, the feature that implements the watermarking functionality:

To calculate the drop boundary in our trigger based execution, we have to do the following.

  • In every trigger, while aggregating the data, we also scan for the max value of event time in the trigger data
  • After trigger completes, compute watermark = MAX(event time before trigger, max event time in trigger) - threshold
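Applied to your example, the arithmetic works out like this (just a sketch of the rule above using a made-up helper, not Spark's internal code):

import java.sql.Timestamp

// Watermark after a trigger = max event time seen so far minus the threshold
def updatedWatermark(maxBeforeTrigger: Timestamp, maxInTrigger: Timestamp, thresholdMs: Long): Timestamp =
  new Timestamp(math.max(maxBeforeTrigger.getTime, maxInTrigger.getTime) - thresholdMs)

val oneHourMs = 60L * 60 * 1000

// Batch with A, B, C runs with the initial watermark (epoch);
// afterwards: max(epoch, 10:20) - 1 hour = 09:20
val wmAfterBatch0 = updatedWatermark(
  new Timestamp(0L), Timestamp.valueOf("2017-08-09 10:20:00"), oneHourMs)

// Batch with X, Y, Z still runs with 09:20, so X and Y are counted;
// afterwards: max(10:20, 20:00) - 1 hour = 19:00
val wmAfterBatch1 = updatedWatermark(
  Timestamp.valueOf("2017-08-09 10:20:00"), Timestamp.valueOf("2017-08-09 20:00:00"), oneHourMs)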

A simulation is probably more descriptive:

// Meant to be run in spark-shell, where `spark` and `sc` are already defined
import org.apache.hadoop.fs.Path
import java.sql.Timestamp
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.ProcessingTime
import spark.implicits._

val dir = new Path("/tmp/test-structured-streaming")
val fs = dir.getFileSystem(sc.hadoopConfiguration)
fs.mkdirs(dir)

val schema = StructType(StructField("value", StringType) ::
                        StructField("timestamp", TimestampType) ::
                        Nil)

val eventStream = spark
  .readStream
  .option("sep", ";")
  .option("header", "false")
  .schema(schema)
  .csv(dir.toString)

// Watermarked aggregation
val eventsCount = eventStream
  .withWatermark("timestamp", "1 hour")
  .groupBy(window($"timestamp", "1 hour"))
  .count

// Helper that drops a small CSV file into the watched directory
def writeFile(path: Path, data: String): Unit = {
  val file = fs.create(path)
  file.writeUTF(data)
  file.close()
}

// Debug query
val query = eventsCount.writeStream
  .format("console")
  .outputMode("complete")
  .option("truncate", "false")
  .trigger(ProcessingTime("5 seconds"))
  .start()

writeFile(new Path(dir, "file1"), """
  |A;2017-08-09 10:00:00
  |B;2017-08-09 10:10:00
  |C;2017-08-09 10:20:00""".stripMargin)

query.processAllAvailable()
val lp1 = query.lastProgress

// -------------------------------------------
// Batch: 0
// -------------------------------------------
// +---------------------------------------------+-----+
// |window                                       |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3    |
// +---------------------------------------------+-----+

// lp1: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
//   ...
//   "numInputRows" : 3,
//   "eventTime" : {
//     "avg" : "2017-08-09T10:10:00.000Z",
//     "max" : "2017-08-09T10:20:00.000Z",
//     "min" : "2017-08-09T10:00:00.000Z",
//     "watermark" : "1970-01-01T00:00:00.000Z"
//   },
//   ...
// }


writeFile(new Path(dir, "file2"), """
  |Z;2017-08-09 20:00:00
  |X;2017-08-09 12:00:00
  |Y;2017-08-09 12:50:00""".stripMargin)

query.processAllAvailable()
val lp2 = query.lastProgress

// -------------------------------------------
// Batch: 1
// -------------------------------------------
// +---------------------------------------------+-----+
// |window                                       |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3    |
// |[2017-08-09 12:00:00.0,2017-08-09 13:00:00.0]|2    |
// |[2017-08-09 20:00:00.0,2017-08-09 21:00:00.0]|1    |
// +---------------------------------------------+-----+
  
// lp2: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
//   ...
//   "numInputRows" : 3,
//   "eventTime" : {
//     "avg" : "2017-08-09T14:56:40.000Z",
//     "max" : "2017-08-09T20:00:00.000Z",
//     "min" : "2017-08-09T12:00:00.000Z",
//     "watermark" : "2017-08-09T09:20:00.000Z"
//   },
//   "stateOperators" : [ {
//     "numRowsTotal" : 3,
//     "numRowsUpdated" : 2
//   } ],
//   ...
// }

writeFile(new Path(dir, "file3"), "")

query.processAllAvailable()
val lp3 = query.lastProgress

// -------------------------------------------
// Batch: 2
// -------------------------------------------
// +---------------------------------------------+-----+
// |window                                       |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3    |
// |[2017-08-09 12:00:00.0,2017-08-09 13:00:00.0]|2    |
// |[2017-08-09 20:00:00.0,2017-08-09 21:00:00.0]|1    |
// +---------------------------------------------+-----+
  
// lp3: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
//   ...
//   "numInputRows" : 0,
//   "eventTime" : {
//     "watermark" : "2017-08-09T19:00:00.000Z"
//   },
//   "stateOperators" : [ ],
//   ...
// }

query.stop()
fs.delete(dir, true)

Notice how Batch 0 started with watermark 1970-01-01 00:00:00 while Batch 1 started with watermark 2017-08-09 09:20:00 (the max event time of Batch 0, 10:20, minus 1 hour). Batch 2, while empty, used watermark 2017-08-09 19:00:00 (the max event time of Batch 1, 20:00, minus 1 hour).

nonsleepr

Z is processed first, so the max event timestamp gets set to 8pm.

That's correct. Even though Z may be processed first, the watermark is subtracted from the maximum timestamp seen in the current query iteration. This means that 08:00 PM will be the time from which the watermark is subtracted, meaning 12:00 and 12:50 will be discarded.

From the documentation:

For a specific window starting at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T)


Would the only option to avoid data loss be to use a watermark longer than you expect the job to ever go down for

Not necessarily. Let's assume you set a maximum amount of data to be read per Kafka query to 100 items. If you read small batches, and you're reading serially from each partition, the maximum timestamp of each batch may not be the timestamp of the latest message in the broker, meaning you won't lose those messages.
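For example, with the Kafka source that kind of rate limiting can be configured with maxOffsetsPerTrigger. This is only a sketch; the servers and topic name below are placeholders:

val throttled = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "events")                       // placeholder topic
  .option("maxOffsetsPerTrigger", "100")               // cap on offsets read per micro-batch
  .load()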

Yuval Itzchakov
  • Sorry I'm not sure if I understand the first point; the `withWatermark` [documentation](https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/Dataset.html#withWatermark-java.lang.String-java.lang.String-) states that the max event time is set across all partitions in the current query. In that case shouldn't 20:00 be the time we subtract the watermark from, so 12:50 and 12:00 will be discarded? On the second point, that's true, throttling will help alleviate the issue. It still seems like a bit of a gamble though whether or not each batch will all fit in the watermark. – Ray J Aug 09 '17 at 21:12
  • @RayJ The watermark is a fixed value during the minibatch. In your example, since X, Y and Z are processed in the same minibatch, the watermark used for them would be 9:20am. After completion of that minibatch the watermark would be updated to 7pm. – nonsleepr Aug 09 '17 at 22:00
  • @RayJ The maximum event time is 12:10 pm, not 08:00 pm. Why would it subtract it from a value which isn't the maximum? – Yuval Itzchakov Aug 10 '17 at 09:08
  • @nonsleepr ah that's interesting, do you have a link to documentation or source code that mentions that? If you can post that as a separate answer I would be happy to accept it. – Ray J Aug 11 '17 at 04:38
  • @YuvalItzchakov I think we simply had a language issue, it goes 12pm (noon) then 1pm ... 8pm. Unfortunately the system doesn't make a whole lot of sense. :/ – Ray J Aug 11 '17 at 04:41
  • @RayJ Yeah, sorry, not sure why my brain was thinking 12 PM means 00:00. 8PM will be the maximum as you've said. – Yuval Itzchakov Aug 11 '17 at 11:34
  • @YuvalItzchakov No worries, that would be a lot more logical honestly – Ray J Aug 11 '17 at 21:34