Think about the AvailableNow trigger as a trigger dedicated to batch jobs that rely on streaming capabilities such as watermarks, stateful processing, or checkpoints. Now, when it comes to your question:
> I don't understand, however, how time windowing works across cluster restarts. For example, if I use a 1hr window and the cluster shuts down in the middle of the window, will I always get two batches for this window (which is what I'm observing)?
A few key points here:
- Structured Streaming mostly runs with blocking micro-batch semantics, i.e. each micro-batch takes all the data available at the time it starts and processes it
- fault tolerance is typically guaranteed by a checkpoint mechanism relying on object stores
- windowing is an example of a stateful operation, meaning that the checkpoint stores not only the offsets of the data source but also the state
- checkpoints are taken after each micro-batch and they're blocking (although there is an ongoing effort to reduce their overhead in https://issues.apache.org/jira/browse/SPARK-39591)
So to answer the question: if your cluster goes away in the middle of the window and Spark hasn't checkpointed it yet, then on restart you will reprocess the data of this window alongside the previously saved state. This may generate some duplicates in the sink.
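To make the replay scenario concrete, here is a toy model in plain Python. It is not Spark code and not how Spark implements checkpoints internally; the `run` function, the `checkpoint` dict, and the `sink` list are hypothetical stand-ins. It assumes the sink is written before the checkpoint is committed and that the sink is not idempotent.

```python
from collections import Counter

WINDOW_MINUTES = 60  # 1-hour tumbling window


def window_start(minute):
    """Map an event time (minutes since midnight) to the start of its window."""
    return minute - minute % WINDOW_MINUTES


def run(batches, checkpoint, sink, crash_in_batch=None):
    """Process all pending micro-batches; return True if the run finished."""
    # Restore the state saved by the last committed micro-batch.
    state = Counter(checkpoint["state"])
    for batch_id in range(checkpoint["offset"], len(batches)):
        touched = set()
        for minute in batches[batch_id]:
            w = window_start(minute)
            state[w] += 1
            touched.add(w)
        # Emit the updated windows to the (non-idempotent) sink.
        for w in sorted(touched):
            sink.append((batch_id, w, state[w]))
        if batch_id == crash_in_batch:
            return False  # failure after the sink write, before the checkpoint
        # Blocking checkpoint: offsets and state are committed together.
        checkpoint["offset"] = batch_id + 1
        checkpoint["state"] = dict(state)
    return True


# Three micro-batches whose events all fall into the 10:00-11:00 window.
batches = [[600, 610], [620, 630], [640, 650]]
checkpoint = {"offset": 0, "state": {}}
sink = []

finished_first = run(batches, checkpoint, sink, crash_in_batch=1)
finished_second = run(batches, checkpoint, sink)  # restart from the checkpoint
```

In this model the restarted run replays micro-batch 1 on top of the state saved after micro-batch 0, so the window counts stay correct, but the row written for micro-batch 1 lands in `sink` twice.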
Besides this consistency issue, there is another point you should be careful about with the AvailableNow trigger and stateful operations. If your operation relies on processing time and your cluster goes down, the correctness of the operation may be broken because the processing time will be different for the restarted job. For example:
- job starts at 10:00
- it fails at 10:30
- you restart it at 11:00
As a result, you may encounter incorrect behavior, such as invalidating state that was still valid for the run that started at 10:00.
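The timeline above can be sketched with a small plain-Python illustration. The 45-minute timeout, the `user-1` key, the date, and the `expired_keys` helper are all assumptions made up for the example; the real behavior depends on how your stateful operation defines its processing-time expiration.

```python
from datetime import datetime, timedelta

# Hypothetical processing-time timeout configured for the state.
TIMEOUT = timedelta(minutes=45)


def expired_keys(state, now):
    """Return the keys whose processing-time deadline has passed at `now`."""
    return sorted(key for key, deadline in state.items() if deadline <= now)


day = datetime(2024, 1, 1)  # arbitrary date, only the clock time matters
job_start = day.replace(hour=10)
failure_time = day.replace(hour=10, minute=30)
restart_time = day.replace(hour=11)

# State written by the run that started at 10:00: it expires at 10:45.
state = {"user-1": job_start + TIMEOUT}

# At the moment of the failure the state was still valid...
expired_at_failure = expired_keys(state, failure_time)
# ...but the restarted job sees a later processing time and expires it.
expired_at_restart = expired_keys(state, restart_time)
```

Here `expired_at_failure` is empty while `expired_at_restart` contains `user-1`: the same input data gives a different result only because the wall clock moved between the two runs.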