I don't need a constantly running cluster to process my data, so I want to use the available-now trigger, as the Spark documentation suggests:

This is useful in scenarios where you want to periodically spin up a cluster, process everything that is available since the last period, and then shut down the cluster. In some cases, this may lead to significant cost savings.

I don't understand, however, how time windowing works across cluster restarts. For example, if I use a 1hr window and the cluster shuts down in the middle of the window, will I always get two batches for this window (which is what I'm observing)? Is there a way to get Spark to maintain state while the cluster is offline? The reason this is a big issue in my case is that I need to aggregate with a distinct count, which, unlike min()/max(), sum(), or count(), cannot be further aggregated across multiple batches.
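
For concreteness, here is a trimmed-down sketch of the kind of query I'm running. The source, paths, and column names are illustrative, and approx_count_distinct stands in for my distinct count, since exact distinct aggregations aren't supported on streaming DataFrames:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("distinct-count-demo").getOrCreate()
import spark.implicits._

// Illustrative source and schema; the real job reads from cloud storage.
val events = spark.readStream
  .schema("userId STRING, eventTime TIMESTAMP")
  .json("s3://my-bucket/events/") // hypothetical path

val counts = events
  .withWatermark("eventTime", "1 hour")
  .groupBy(window($"eventTime", "1 hour"))
  // Exact countDistinct is not supported on streaming aggregations,
  // so approx_count_distinct stands in for it here.
  .agg(approx_count_distinct("userId").as("uniqueUsers"))

counts.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/agg/")               // hypothetical sink
  .option("checkpointLocation", "s3://my-bucket/chk/") // hypothetical path
  .outputMode("append")
  .trigger(Trigger.AvailableNow())
  .start()
  .awaitTermination()
```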

Dmitry B.

1 Answer

Think of the AvailableNow trigger as a trigger dedicated to batch jobs that rely on streaming capabilities such as watermarks, stateful processing, or checkpoints. Now, when it comes to your question:

I don't understand, however, how time windowing works across cluster restarts. For example, if I use a 1hr window and the cluster shuts down in the middle of the window, will I always get two batches for this window (which is what I'm observing)?

A few key points here:

  • Structured Streaming runs mainly with a blocking micro-batch semantic, i.e. it takes all data available at the time of the micro-batch and processes it
  • fault-tolerance is typically guaranteed by a checkpoint mechanism relying on object stores (see the sketch after this list)
  • windowing is an example of a stateful operation, meaning that the checkpoint stores not only the offsets of the data source but also the state
  • checkpoints are taken after each micro-batch and they're blocking (although there is an ongoing effort to reduce their overhead; see https://issues.apache.org/jira/browse/SPARK-39591)
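
To make the checkpoint mechanics concrete, here is a minimal sketch; the paths are illustrative, and `counts` stands for a stateful (e.g. windowed) streaming DataFrame like the one in your question:

```scala
import org.apache.spark.sql.streaming.Trigger

// The checkpoint location is what lets a restarted run resume from the
// last committed micro-batch instead of starting from scratch.
counts.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/agg/")
  .option("checkpointLocation", "s3://my-bucket/chk/")
  .trigger(Trigger.AvailableNow())
  .start()
  .awaitTermination()

// After a few micro-batches the checkpoint directory contains, among others:
//   chk/offsets/  <- source offsets, written before a micro-batch runs
//   chk/commits/  <- markers written after a micro-batch fully succeeds
//   chk/state/    <- the state store files, e.g. the open 1hr windows
// On restart, Spark replays any micro-batch that has an offsets entry but
// no matching commit, which is how the same window can be emitted twice.
```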

So, to answer the question: if your cluster goes away in the middle of the window and Spark hasn't checkpointed it yet, then when you restart, you will reprocess the data of this window alongside the previously saved state. This may result in duplicates in the sink.
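
If those duplicates matter downstream, one mitigation is to make the write idempotent per window, so a replayed micro-batch overwrites instead of appending. A hedged sketch, assuming the `spark` session and the windowed `counts` stream from the question, plus Spark's dynamic partition overwrite mode; paths and column names are mine:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

// With dynamic mode, only the partitions touched by a write are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

counts
  .withColumn("window_start", col("window.start"))
  .writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Overwriting the window's partition makes a replayed micro-batch
    // harmless: the second attempt rewrites the same rows.
    batch.write
      .mode("overwrite")
      .partitionBy("window_start")
      .parquet("s3://my-bucket/agg/")
  }
  .option("checkpointLocation", "s3://my-bucket/chk/")
  .trigger(Trigger.AvailableNow())
  .start()
  .awaitTermination()
```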

Besides this consistency aspect, there is another point you should be careful about with the AvailableNow trigger and stateful operations. If your operation relies on processing time and your cluster goes down, correctness may break because the processing time for the restarted job will change. For example:

  • job starts at 10:00
  • it fails at 10:30
  • you restart it at 11:00

As a result, you may encounter incorrect behavior, such as invalidating state that was still valid for the 10 o'clock run.
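
Where you can, derive the stateful logic from event time instead of processing time, so the restart time doesn't matter. A minimal contrast, assuming an `events` stream with an `eventTime` column as in the question:

```scala
import org.apache.spark.sql.functions._

// Fragile across restarts: the window comes from the wall clock of
// whichever run happens to process the record, so a record processed
// before the crash and one replayed after it can land in different windows.
val byProcessingTime = events
  .withColumn("procTime", current_timestamp())
  .withWatermark("procTime", "10 minutes")
  .groupBy(window(col("procTime"), "1 hour"))
  .count()

// Stable across restarts: the window comes from the record itself,
// so the same record always lands in the same window.
val byEventTime = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "1 hour"))
  .count()
```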

Bartosz Konieczny