
I'm having a problem with a streaming job that uses Trigger.Once. When I run it for the first time, it works fine: it writes all available data from the path and finishes. But on the next day, when there is new data available in the original path, the stream doesn't see it and finishes the query before writing any data. I'm using Auto Loader with an SQS queue. The checkpoint path is right and contains the offsets and commits folders.
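
(For context, a minimal sketch of this kind of job; the paths, input format, and output sink below are placeholders, not taken from the question.)

import org.apache.spark.sql.streaming.Trigger

// Auto Loader source in file-notification mode (backed by SQS on AWS)
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")           // assumed input format
  .option("cloudFiles.useNotifications", "true") // SQS-backed notifications
  .load("s3://my-bucket/input/")                 // placeholder path

// Trigger.Once: process whatever is available now, then stop
df.writeStream
  .format("delta")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/job/") // placeholder path
  .trigger(Trigger.Once())
  .start("s3://my-bucket/output/")                                 // placeholder path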

Alex Ott

1 Answer


To get the behavior you expect, run the query without a trigger (the default), like below:

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  .start()

The difference between the available triggers is shown below:

import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  .start()

// ProcessingTime trigger with two-seconds micro-batch interval
df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

// One-time trigger
df.writeStream
  .format("console")
  .trigger(Trigger.Once())
  .start()

// Continuous trigger with one-second checkpointing interval
df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
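
For completeness, here is a Trigger.Once query with an explicit checkpoint location; each run processes the data that arrived since the offsets committed in the checkpoint, then stops. This is a sketch, and the sink format and paths are placeholders:

import org.apache.spark.sql.streaming.Trigger

// Each run picks up what is new since the last committed offsets in the
// checkpoint, then stops; rerunning the same query continues from there.
df.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoints/my-query") // placeholder path
  .option("path", "/tmp/output/my-query")                    // placeholder path
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()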

You can refer to the triggers section of the Structured Streaming programming guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers

Warren Zhu