My Spark Structured Streaming application runs for a few hours before it fails with this error:
java.lang.IllegalStateException: Partition [partition-name] offset was changed from 361037 to 355053, some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
The offsets are different every time, of course, but the first one is always larger than the second. The topic data can't have expired: its retention period is 5 days, I recreated the topic only yesterday, and the error still occurred today. The only way I've found to recover is to delete the checkpoints.
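For context, the query is wired up roughly like this (a simplified sketch; the topic name, bootstrap servers, and paths are placeholders rather than my real values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("my-streaming-app").getOrCreate()

// Kafka source; failOnDataLoss is left at its default of true
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "my-topic")                   // placeholder
  .load()

// Sink whose checkpoint directory I have to delete to recover
val query = df.writeStream
  .format("parquet")
  .option("path", "/data/output")                    // placeholder
  .option("checkpointLocation", "/data/checkpoints") // placeholder
  .start()

query.awaitTermination()
```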
Spark's Kafka integration guide says the following about the failOnDataLoss option:
Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. Batch queries will always fail if it fails to read any data from the provided offsets due to lost data.
But I can't find any further information on when this can be considered a false alarm, so I don't know whether it's safe to set failOnDataLoss to false or if there's an actual problem with my cluster (in which case we'll actually be losing data).
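If it does turn out to be a false alarm, I assume disabling the check would just be a matter of adding the option to the Kafka source, something like this:

```scala
// As I understand the docs, this makes the query log the problem and continue
// past the missing offsets instead of throwing IllegalStateException.
.option("failOnDataLoss", "false")
```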
UPDATE: I've investigated the Kafka logs, and in all cases where Spark has failed, Kafka has logged several messages like this (one per Spark consumer, I'd assume):
INFO [GroupCoordinator 1]: Preparing to rebalance group spark-kafka-...-driver-0 with old generation 1 (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)