My Spark Structured Streaming application runs for a few hours before it fails with this error:
java.lang.IllegalStateException: Partition [partition-name] offset was changed from 361037 to 355053, some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
The offsets are different every time, of course, but the first one is always larger than the second. The topic data can't have expired: its retention period is 5 days, I recreated the topic only yesterday, and the error still occurred today. The only way I've found to recover is to delete the checkpoints.
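For context, the query is wired up roughly like this (a simplified sketch; the topic name, bootstrap servers, and paths are placeholders rather than my real values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("my-streaming-app").getOrCreate()

// Kafka source; failOnDataLoss is left at its default of true
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "my-topic")                   // placeholder
  .load()

// Sink whose checkpoint directory I have to delete to recover
val query = df.writeStream
  .format("parquet")
  .option("path", "/data/output")                    // placeholder
  .option("checkpointLocation", "/data/checkpoints") // placeholder
  .start()

query.awaitTermination()
```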
Spark's Kafka integration guide says the following about the failOnDataLoss option:
Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. Batch queries will always fail if it fails to read any data from the provided offsets due to lost data.
But I can't find any further information on when this can be considered a false alarm, so I don't know whether it's safe to set failOnDataLoss to false or if there's an actual problem with my cluster (in which case we'll actually be losing data).
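If it does turn out to be a false alarm, I assume disabling the check would just be a matter of adding the option to the Kafka source, something like this:

```scala
// As I understand the docs, this makes the query log the problem and continue
// past the missing offsets instead of throwing IllegalStateException.
.option("failOnDataLoss", "false")
```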
UPDATE: I've investigated the Kafka logs, and in all cases where Spark has failed, Kafka has logged several messages like this (one per Spark consumer, I'd assume):
INFO [GroupCoordinator 1]: Preparing to rebalance group spark-kafka-...-driver-0 with old generation 1 (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)