
I've got a Spark Structured Streaming application (Spark 2.4.5) that is consuming from Kafka. The application was down for a bit, but when I restarted it I got the error below.

I fully understand why I'm getting the error, and I'm OK with that, but I cannot seem to get around it. Based on the logs I see "Recovering from the earliest offset: 1234332978", but this doesn't seem to be happening. I've tried deleting the 'source' folder in my checkpoint location, which also didn't help.

My code uses a mapGroupsWithState function, so I have state data that I don't want to lose; deleting the entire checkpoint directory is therefore not my preferred approach. I have set:

.option("failOnDataLoss", false) .option("startingOffsets", "latest")

But it seems this only applies to new partitions.
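For reference, the read side of my query looks roughly like this (the broker address is a placeholder, not my real value):

    // Kafka source for the streaming query (sketch; broker address is a placeholder)
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
      .option("subscribe", "cmusstats")
      .option("startingOffsets", "latest")   // only honoured when no checkpoint exists, or for newly discovered partitions
      .option("failOnDataLoss", false)       // log a warning instead of failing the query on missing offsets
      .load()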

Is there a way to tell Spark to just accept that there are missing offsets and continue? Or some approach to delete the offset data manually without impacting the application 'state'?

20/07/29 01:02:40 WARN InternalKafkaConsumer: Cannot fetch offset 1215191190 (GroupId: spark-kafka-source-f9562fca-ab0c-4f7a-93c3-20506cbcdeb7--1440771761-executor, TopicPartition: cmusstats-0). 
Some data may have been lost because they are not available in Kafka any more; either the
 data was aged out by Kafka or the topic may have been deleted before all the data in the
 topic was processed. If you want your streaming query to fail on such cases, set the source
 option "failOnDataLoss" to "true".
    
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cmusstats-0=1215191190}
    at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:970)
    at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:490)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1259)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1187)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
    at org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
    at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
    at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
    at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
    at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
    at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
    at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
    at org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:64)
    at org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:500)
    at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.next(KafkaMicroBatchReader.scala:357)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
20/07/29 01:02:40 WARN InternalKafkaConsumer: Some data may be lost. Recovering from the earliest offset: 1234332978
20/07/29 01:02:40 WARN InternalKafkaConsumer: 
The current available offset range is AvailableOffsetRange(1234332978,1328165875).
 Offset 1215191190 is out of range, and records in [1215191190, 1215691190) will be
 skipped (GroupId: spark-kafka-source-f9562fca-ab0c-4f7a-93c3-20506cbcdeb7--1440771761-executor, TopicPartition: cmusstats-0). 
Some data may have been lost because they are not available in Kafka any more; either the
 data was aged out by Kafka or the topic may have been deleted before all the data in the
 topic was processed. If you want your streaming query to fail on such cases, set the source
 option "failOnDataLoss" to "true".
  • Does this answer your question? [Spark set to read from earliest offset - throws error on attempting to consumer an offset no longer available on Kafka](https://stackoverflow.com/questions/55907767/spark-set-to-read-from-earliest-offset-throws-error-on-attempting-to-consumer) – Michael Heil Jul 29 '20 at 05:02
  • @mike The solution there doesn't work for me, unfortunately. In Structured Streaming, the Kafka group.id and offsets are managed by Spark, so resetting them on the Kafka side isn't possible. That solution relates to Kafka DStreams. – caverman Jul 29 '20 at 16:13

1 Answer


It turns out that the Structured Streaming application recovered eventually. For a period of time, many 'Cannot fetch offset' errors were logged, but eventually processing continued from the earliest offset.

I cannot explain why so many of these errors were logged before processing resumed, but it did continue in the end.

  • Thanks for the update. I have the same issue. Can you give some indication on how long the period of time was? Are we talking hours? Days? – epa095 Apr 03 '23 at 12:12