Spark Structured Streaming with Kafka throws an error after running for a while

Question

I am observing weird behaviour while running a Spark Structured Streaming program. I am using an S3 bucket for metadata checkpointing. The Kafka topic has 310 partitions.

When I start the streaming job for the first time, Spark creates a new file named after the batch ID in the offsets directory at the checkpoint location each time a batch completes. After a few batches complete successfully, the Spark job fails after a few retries with the following warning:

"WARN KafkaMicroBatchReader:66 - Set(logs-2019-10-04-77, logs-2019-10-04-85, logs-2019-10-04-71, logs-2019-10-04-93, logs-2019-10-04-97, logs-2019-10-04-101, logs-2019-10-04-89, logs-2019-10-04-81, logs-2019-10-04-103, logs-2019-10-04-104, logs-2019-10-04-102, logs-2019-10-04-98, logs-2019-10-04-94, logs-2019-10-04-90, logs-2019-10-04-74, logs-2019-10-04-78, logs-2019-10-04-82, logs-2019-10-04-86, logs-2019-10-04-99, logs-2019-10-04-91, logs-2019-10-04-73, logs-2019-10-04-79, logs-2019-10-04-87, logs-2019-10-04-83, logs-2019-10-04-75, logs-2019-10-04-92, logs-2019-10-04-70, logs-2019-10-04-96, logs-2019-10-04-88, logs-2019-10-04-95, logs-2019-10-04-100, logs-2019-10-04-72, logs-2019-10-04-76, logs-2019-10-04-84, logs-2019-10-04-80) are gone. Some data may have been missed. Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "false"."

The weird thing is that the previous batch's offset file contains partition info for all 310 partitions, but the current batch reads only some of them (see the warning above). I reran the job with ".option("failOnDataLoss", false)" and got the same warning, this time without the job failing. Spark processed the correct offsets for a few partitions, but for the rest it read from the starting offset (0). There were no connection issues between Spark and Kafka when this error occurred (we also checked the Kafka logs).
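To pin down exactly which partitions disappear or restart from 0 between two batches, the source lines of two consecutive offset files can be diffed. The following is only a minimal sketch in plain Scala, with no Spark dependency. It assumes each offset file contains a source line of the form {"topic":{"partition":offset,...}} for a single topic. OffsetDiff and its methods are hypothetical helpers, not part of Spark, and the offsets in main are made-up sample values.

```scala
// Hypothetical diagnostic helper; not part of Spark's API.
// Assumes a single-topic source line such as {"topic":{"0":123,"1":456}}.
object OffsetDiff {
  // Matches "partition":offset pairs. The topic key is followed by '{',
  // not by a number, so it is never matched.
  private val Pair = "\"(\\d+)\"\\s*:\\s*(-?\\d+)".r

  // Extracts partition -> offset from one source line of an offset file.
  def parse(line: String): Map[Int, Long] =
    Pair.findAllMatchIn(line)
      .map(m => m.group(1).toInt -> m.group(2).toLong)
      .toMap

  // Partitions present in the previous batch but absent from the current one.
  def missing(prev: Map[Int, Long], curr: Map[Int, Long]): Seq[Int] =
    (prev.keySet -- curr.keySet).toSeq.sorted

  // Partitions whose offset moved backwards (for example, reset to 0).
  def regressed(prev: Map[Int, Long], curr: Map[Int, Long]): Seq[Int] =
    curr.collect { case (p, o) if prev.get(p).exists(_ > o) => p }.toSeq.sorted

  def main(args: Array[String]): Unit = {
    // Made-up sample lines; paste the real lines from two offset files here.
    val prev = parse("""{"logs-2019-10-04":{"70":1500,"71":1600,"72":1700}}""")
    val curr = parse("""{"logs-2019-10-04":{"70":1550,"71":0}}""")
    println(s"missing: ${missing(prev, curr)}")
    println(s"regressed: ${regressed(prev, curr)}")
  }
}
```

With the sample values above, partition 72 would be reported as missing and partition 71 as regressed. Running the same comparison on the real files should show whether the partitions named in the warning are the ones that vanish from the offset file or the ones that fall back to 0.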

Could someone help with this? Am I doing something wrong or missing something?

Below is the read and write stream code snippet.

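import org.apache.spark.sql.{Dataset, Row, SaveMode}
import org.apache.spark.sql.streaming.Trigger

// Read stream: subscribe to the topic(s) named by the `logs` variable.
val kafkaDF = ss.readStream.format("kafka")
    .option("kafka.bootstrap.servers", kafkaBrokers /*"localhost:9092"*/)
    .option("subscribe", logs)
    .option("fetchOffset.numRetries", 5)
    .option("maxOffsetsPerTrigger", 30000000)
    .load()

// Write stream: logDS is presumably derived from kafkaDF (transformation not shown).
val query = logDS
    .writeStream
    .foreachBatch {
      (batchDS: Dataset[Row], batchId: Long) =>
         // Append each micro-batch to the Hive table as partitioned Parquet.
         batchDS
           .repartition(noofpartitions, batchDS.col("abc"), batchDS.col("xyz"))
           .write
           .mode(SaveMode.Append)
           .partitionBy("date", "abc", "xyz")
           .format("parquet")
           .saveAsTable(hiveTableName /*"default.logs"*/)
    }
    .trigger(Trigger.ProcessingTime(1800 + " seconds"))
    .option("checkpointLocation", s3bucketpath)
    .start()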

Thanks in advance.

Can you check whether those messages are still available in Kafka? I mean, look at the last committed offsets. - Piyush Patel, Oct 07 '19 at 17:58
Hi Piyush P, thanks for the response. Yes, those messages are available in Kafka. There is no option for committing offsets to Kafka when using Spark Structured Streaming. - unknown_k, Oct 09 '19 at 05:54
I agree, but Kafka still auto-commits those messages, so you can still look at the message offsets available now. You can also check the message lag. - Piyush Patel, Oct 09 '19 at 14:46
