We wrote a Spark Streaming application that receives Kafka messages (backpressure enabled and spark.streaming.kafka.maxRatePerPartition set), maps the DStream into a Dataset, and writes this Dataset to Parquet files (inside DStream.foreachRDD) at the end of every batch.
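For context on the rate limit: spark.streaming.kafka.maxRatePerPartition is a per-second, per-partition cap, so it bounds how many records a single batch can pull. A rough sketch of that arithmetic in plain Python follows; the function name and all numbers are hypothetical and are not our real settings.

```python
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_interval_s):
    """Upper bound on the records one batch can read from Kafka under the rate cap.

    max_rate_per_partition is in records per second per Kafka partition.
    """
    return max_rate_per_partition * num_partitions * batch_interval_s


# Hypothetical values for illustration only: 1000 records/s per partition,
# 10 Kafka partitions, 30-second batch interval.
print(max_records_per_batch(1000, 10, 30))
```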
At the beginning, everything seems fine: Spark Streaming processing time is around 10 seconds for a 30-second batch interval. The number of Kafka messages produced is a bit less than the number of messages we consume in our Spark application, so no backpressure is needed (in the beginning). The Spark job creates many Parquet files inside our Spark warehouse HDFS directory (x partitions => x Parquet files per batch), as expected.
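To give a sense of how many files accumulate in that directory over time, here is a small back-of-the-envelope calculation in plain Python. The helper name is made up, and the value of 10 partitions is only a placeholder for x, which depends on the topic.

```python
def files_written(hours, batch_interval_s, files_per_batch):
    """Number of files written after `hours` if every batch writes `files_per_batch` files."""
    batches = int(hours * 3600 // batch_interval_s)
    return batches * files_per_batch


# 30-second batches as in our job; 10 files per batch is a placeholder for x.
for hours in (12, 14):
    print(hours, "hours:", files_written(hours, 30, 10), "files")
```

With these placeholder numbers, the directory holds well over ten thousand files by the time the slowdown starts.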
Everything runs just fine for hours, but after around 12-14 hours our processing time increases rapidly, e.g. it jumped from the normal 10 seconds to >1 minute from one batch to the next. This of course leads to a huge batch queue after a short time.
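To illustrate how quickly the queue builds up once processing time exceeds the batch interval, here is a simplified model in plain Python. It assumes batches are processed strictly one after another, and the 70-second figure is only a stand-in for ">1 minute"; the function name is hypothetical.

```python
def scheduling_delays(batch_interval_s, processing_time_s, num_batches):
    """Return how long (in seconds) each batch waits in the queue before it starts."""
    delays = []
    free_at = 0  # time at which the previous batch finishes processing
    for i in range(num_batches):
        ready_at = (i + 1) * batch_interval_s  # batch is generated at the end of its interval
        start = max(ready_at, free_at)         # it cannot start before the previous one is done
        delays.append(start - ready_at)
        free_at = start + processing_time_s
    return delays


print(scheduling_delays(30, 10, 4))  # healthy case: processing time below the interval
print(scheduling_delays(30, 70, 4))  # degraded case: processing time above the interval
```

In this model the delay stays at zero while processing takes 10 seconds, but grows by 40 seconds with every batch once processing takes 70 seconds.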
We saw similar results for 5-minute batches (processing time is around 1.5 minutes here and suddenly increases to >10 minutes per batch after a period of time).
We also saw similar results when we wrote ORC instead of Parquet files.
Since the batches can run independently, we do not use the checkpointing feature of Spark Streaming.
We're using the Hortonworks Data Platform 3.1.4 with Spark 2.3.2 and Kafka 2.0.0.
Is this a known problem in Spark Streaming? Are there any dependencies on "old" batches for Parquet/ORC tables? Or is this a general file-based or Hadoop-based problem? Thanks for your help.