
We wrote a Spark Streaming application that receives Kafka messages (backpressure enabled and spark.streaming.kafka.maxRatePerPartition set), maps each DStream batch into a Dataset, and writes the resulting Dataset to Parquet files (inside DStream.foreachRDD) at the end of every batch.

At the beginning, everything seems fine: the Spark Streaming processing time is around 10 seconds for a 30-second batch interval. The rate of produced Kafka messages is slightly lower than the rate at which we consume them in our Spark application, so no backpressure is needed (at first). The Spark job creates many Parquet files inside our Spark warehouse HDFS directory (x partitions => x Parquet files per batch), as expected.
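
For context, here is a minimal sketch of the pipeline described above (simplified, not our production code; the topic name, Event schema, broker address, HDFS path, and the maxRatePerPartition value are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder schema for the Kafka payload.
case class Event(key: String, value: String)

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaToParquet")
      .config("spark.streaming.backpressure.enabled", "true")
      .config("spark.streaming.kafka.maxRatePerPartition", "1000") // placeholder value
      .getOrCreate()
    import spark.implicits._

    // 30-second batch interval, as described above.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092", // placeholder
      "key.deserializer"  -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"          -> "parquet-writer",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Map each batch to a Dataset and append it as Parquet files.
    stream.foreachRDD { rdd =>
      val ds = rdd.map(r => Event(r.key, r.value)).toDS()
      ds.write.mode("append").parquet("hdfs:///spark-warehouse/events") // placeholder path
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```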

Everything runs just fine for hours, but after around 12-14 hours our processing time increases rapidly, e.g. it jumped from the normal ~10 seconds to more than 1 minute from one batch to the next. This of course leads to a huge batch queue after a short time.

We saw similar results with a 5-minute batch interval (processing time is around 1.5 minutes there and suddenly increases to more than 10 minutes per batch after some time).

We also saw similar results when writing ORC files instead of Parquet.

Since the batches can run independently, we do not use the checkpointing feature of Spark Streaming.
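
(For reference, enabling checkpointing would only mean setting a checkpoint directory on the StreamingContext; the path below is a placeholder and this is not part of our job.)

```scala
// Not used in our job, shown only for reference; the HDFS path is a placeholder.
ssc.checkpoint("hdfs:///checkpoints/kafka-to-parquet")
```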

We're using the Hortonworks Data Platform 3.1.4 with Spark 2.3.2 and Kafka 2.0.0.

Is this a known problem in Spark Streaming? Are there any dependencies on "old" batches for Parquet / ORC tables? Or is this a general file-based or Hadoop-based problem? Thanks for your help.

D. Müller
  • _"batches can run independently"_ - have you checked that? Because it certainly sounds like without checkpointing your DAG is growing uncontrollably. – mazaneicha Apr 21 '20 at 22:07
  • How / where can we check this? Is there a way in the Spark UI? – D. Müller Apr 22 '20 at 05:10
  • For us, it looks like the runtime depends on the number of Parquet / ORC files already written. Is there a known issue with appending to an existing Spark table? – D. Müller Apr 22 '20 at 06:45
  • What number of files are we talking about? The issue I am aware of is mentioned in https://spark.apache.org/docs/latest/streaming-programming-guide.html _"checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects"_ and discussed elsewhere on SO, for example https://stackoverflow.com/questions/31694282/running-in-deadlock-while-doing-streaming-aggregations-from-kafka. That said, I'd be very interested myself if you find what's wrong :) – mazaneicha Apr 22 '20 at 14:06
  • We're still searching for the cause. Regarding the huge number of Parquet files, we found the following interesting issue: https://issues.apache.org/jira/browse/SPARK-21177. We will examine both cases (DAG growing too big and the number of files). We're talking about ~5000 (small) Parquet files (we'll also try to make them larger by using a higher batch interval). – D. Müller Apr 22 '20 at 14:14

0 Answers