
I have a long-running Structured Streaming job which consumes several Kafka topics and aggregates over a sliding window. I need to understand how checkpoints are managed/cleaned up within HDFS.

The job runs fine and I am able to recover from a failed step with no data loss; however, I can see the HDFS utilisation increasing day by day. I cannot find any documentation on how Spark manages/cleans up the checkpoints. Previously the checkpoints were stored on S3, but this turned out to be quite costly due to the large number of small files being read/written.

# Write the windowed aggregates back out to Kafka; offsets and state are checkpointed to HDFS
query = formatted_stream.writeStream \
                        .format("kafka") \
                        .outputMode(output_mode) \
                        .option("kafka.bootstrap.servers", bootstrap_servers) \
                        .option("checkpointLocation", "hdfs:///path_to_checkpoints") \
                        .start()

From what I understand, the checkpoints should be cleaned up automatically, yet after several days I just see my HDFS utilisation increasing linearly. How can I ensure the checkpoints are managed so that HDFS does not run out of space?

The accepted answer to Spark Structured Streaming Checkpoint Cleanup states that Structured Streaming should deal with this issue, but does not explain how, or how it can be configured.

Matthew Jackson
  • Possible duplicate of [Spark Structured Streaming Checkpoint Cleanup](https://stackoverflow.com/questions/48235955/spark-structured-streaming-checkpoint-cleanup) – thebluephantom Jan 07 '19 at 10:13

1 Answer


As you can see in the code for Checkpoint.scala, the checkpointing mechanism retains only the last 10 checkpoints, so that should not be a problem over a couple of days.
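As a side note, if the concern is the number of batch metadata files kept under the checkpoint location, Structured Streaming exposes spark.sql.streaming.minBatchesToRetain (default 100) to control how many batches are kept recoverable. A minimal sketch of setting it; the value 50 is purely illustrative, and bear in mind this bounds metadata files, not the state itself:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("checkpoint-retention-sketch")
         # Minimum number of batches whose metadata must be retained for
         # recovery (default 100); smaller values mean fewer files in HDFS.
         .config("spark.sql.streaming.minBatchesToRetain", "50")
         .getOrCreate())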

A usual reason for this is that the RDDs you are persisting to disk are themselves growing linearly with time. This may be due to some RDDs being persisted that you don't actually care about.

You need to make sure that your use of Structured Streaming does not persist RDDs whose size grows without bound. For example, if you want to calculate an exact count of distinct elements over a column of a Dataset, you need to know the full input data (which means persisting data that grows linearly with time, if you have a constant influx of data per batch). If you can instead work with an approximate count, you can use algorithms such as HyperLogLog++, which typically require much less memory in exchange for some loss of precision.
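For illustration, a sketch of the approximate route in PySpark; the column names (event_time, user_id), window sizes, watermark and error tolerance are assumptions, not taken from the question:

from pyspark.sql import functions as F

# Approximate distinct count (HyperLogLog++) keeps a bounded sketch per window,
# unlike an exact distinct count which must remember every value ever seen.
approx_counts = (formatted_stream
                 .withWatermark("event_time", "1 hour")
                 .groupBy(F.window("event_time", "10 minutes", "5 minutes"))
                 .agg(F.approx_count_distinct("user_id", rsd=0.05)
                       .alias("approx_unique_users")))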

Keep in mind that if you are using Spark SQL, you may want to inspect what your optimized queries turn into, as this can be related to how Catalyst optimizes your query. If you are not using Spark SQL, Catalyst might well have optimized your query for you if you were.
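If you want to see what Catalyst actually produces, the plans can be printed directly; a small sketch reusing the names from the question:

# Parsed, analyzed, optimized and physical plans of the streaming DataFrame
formatted_stream.explain(extended=True)

# Physical plan of the running streaming query (available once a batch has run)
query.explain()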

In any case, a further thought: if the checkpoint usage is increasing with time, this should be mirrored by your streaming job consuming more RAM linearly with time, since the checkpoint is just a serialization of the Spark context (plus constant-size metadata). If that is the case, check SO for related questions, such as why does memory usage of Spark Worker increase with time?

Also, be mindful of which RDDs you call .persist() on, and with which storage level, so that you can persist RDDs to disk and only load parts of them into the Spark context at a time.
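For example, a small self-contained sketch of picking a storage level explicitly (the DataFrame here is just a stand-in):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-level-sketch").getOrCreate()
df = spark.range(1000)  # stand-in for a real DataFrame

# Keep the cached data off the executor heap entirely; partitions are
# read back from local disk on demand instead of being pinned in memory.
df.persist(StorageLevel.DISK_ONLY)

# Alternatively, keep what fits in memory and spill the rest to disk:
# df.persist(StorageLevel.MEMORY_AND_DISK)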

ssice
  • Thanks for the insight. Regarding the persisted data, I perform an aggregation over a window and id field (min, max, first, sum, count, avg, collect_set and approx_count_distinct). I had assumed that this would not be persisted past the window + watermark? Similarly I use drop_duplicates() on a subset of columns of the watermarked stream. Could this be causing the checkpoints to be persisted? – Matthew Jackson Jan 07 '19 at 12:00
  • I did some further investigation into this and found the root cause was a call to dropDuplicates(). I had misunderstood the documentation around watermarking, resulting in the snapshots growing incrementally over time. Thanks again for the insight. – Matthew Jackson Jan 14 '19 at 16:11
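For readers who hit the same issue: a minimal sketch of the pattern that keeps dropDuplicates state bounded, assuming the watermarked event-time column is included in the dedup subset (column names and durations are illustrative, not from the question):

# Without a watermark, dropDuplicates must remember every key it has ever
# seen, so state (and the checkpointed snapshots) grows without bound.
# Including the watermarked event-time column lets Spark expire old state.
deduped = (formatted_stream
           .withWatermark("event_time", "1 hour")
           .dropDuplicates(["id", "event_time"]))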