
I have an Apache Flink application deployed on Kinesis Data Analytics.

Payload schema processed by the application (simplified version):

{
  id: String = uuid (each request gets one),
  category: String = uuid (we have 10 of these),
  org_id: String = uuid (we have 1000 of these),
  count: Integer (some integer)
}

This application is doing the following:

  1. Source: Consume from a single Kafka topic (128 partitions)
  2. Filter: Do some filtering for invalid records (nothing fancy here)
  3. Key-by: based on 2 fields in the input, Tuple2.of(org_id, category).
  4. Flatmap (de-duplication): maintains a Guava cache (size 30k, 5-minute expiration). A single String id field (the id in the payload) is stored in the cache. Each time a record comes in, we check if its id is present in the cache. If it is present, the record is skipped. Else the id is added to the cache and the record is forwarded downstream (see the sketch after this list).
  5. Rebalance: just to make sure some sinks don't remain idle while the others take all the load.
  6. Sink: Writes to S3 (and this S3 has versioning enabled).
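
Conceptually, the de-duplication in step 4 looks roughly like this. This is only a minimal sketch, assuming a RichFlatMapFunction with a transient Guava cache; the Payload class and its getters are simplified stand-ins for my actual schema, and the real code is in the gist linked below.

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.util.concurrent.TimeUnit;

// Minimal sketch of the de-duplication step: the cache lives on the JVM heap,
// is marked transient, and is NOT kept in Flink state.
public class DeduplicatingFlatMap extends RichFlatMapFunction<Payload, Payload> {

    private transient Cache<String, Boolean> seenIds;

    @Override
    public void open(Configuration parameters) {
        seenIds = CacheBuilder.newBuilder()
                .maximumSize(30_000)
                .expireAfterWrite(5, TimeUnit.MINUTES)
                .build();
    }

    @Override
    public void flatMap(Payload value, Collector<Payload> out) {
        // Skip the record if we have seen this id within the cache window,
        // otherwise remember the id and forward the record downstream.
        if (seenIds.getIfPresent(value.getId()) == null) {
            seenIds.put(value.getId(), Boolean.TRUE);
            out.collect(value);
        }
    }
}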

This is deployed with, in KDA terms, a parallelism of 64 and a parallelism-per-KPU of 2. That means a cluster of 32 nodes, each with 1 vCPU and 4 GB of RAM.

All of the issues mentioned below happen at 2000 rps.

Now to the issue I am facing:

  1. My lastCheckpointSize seems to be 471 MB. This seems very high given that we are not using any state (note: the Guava cache is not stored in Flink state; see the gist with just the interesting parts).


  2. I see heavy backpressure. Because of this, the record_lag_max builds up.


I am unable to understand why my checkpoint size is so huge since I am not using any state. I was thinking it would just be the Kafka offsets processed by each of these stages, but 471 MB seems too big for that.


Is this huge checkpoint responsible for the backpressure I am facing? When I look at the S3 metrics it looks like about 20 ms per write, which I assume is not too much.

I am seeing a few rate limits on S3, but from my understanding this seems to be pretty low compared to the number of writes I am making.

Any idea why I am facing this backpressure and also why my checkpoints are so huge?

(Edit, added as an afterthought) Now that I think about it, could not marking the LoadingCache as `transient` in my DeduplicatingFlatmap be playing a role in the huge checkpoint size?

Edit 2: Adding details related to my sink:

I am using a StreamingFileSink:

StreamingFileSink
  .forRowFormat(new Path(s3Bucket), new JsonEncoder<>())
  .withBucketAssigner(bucketAssigner)
  .withRollingPolicy(DefaultRollingPolicy.builder()
                .withRolloverInterval(60000)
                .build())
  .build()

The JsonEncoder takes the object, converts it to JSON, and writes out the bytes like this: https://gist.github.com/vmohanan1/3ba3feeb6f22a5e34f9ac9bce20ca7bf
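
For context, the encoder is roughly equivalent to the following sketch (a minimal sketch assuming Jackson for the serialization; the gist above has the real code):

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.Encoder;

import java.io.IOException;
import java.io.OutputStream;

// Sketch of a row-format encoder: each record is written out as one JSON line.
public class JsonEncoder<T> implements Encoder<T> {

    private transient ObjectMapper mapper;

    @Override
    public void encode(T element, OutputStream stream) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        stream.write(mapper.writeValueAsBytes(element));
        stream.write('\n');
    }
}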

The BucketAssigner gets the product and org from the schema and appends the processing time from the context, like this: https://gist.github.com/vmohanan1/8d443a419cfeb4cb1a4284ecec48fe73
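
Again for context, the bucket assigner is conceptually similar to this sketch (the getter names and the use of "product" for the category field are simplifications; the gist above has the real code):

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: builds a bucket id like "product/org/2022-03-10-11" from the record
// plus the processing time taken from the assigner context.
public class ProductOrgBucketAssigner implements BucketAssigner<Payload, String> {

    private static final DateTimeFormatter HOUR_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

    @Override
    public String getBucketId(Payload element, BucketAssigner.Context context) {
        String hour = HOUR_FORMAT.format(Instant.ofEpochMilli(context.currentProcessingTime()));
        return element.getProduct() + "/" + element.getOrgId() + "/" + hour;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}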

  • Checkpoint sizes are not a clear indication of live state size, and 471MB is only about 15 MB per RocksDB instance. Perhaps the checkpoints are this large because the RocksDB instances haven't been compacted yet. On another note, which sink are you using to write to S3, and how is that configured? – David Anderson Mar 15 '22 at 20:08
  • Do you know what "last checkpoint size" is measuring? Is that the total size of the checkpoint, or the incremental amount written by an incremental checkpoint? – David Anderson Mar 15 '22 at 20:10
  • Hello David, updated the question with details of my sink (Edit 2). Your explanation about the checkpoint size makes total sense. My concern is also that the checkpoint duration is about 2.2M ms at peak checkpoint size. From the KDA docs: lastCheckpointSize: "You can use this metric to determine running application storage utilization." lastCheckpointDuration: "This metric measures the time it took to complete the most recent checkpoint." – Vinod Mohanan Mar 16 '22 at 01:09
  • One another question I have is, I do a keyBy Tuple2.of(org_id, category). There can be 1000 org_ids and 10 categories. So basically 10k combinations. Are these too many keys and could this be the source of backpressure? – Vinod Mohanan Mar 16 '22 at 10:00
  • No, it's not a problem to have lots of keys -- in fact, it's generally preferred. You are more likely to have problems with hot keys and data skew if the number of keys is too small relative to the parallelism. – David Anderson Mar 16 '22 at 11:18
  • Ok, thought of checking because after the keyBy I run a flat map which has a Guava cache (not in state; it's basically transient). Anything you can spot in my sink configuration? – Vinod Mohanan Mar 16 '22 at 13:11
  • The cause of the backpressure is that S3 writes are getting throttled. Basically, in the StreamingFileSink I set the S3 bucket name, and I want the files to go to different directory structures within the bucket based on the data, so I use the BucketAssigner mentioned in the OP. The output of the bucket assigner will be "/product/org/2022-03-10-11". But Flink is making a HEAD request per prefix before each PUT, and that is causing the rate limit. Any idea why this would be the case? – Vinod Mohanan Mar 16 '22 at 22:12
