One of our Apache Beam jobs running on the FlinkRunner is exhibiting odd checkpoint size behavior. The state backend is filesystem-based. The job receives traffic once a day for a period of about an hour and is then idle until the next batch of data arrives.
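For reference, the backend is configured along these lines in flink-conf.yaml (the checkpoint directory below is a placeholder, not our real path):

```yaml
# Filesystem state backend; checkpoint path shown here is illustrative.
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints
```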
The pipeline uses no windowing strategy. It simply reads from a source, combines different values from that source, and writes the result to a sink. The only state we record is in the unbounded source.
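To make the shape concrete, the pipeline is roughly the sketch below. This is a minimal stand-in, not our actual code: `GenerateSequence` substitutes for our custom unbounded source, the per-record `MapElements` substitutes for our combining step, and the logging `ParDo` is the "dump" sink we used when testing without the database:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PipelineShapeSketch {
  public static void main(String[] args) {
    // Run with e.g. --runner=FlinkRunner --checkpointingInterval=60000
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadSource", GenerateSequence.from(0))      // stand-in for our unbounded source
     .apply("CombineValues", MapElements                 // per-record combination; no windowing,
         .into(TypeDescriptors.strings())                // so everything is in the global window
         .via((Long n) -> "combined-" + n))
     .apply("DumpSink", ParDo.of(new DoFn<String, Void>() {
           @ProcessElement
           public void processElement(@Element String record) {
             System.out.println(record);                 // "dump" sink used instead of the DB
           }
         }));

    p.run().waitUntilFinish();
  }
}
```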
The checkpoint slowly grows as we process more data; however, its size does not decrease significantly once data stops being consumed.
We suspected a bottleneck in the database sink, but the same behavior is present if we remove the sink and simply dump the data.
The checkpoint size follows a stepped pattern over each daily cycle, e.g.:
- checkpoint = 120KB (initial checkpoint size)
- checkpoint = 409MB (starts receiving data)
- checkpoint = 850MB (processing the backlog data)
- checkpoint = 503MB (finished processing data)
- checkpoint = 1.2GB (begins processing new data and backlog)
- checkpoint = 700MB (finished processing data)
- checkpoint = 700MB (new baseline size for the checkpoint)
- ...
Has anyone seen this behavior before? Is this a known issue with Flink checkpointing when using Apache Beam?