One of our Apache Beam jobs running on the FlinkRunner is exhibiting odd checkpoint size behavior. The state backend is filesystem-based. The job receives traffic once a day for a period of about an hour and is then idle until the next batch of data arrives.
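For reference, the backend is configured along these lines in flink-conf.yaml (the checkpoint directory below is a placeholder, not our real path):

```yaml
# Filesystem state backend; checkpoint path shown here is illustrative.
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints
```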
The pipeline uses no windowing strategy. It simply reads from a source, combines different values from that source, and writes the result to a sink. The only state we record is in the unbounded source.
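To make the shape concrete, the pipeline is roughly the sketch below. This is a minimal stand-in, not our actual code: `GenerateSequence` substitutes for our custom unbounded source, the per-record `MapElements` substitutes for our combining step, and the logging `ParDo` is the "dump" sink we used when testing without the database:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PipelineShapeSketch {
  public static void main(String[] args) {
    // Run with e.g. --runner=FlinkRunner --checkpointingInterval=60000
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadSource", GenerateSequence.from(0))      // stand-in for our unbounded source
     .apply("CombineValues", MapElements                 // per-record combination; no windowing,
         .into(TypeDescriptors.strings())                // so everything is in the global window
         .via((Long n) -> "combined-" + n))
     .apply("DumpSink", ParDo.of(new DoFn<String, Void>() {
           @ProcessElement
           public void processElement(@Element String record) {
             System.out.println(record);                 // "dump" sink used instead of the DB
           }
         }));

    p.run().waitUntilFinish();
  }
}
```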
The checkpoint slowly grows as we process more data; however, its size does not decrease significantly once data stops being consumed.
We suspected a bottleneck in the database sink, but the same behavior is present if we remove the sink and simply dump the data.
The checkpoint size follows a stepped pattern over each daily cycle, e.g.:
- checkpoint = 120KB (initial checkpoint size)
- checkpoint = 409MB (starts receiving data)
- checkpoint = 850MB (processing the backlog data)
- checkpoint = 503MB (finished processing data)
- checkpoint = 1.2GB (begins processing new data and backlog)
- checkpoint = 700MB (finished processing data)
- checkpoint = 700MB (new baseline size for the checkpoint)
- ...
Has anyone seen this behavior before? Is this a known issue with Flink checkpointing when using Apache Beam?