I have operator checkpointing enabled and working smoothly for a ProcessFunction operator.
On job failure I can see operator state being externalized in the snapshotState() hook, and on resume I can see state being restored in the initializeState() hook.
However, when I implement the CheckpointedFunction interface and the two aforementioned methods on an AsyncFunction, it does not seem to work. I'm doing virtually the same as with the ProcessFunction, but when the job shuts down after a failure, snapshotState() does not seem to be called, and when the job resumes, context.isRestored() is always false.
Why are CheckpointedFunction.snapshotState() and CheckpointedFunction.initializeState() executed with a ProcessFunction but not with an AsyncFunction?
Edited:
For some reason, my checkpoints are taking very long. My config is quite standard, I believe: an interval of 1 second, a 500 ms minimum pause, and exactly-once mode. No other tuning.
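For reference, the settings described above would look roughly like the sketch below. It is not my actual job code: the class name CheckpointConfigSketch is hypothetical, and the import paths assume a Flink 1.x release, where CheckpointingMode lives in org.apache.flink.streaming.api.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; only the three settings mentioned above are applied.
public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 1000 ms, in exactly-once mode.
        env.enableCheckpointing(1000L, CheckpointingMode.EXACTLY_ONCE);

        // Leave at least 500 ms between the end of one checkpoint
        // and the start of the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500L);

        // Sources, operators, sinks and env.execute(...) are omitted here.
    }
}
```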
I'm getting these traces from the checkpoint coordinator:
o.a.f.s.r.t.SubtaskCheckpointCoordinatorImpl - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 93905ms
2021-11-23 16:25:01 INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 4 for job 239d7967eac7900b33d7eadd483c9447 (671604 bytes in 112071 ms).
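To put those log values into more readable units, here is a small plain-Java sketch (no Flink involved). The class and method names are made up for illustration; the three constants are copied from the two log lines above, which may or may not refer to the same checkpoint.

```java
// Hypothetical helper; converts the raw figures from the log lines above.
public class CheckpointLogMath {
    // "exceeded threshold: 93905ms" (SubtaskCheckpointCoordinatorImpl)
    static final long START_DELAY_MS = 93_905L;
    // "Completed checkpoint 4 ... (671604 bytes in 112071 ms)"
    static final long CHECKPOINT_MS = 112_071L;
    static final long CHECKPOINT_BYTES = 671_604L;

    // Milliseconds to seconds.
    static double toSeconds(long millis) {
        return millis / 1000.0;
    }

    // Bytes to kibibytes (1 KiB = 1024 bytes).
    static double toKiB(long bytes) {
        return bytes / 1024.0;
    }

    // Average bytes written per second over the given duration.
    static double bytesPerSecond(long bytes, long millis) {
        return bytes * 1000.0 / millis;
    }

    public static void main(String[] args) {
        System.out.printf("delay before executing the checkpoint: %.1f s%n",
                toSeconds(START_DELAY_MS));
        System.out.printf("checkpoint 4 duration: %.1f s%n",
                toSeconds(CHECKPOINT_MS));
        System.out.printf("checkpoint 4 size: %.1f KiB%n",
                toKiB(CHECKPOINT_BYTES));
        System.out.printf("effective rate: %.0f bytes/s%n",
                bytesPerSecond(CHECKPOINT_BYTES, CHECKPOINT_MS));
    }
}
```

In other words, the reported delay is roughly 94 seconds, and checkpoint 4 took roughly 112 seconds for about 656 KiB.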
If I set a checkpointTimeout, it needs to be on the order of 5 minutes. How can a checkpoint of such a small state (it's just a Counter and a Long) take 5 minutes?
I've also read that NFS volumes are a recipe for trouble, but I haven't run this on the cluster yet; so far I'm only testing it on my local filesystem.