
I have operator checkpointing enabled and working smoothly for a ProcessFunction operator.

On job failure I can see operator state being externalized in the snapshotState() hook, and on resume I can see state being restored in the initializeState() hook.

However, when I implement the CheckpointedFunction interface and the two aforementioned methods on an AsyncFunction, it does not seem to work. I'm doing virtually the same as with the ProcessFunction, but when the job is shutting down after a failure it never seems to stop at the snapshotState() hook, and upon job resume, context.isRestored() is always false.
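For reference, this is roughly what the AsyncFunction looks like (trimmed down; class, state and field names are just placeholders for illustration):

```java
import java.util.Collections;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class MyAsyncFunction extends RichAsyncFunction<Long, Long>
        implements CheckpointedFunction {

    // operator state holding the last element seen before a failure
    private transient ListState<Long> lastElementState;
    private Long lastElement;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        lastElementState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("lastElement", Long.class));
        // isRestored() is always false here, even after the job resumes from a failure
        if (context.isRestored()) {
            for (Long value : lastElementState.get()) {
                lastElement = value;
            }
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // a breakpoint here is never hit
        lastElementState.clear();
        if (lastElement != null) {
            lastElementState.add(lastElement);
        }
    }

    @Override
    public void asyncInvoke(Long input, ResultFuture<Long> resultFuture) {
        lastElement = input;
        // ... fire the actual asynchronous request here ...
        resultFuture.complete(Collections.singleton(input));
    }
}
```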

Why are CheckpointedFunction.snapshotState() and CheckpointedFunction.initializeState() executed with ProcessFunction but not with AsyncFunction?

Edit: For some reason my checkpoints are taking very long. My config is very standard, I believe: an interval of 1 second, 500 ms min pause, exactly-once. No other tuning.
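For reference, the checkpoint configuration is essentially this (a minimal sketch; the checkpoint path is just a placeholder, I'm writing to the local filesystem):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpoint every second with exactly-once semantics
        env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);
        // leave at least 500 ms between the end of one checkpoint and the start of the next
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
        // checkpoints go to the local filesystem for now (placeholder path)
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");

        // ... build the pipeline and call env.execute() ...
    }
}
```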
I'm getting these traces from the checkpoint coordinator:

```
2021-11-23 16:24:43 WARN  o.a.f.s.r.t.SubtaskCheckpointCoordinatorImpl - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 93905ms
2021-11-23 16:25:01 INFO  o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 4 for job 239d7967eac7900b33d7eadd483c9447 (671604 bytes in 112071 ms).
```

If I attempt to set a checkpointTimeout, I need to set something in the order of 5 minutes or so. How come a checkpoint of such a small state (it's just a Counter and a Long) takes 5 minutes?
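That is, just to get checkpoints to complete I end up having to do something like:

```java
// ~5 minutes, which seems enormous for a state of just a counter and a long
env.getCheckpointConfig().setCheckpointTimeout(5 * 60 * 1000);
```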

I've also read that NFS volumes are a recipe for trouble, but so far I haven't run this on a cluster; I'm just testing it on my local filesystem.

1 Answer


AsyncFunction doesn't support state at all. The reason is that state primitives are not synchronized and thus would produce incorrect results in AsyncFunction. That's the same reason why there is no KeyedAsyncFunction.

If Flink had https://cwiki.apache.org/confluence/display/FLINK/FLIP-22%3A+Eager+State+Declaration implemented, it could simply attach the state to each async call and update it when the async call completes successfully.

You can work around the limitation with some trickery involving chained maps and slot sharing groups, but it's rather hacky; a rough sketch follows below.
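A very rough sketch of that idea, just to illustrate the shape of it (all names are placeholders; `MyAsyncFunction` stands in for your async function, which should then stay stateless):

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AsyncWithTrackedStateJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // the stateful map remembers the last element and is checkpointed normally
        SingleOutputStreamOperator<Long> tracked = env
                .fromElements(1L, 2L, 3L)
                .map(new LastElementTrackingMap())
                .slotSharingGroup("async-group");

        // the async operator itself stays stateless; pin it to the same slot
        // sharing group so both operators end up co-located
        AsyncDataStream
                .unorderedWait(tracked, new MyAsyncFunction(), 1, TimeUnit.SECONDS)
                .slotSharingGroup("async-group")
                .print();

        env.execute();
    }

    /** Stateful map chained in front of the async operator. */
    public static class LastElementTrackingMap extends RichMapFunction<Long, Long>
            implements CheckpointedFunction {

        private transient ListState<Long> lastElementState;
        private Long lastElement;

        @Override
        public Long map(Long value) {
            lastElement = value;
            return value;
        }

        @Override
        public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
            lastElementState.clear();
            if (lastElement != null) {
                lastElementState.add(lastElement);
            }
        }

        @Override
        public void initializeState(FunctionInitializationContext ctx) throws Exception {
            lastElementState = ctx.getOperatorStateStore()
                    .getListState(new ListStateDescriptor<>("lastElement", Long.class));
            for (Long value : lastElementState.get()) {
                lastElement = value;
            }
        }
    }
}
```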

Arvid Heise
  • this https://stackoverflow.com/a/62476382/11217621 seems to imply that, despite AsyncFunction not being allowed to hold any keyed state, it is still possible to mix it with CheckpointedFunction ...is that incorrect? I need to retain the last element before a job failure, so that my async function can resume from there. Isn't there any way to do this? – diegoruizbarbero Nov 16 '21 at 10:24
  • I double-checked the current code base and you are right. `CheckpointedFunction` should be supported. Can you execute your code in the IDE (always recommended) and check what [this method](https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/util/functions/StreamingFunctionUtils.java#L99-L99) is doing on checkpoint? – Arvid Heise Nov 16 '21 at 12:19
  • Can you please elaborate on where and how I should wire StreamingFunctionUtils.snapshotFunctionState(), and where I get its arguments from? ...in the initializeState() of the AsyncFunction? On every pass through asyncInvoke()? Could you illustrate it with some code? – diegoruizbarbero Nov 16 '21 at 13:59
  • Sorry, I meant that you should set a breakpoint at this function inside your IDE when executing Flink from your IDE (or attach a remote debugger to your task manager). There is no way for you to use it directly; it's used internally by Flink. – Arvid Heise Nov 16 '21 at 17:58
  • That is strange; which Flink version are you using? Could you try to find out what `AsyncWaitOperator#snapshotState()` is doing in the super call? – Arvid Heise Nov 18 '21 at 12:20
  • I've looked at some troubleshooting guides for slow checkpointing and I think my config is pretty standard (1 sec interval, 500 ms min pause), and the state size is minimal (a long timestamp). I have spotted this in the debug output: 2021-11-23 16:24:43 WARN o.a.f.s.r.t.SubtaskCheckpointCoordinatorImpl - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 93905ms 2021-11-23 16:25:01 INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 4 for job 239d7967eac7900b33d7eadd483c9447 (671604 bytes in 112071 ms). Why is it happening? – diegoruizbarbero Nov 23 '21 at 15:37
  • I think this is an unrelated issue that you could ask about on the Flink mailing list. My first guess is that you have backpressure in your pipeline (probably from the async I/O), so that the checkpoint barrier takes 100s to travel. – Arvid Heise Nov 24 '21 at 12:51
  • Thank you Arvid, your hints are much appreciated – diegoruizbarbero Nov 25 '21 at 14:51