How does TM recovery handle past broadcasted data

Question

In the context of HA for TaskManagers (TMs): when a TM goes down, the JobManager (JM) restores a new one from the latest checkpoint of the failed TM.

Say we have 3 TMs (tm1, tm2, and tm3). At a given time t, every TM's checkpoint (cp) is at cp1. All TMs broadcast data among themselves.

Now tm2 goes down, and the JM brings up tm2' from checkpoint cp1 as part of HA. By the time the new TM is up, at t+x, the others have progressed to cp2.

How is the data that tm1 and tm3 broadcast as part of cp2 replayed on tm2'?

David Anderson · Answer 1 · 2020-07-31T13:37:16.487

The contents of checkpoints are determined by checkpoint barriers. A given checkpoint includes exactly the effects throughout the entire cluster of everyone having processed all events up to the corresponding barrier, and none of the events after that barrier.

During a restore, the entire cluster is reset to the contents of the most recent checkpoint, and processing then resumes from that consistent starting point.

Broadcast data is checkpointed more or less like everything else, except that each instance stores its own copy of the broadcast data -- with the expectation that these copies are identical. During recovery, the broadcast source is rewound to the point recorded in the checkpoint, and the broadcast state is also recovered from the checkpoint. Any new instance (due to scaling up the cluster) will get a copy of the broadcast state (taken by reading the state intended for one of the other instances).
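To make the rewind-and-restore idea concrete, here is a minimal plain-Python sketch. It is not Flink code and uses no Flink API: `BroadcastSource`, `Instance`, `take_checkpoint`, and `restore` are hypothetical names invented for illustration, and the model assumes a replayable source and a single stop-the-world snapshot instead of real barriers.

```python
from copy import deepcopy


class BroadcastSource:
    """Hypothetical replayable source: a list of records plus a read offset."""

    def __init__(self, records):
        self.records = list(records)
        self.offset = 0

    def poll(self):
        record = self.records[self.offset]
        self.offset += 1
        return record


class Instance:
    """One parallel instance holding its own copy of the broadcast state."""

    def __init__(self, name):
        self.name = name
        self.broadcast_state = {}

    def on_broadcast(self, record):
        key, value = record
        self.broadcast_state[key] = value


def take_checkpoint(source, instances):
    # Record the source position and every instance's copy of the state.
    return {
        "offset": source.offset,
        "states": {i.name: deepcopy(i.broadcast_state) for i in instances},
    }


def restore(checkpoint, source, instances):
    # Reset the whole job, not only the replaced instance.
    source.offset = checkpoint["offset"]
    for i in instances:
        i.broadcast_state = deepcopy(checkpoint["states"][i.name])


def run(source, instances, n):
    # Deliver the next n broadcast records to every instance.
    for _ in range(n):
        record = source.poll()
        for i in instances:
            i.on_broadcast(record)


source = BroadcastSource([("rule", 1), ("rule", 2), ("limit", 10), ("rule", 3)])
instances = [Instance("tm1"), Instance("tm2"), Instance("tm3")]

run(source, instances, 2)
cp1 = take_checkpoint(source, instances)
run(source, instances, 1)  # progress made after cp1

# tm2 fails: replace it, then restore everyone from cp1 and replay.
instances[1] = Instance("tm2")
restore(cp1, source, instances)
run(source, instances, 2)

final_states = [i.broadcast_state for i in instances]
```

In this toy model the replacement for tm2 never needs a special replay of "missed" broadcasts: the source is rewound to the offset stored in cp1, so every instance sees the later records again and all copies end up identical.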

It may be that at the time of a failure, some machines have completed a new checkpoint, but a checkpoint will not be used for a restore unless every TM has completed that checkpoint, and the Job Manager has finalized it.
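The rule in the last paragraph can be sketched as a small acknowledgement tracker. This is a simplified plain-Python illustration, not Flink's actual coordinator: the class name `CheckpointCoordinator` and its methods are hypothetical here, and it assumes one acknowledgement per TM, which is coarser than what Flink really tracks.

```python
class CheckpointCoordinator:
    """Toy model: a checkpoint is finalized only after every task acknowledges it."""

    def __init__(self, task_names):
        self.tasks = set(task_names)
        self.pending = {}    # checkpoint id -> set of tasks that acknowledged
        self.completed = []  # finalized checkpoint ids, oldest first

    def acknowledge(self, checkpoint_id, task):
        acks = self.pending.setdefault(checkpoint_id, set())
        acks.add(task)
        if acks == self.tasks:
            # Every task has reported in, so the checkpoint becomes restorable.
            del self.pending[checkpoint_id]
            self.completed.append(checkpoint_id)

    def latest_restorable(self):
        return self.completed[-1] if self.completed else None


coordinator = CheckpointCoordinator(["tm1", "tm2", "tm3"])

# cp1: all three TMs acknowledge.
for task in ("tm1", "tm2", "tm3"):
    coordinator.acknowledge(1, task)

# cp2: tm1 and tm3 acknowledge, but tm2 fails before it does.
coordinator.acknowledge(2, "tm1")
coordinator.acknowledge(2, "tm3")

restore_from = coordinator.latest_restorable()
```

Under these assumptions `restore_from` is checkpoint 1: checkpoint 2 stays pending because tm2 never acknowledged it, so the work tm1 and tm3 did toward it is discarded on restore.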

edited Jul 31 '20 at 13:37

answered Jul 30 '20 at 20:13

David Anderson


_It may be that at the time of a failure, some machines have completed a new checkpoint_ how do things work in case of **exactly once** processing? – ardhani Jul 31 '20 at 10:48
Exactly once works as I have described; at least once is simpler. A thorough explanation of Flink's checkpointing is a bit too much for a Stack Overflow question/answer. Maybe start here -- https://ci.apache.org/projects/flink/flink-docs-stable/learn-flink/fault_tolerance.html -- if you want to learn more. – David Anderson Jul 31 '20 at 11:42
I've added a paragraph to explain how broadcast state is handled during checkpointing and recovery. – David Anderson Jul 31 '20 at 13:37
Thanks @David, I have one simple yes-or-no question: can exactly-once processing be achieved with unaligned checkpointing? – ardhani Aug 01 '20 at 18:35
Yes. The new unaligned checkpointing introduced in Flink 1.11 is simply an alternative approach that also provides exactly once semantics. – David Anderson Aug 01 '20 at 19:11
