In a Flink DataStream job, suppose that an upstream operator is hosted on machine (TaskManager) m. How does the upstream operator know the machine (TaskManager) m' on which the downstream operator is hosted? Are the data flow paths between upstream and downstream operators established when the JobManager initially schedules the job's subtasks (operators), and do those paths stay fixed for the lifetime of the application?
More generally, consider Flink Stateful Functions, where dynamic messaging is supported and data flows are not fixed or predefined. Given a function with key k that needs to send a message/event to another function with key k', how would function k find the address of function k' in order to message it? Does the Flink runtime keep key-to-machine mappings in some distributed data structure (e.g., a DHT as in Microsoft Orleans), so that every function invocation involves an access to that data structure?
Note that I come from a Spark background where, given the RDD/batch model, job graph tasks are executed consecutively (broken at shuffle boundaries), and each shuffle subtask is told which machines hold the subset of keys that it should pull and process.
Thank you.