Let me put it out there: I am a complete beginner with Flink and trying to grasp the concepts as much as possible.
Let's say I have a Flink cluster with 10 task managers and a Flink job running on each of them. The job also uses broadcast state. This broadcast state is created by reading 5 S3 files every 10 minutes, doing some processing, and building a map of int to list of strings, which is then broadcast.
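For concreteness, the setup I have in mind looks roughly like this (just a simplified sketch; `readAndProcessS3Files()` is a placeholder for my actual S3 reading and processing logic):

```java
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

// ... inside the job setup ...
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

MapStateDescriptor<Integer, List<String>> descriptor =
        new MapStateDescriptor<>("s3-lookup-state", Types.INT, Types.LIST(Types.STRING));

// Source that re-reads the 5 S3 files every 10 minutes and emits the processed map entries.
DataStream<Tuple2<Integer, List<String>>> s3Stream = env
        .addSource(new RichSourceFunction<Tuple2<Integer, List<String>>>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Tuple2<Integer, List<String>>> ctx) throws Exception {
                while (running) {
                    // readAndProcessS3Files() is a placeholder: it reads the 5 files and
                    // builds the Map<Integer, List<String>>.
                    for (Map.Entry<Integer, List<String>> e : readAndProcessS3Files().entrySet()) {
                        ctx.collect(Tuple2.of(e.getKey(), e.getValue()));
                    }
                    Thread.sleep(10 * 60 * 1000L); // wait 10 minutes before the next refresh
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        })
        .returns(Types.TUPLE(Types.INT, Types.LIST(Types.STRING)));

// The processed entries are broadcast; the downstream operator stores them in broadcast state.
BroadcastStream<Tuple2<Integer, List<String>>> broadcastStream = s3Stream.broadcast(descriptor);
```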
Question: Where does the reading of the files happen? Is it the JobManager that reads and processes the files and sends the processed content over to the task managers?
Or
is it the task managers that do all the reading and processing? If that is the case, how does Flink make sure that the broadcast state ends up the same on all task managers if one task manager fails to read from S3?
EDIT
So the task managers read the broadcast stream and broadcast it to the downstream tasks.
E.g. let's say there is a Kafka stream with 5 partitions that needs to be broadcast, and a downstream operator with a parallelism of 5 as well.
- The partition 1 consumer task reads an element from the stream and sets it in the broadcast state. As soon as it is set, the state is broadcast to all 5 downstream operator tasks.
- The partition 2 consumer task reads an element from the stream and sets it in the broadcast state.
Question: At this point, do we need to make sure that we do not overwrite the elements from partition 1 when we set the broadcast state from the partition 2 element, or does Flink itself manage this?
OR
Also, how can we be sure that, by the time the partition 2 consumer has read an element and set the broadcast state, the broadcast state from partition 1 has already reached the downstream operator task corresponding to partition 2?
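For reference, the downstream operator I am imagining looks roughly like this (just a sketch, assuming the broadcast elements are the (int, list-of-strings) pairs from my S3 example and the non-broadcast records are plain strings; `extractKey()` is a placeholder):

```java
import java.util.List;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class EnrichFunction
        extends BroadcastProcessFunction<String, Tuple2<Integer, List<String>>, String> {

    private final MapStateDescriptor<Integer, List<String>> descriptor;

    public EnrichFunction(MapStateDescriptor<Integer, List<String>> descriptor) {
        this.descriptor = descriptor;
    }

    @Override
    public void processBroadcastElement(
            Tuple2<Integer, List<String>> entry, Context ctx, Collector<String> out) throws Exception {
        // Called on every parallel downstream task for every broadcast element,
        // regardless of which Kafka partition the element came from.
        // This put() only replaces the entry stored under entry.f0 (the key).
        ctx.getBroadcastState(descriptor).put(entry.f0, entry.f1);
    }

    @Override
    public void processElement(String record, ReadOnlyContext ctx, Collector<String> out) throws Exception {
        // extractKey() is a placeholder for however the main records are keyed.
        List<String> lookup = ctx.getBroadcastState(descriptor).get(extractKey(record));
        out.collect(lookup == null ? record : record + " -> " + lookup);
    }

    private int extractKey(String record) {
        return record.hashCode(); // placeholder
    }
}
```

which I would wire up with something like `mainStream.connect(kafkaStream.broadcast(descriptor)).process(new EnrichFunction(descriptor))`.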