I am new to Flink, and as part of a research project I am trying to figure out the following:
1- How exactly does Flink (I am using the DataSet API on a single machine) distribute tasks among the available threads/slots? Which algorithms or techniques are used?
2- Does Flink decide that task A will be assigned to thread 1 or thread 2, or does whichever thread is available execute that task?
I have already run some examples and used the web UI to gather some information, but I still don't know the answers for sure.
If someone could help, or knows of any references that would give me more insight, I would appreciate it. Thanks a lot.
Update: to offer more detail and explain myself better, here is the program, which is very simple:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(16);
DataSet<String> text = env.readTextFile(filePath);
DataSet<Tuple2<String, Integer>> wordTuples = text
        .flatMap(new Tokenizer())
        .name("FlatMap Operation");
wordTuples.writeAsText("Path");
env.execute();
The first image shows information about the first task of my job. Each subtask gets 4 records, except that the subtask with ID 0 gets nothing and the subtask with ID 13 gets 8 records. Why is that happening? Who decides which subtask or slot should do which work?
The second image shows the second task, which receives the data sent from the first task. The same subtasks are working, with the same numbers of records. Why is that?
So, my questions again:
Why, in the first task, was only one slot used to read all 5 records? Who decides which slot does which work?
The next image shows the output. Why is subtask 14 the one with the doubled data, and not subtask 13 as shown in the first and second images?
In case the structure of the data is important: the data I am testing on consists of 16 lines, each of the following form:
My Name Is [choose a name]
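To make the record counts mentioned above concrete, here is a minimal stand-alone sketch in plain Java (no Flink dependency). It assumes the Tokenizer simply splits each line on whitespace and emits one record per word, which may not match the actual class used in the job. The class name TokenCountSketch, its methods, and the names Alice and Bob are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the job's Tokenizer. ASSUMPTION: it splits each
// line on whitespace and emits one record per word.
class TokenCountSketch {

    // Splits a line on whitespace and returns the non-empty tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Number of records a subtask would emit if it processed these lines.
    static int recordsFor(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            count += tokenize(line).size();
        }
        return count;
    }

    public static void main(String[] args) {
        // One input line of the form used in the test data.
        System.out.println(recordsFor(Arrays.asList("My Name Is Alice")));
        // Two input lines handled by the same subtask.
        System.out.println(recordsFor(Arrays.asList("My Name Is Alice", "My Name Is Bob")));
    }
}
```

Under that assumption, one line of the test data yields 4 records, so a subtask reporting 4 records would have processed a single line, a subtask reporting 8 would have processed two lines, and a subtask reporting nothing would have received no line.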
Sorry for the long explanation