
I am new to Flink, and as part of a research project I am trying to figure out: 1- How exactly does Flink (I am using the DataSet API on a single machine) distribute tasks among the available threads/slots, and which algorithms or techniques does it use? 2- Does Flink decide that task A will be assigned to thread 1 or thread 2, or does whatever thread is available execute that task?

I have already run some examples and used the Web UI to get some information, but I still don't know the answers for sure.

If someone could help, or knows of any references that would give me more insight, I would appreciate it. Thanks a lot.

Update: to offer more details and explain myself better, the program is very simple:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(16);

DataSet<String> text = env.readTextFile(filePath);

DataSet<Tuple2<String, Integer>> wordTuples = text
        .flatMap(new Tokenizer()).name("FlatMap Operation");

wordTuples.writeAsText("Path");

env.execute();
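The Tokenizer class is not shown above. As a rough stand-in, assuming it simply splits each line on whitespace and emits one (word, 1) pair per word, its logic can be sketched in plain Java without any Flink types. The class name `TokenizerSketch`, the method `tokenize`, and the sample name "Bob" are made up for illustration and are not part of the real job.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Tokenizer used above; plain Java, no Flink types.
class TokenizerSketch {

    // Assumption: split a line on whitespace and emit one (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> tokenize(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One four-word input line ("Bob" is a made-up name).
        List<Map.Entry<String, Integer>> tuples = tokenize("My Name Is Bob");
        System.out.println(tuples.size() + " records: " + tuples);
    }
}
```

Under that assumption, a line of four words produces four records, so the record counts reported per subtask would be multiples of four whenever a subtask reads whole lines.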

The first image shows information about the first task of my job. Each subtask gets 4 records, except that the subtask with ID 0 gets nothing and the subtask with ID 13 gets 8 records. Why is that happening? Who decides which subtask or slot should do which work?

The second image shows the second task, which receives the data sent from the first task. The same subtasks are working, with the same numbers of records. Why is that? So my question again: why, in the first task, was only one slot used to read the whole 5 records? Who decides which slot does which work?

The next image shows the output. Why is subtask 14 the one with doubled data, and not 13 as shown in the first and second images? In case the structure of the data is important: the data I am testing on consists of 16 lines, each of the form: My Name Is[choose a name]. Sorry for the long explanation.

Mahmoud