
I'm confused about the number of tasks that can run in parallel in Flink.

Can someone explain to me:

  • What does parallelism mean in a distributed system, and how does it relate to Flink terminology?
  • In Flink, does a parallelism of 2 mean that 2 tasks run in parallel?
  • In Flink, if 2 operators run separately but each has a parallelism of 1, does that count as parallel computation?
  • Is it true that in a KeyedStream, the maximum parallelism is the number of keys?
  • Is the current CEP engine in Flink able to run in more than 1 task?

Thank you.

Maher Marwani

1 Answer


Flink uses the term parallelism in a pretty standard way -- it refers to running multiple copies of the same computation simultaneously on multiple processors, but with different data. When we speak of parallelism with respect to Flink, it can apply to an operator that has parallel instances, or it can apply to a pipeline or job (composed of several operators).

In Flink it is possible for several operators to work separately and concurrently. E.g., in this job

source ---> map ---> sink

the source, map, and sink could all be running simultaneously in separate processors, but we wouldn't call that parallel computation. (Distributed, yes.)

In a typical Flink deployment, the number of task slots equals the parallelism of the job, and each slot is executing one complete parallel slice of the application. Each parallel instance of an operator chain will correspond to a task. So in the simple example above, the source, map, and sink can all be chained together and run in a single task. If you deploy this job with a parallelism of two, then there will be two tasks. But you could disable the chaining, and run each operator in its own task, in which case you'd be using six tasks to run the job with a parallelism of two.
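The task-count arithmetic above can be made explicit with a small helper (a conceptual sketch for a linear pipeline, not Flink API):

```python
def task_count(num_operators: int, parallelism: int, chaining: bool) -> int:
    """Tasks needed to run a linear pipeline of operators.

    With chaining enabled, all operators in the chain share one task
    per parallel slice; with chaining disabled, each operator runs in
    its own task per parallel slice.
    """
    return parallelism if chaining else num_operators * parallelism

# source -> map -> sink (3 operators) with parallelism 2
print(task_count(3, 2, chaining=True))   # 2 tasks (one chain per slice)
print(task_count(3, 2, chaining=False))  # 6 tasks (one per operator per slice)
```

Note that real Flink jobs chain only where the API allows (same parallelism, forward connections, and so on), so this helper captures just the two extreme cases described above.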

Yes, with a KeyedStream, the number of distinct keys is an upper bound on the parallelism.
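A simplified model shows why the key count is an upper bound: each key is routed to exactly one parallel instance, so instances beyond the number of distinct keys never receive data. (Flink actually routes keys through key groups before assigning them to instances, but the bound is the same.)

```python
def effective_parallelism(keys, parallelism):
    """Simulate key-based partitioning: each distinct key is hashed to
    one of `parallelism` parallel instances. Only instances that
    receive at least one key do any work, so the effective parallelism
    can never exceed the number of distinct keys."""
    busy_instances = {hash(k) % parallelism for k in set(keys)}
    return len(busy_instances)

# With only 2 distinct keys, at most 2 of the 8 instances receive data,
# no matter how many events arrive.
events = ["a", "b", "a", "a", "b"]
assert effective_parallelism(events, 8) <= 2
```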

CEP can run in parallel if it is operating on a KeyedStream (in which case, the pattern matching is being done independently for each key).

David Anderson
  • Thank you for your answer. I have a problem understanding the impact of parallelism on my application. I am currently implementing a pattern matching engine on top of Flink, and it requires N input streams and M output streams. All operators are n-ary, which means an operation can be applied to more than 2 streams. I can union the streams and then apply the stateful operation, so I have two possible solutions: 1) create virtual keys to group streams and match patterns on each one of the streams; 2) create a unified stream that gathers all the data into one single task. – Maher Marwani Apr 12 '20 at 00:33
  • I think the question boils down to this: can you meaningfully partition the total dataflow, or do many of the patterns need to see every event across all N inputs? – David Anderson Apr 13 '20 at 08:55
  • Yes, exactly. Let's say a user defines sources and patterns (complex or simple), and the engine emits matched patterns. If a pattern requires 4 input streams, I need to somehow gather the 4 streams in one place in order to detect the pattern. What is the optimal way to do that? And is there a solution where I can have shared memory for all 4 streams and at the same time compute things in parallel? From my knowledge, the only possible solution is to union all 4 streams and create a dummy key to use keyed state, but what will be the impact of this in terms of parallelism and scaling? – Maher Marwani Apr 13 '20 at 09:50
  • Union is the obvious way to bring the input streams together. And for doing pattern matching in parallel, you could key the unioned stream by a patternId, which will let you scale up the pattern matching. – David Anderson Apr 13 '20 at 12:05