
I'm trying to understand which factors I need to take into consideration before submitting a Flink job.

My question is: what is parallelism, is there a (physical) upper bound on it, and how does parallelism impact the performance of my job?

For example, I have a Flink CEP job that detects a pattern from an unkeyed stream; its parallelism will always be 1 unless I partition the datastream with the keyBy operator.

Please correct me if I'm wrong:

If I partition the data stream, then I will have a parallelism equal to the number of different keys. But the problem is that the pattern matching is done independently for each key, so I can't define a pattern that requires information from two partitions with different keys.
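To illustrate the per-key behavior: the sketch below is plain Java, not the Flink API, and the modulo-hash is a deliberate simplification of Flink's key-group hashing. It shows why a keyed CEP pattern only ever sees events that share a key: every event with the same key is routed to the same parallel subtask.

```java
// Simplified sketch of keyBy routing (NOT the real Flink hashing scheme).
public class KeyByRouting {
    // Every event with the same key is assigned to one fixed subtask.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Two events with the same key always land on the same subtask,
        // so a CEP pattern over that key sees both of them.
        System.out.println(
            subtaskFor("sensor-1", parallelism) == subtaskFor("sensor-1", parallelism)); // true
        // Events with different keys may land on different subtasks, which is
        // why a pattern spanning two keys cannot be matched on a keyed stream.
        System.out.println(subtaskFor("sensor-1", parallelism));
        System.out.println(subtaskFor("sensor-2", parallelism));
    }
}
```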

Maher Marwani

1 Answer


It's not bad to use Flink with parallelism = 1. But it defeats the main purpose of using Flink (being able to scale).

In general, you should not have a higher parallelism than your cores (physical or virtual, depending on the use case), as you want to saturate your cores as much as possible. Anything over that will negatively impact your performance, as it requires more communication overhead and context switching. By scaling out, you can add cores from distributed compute nodes in a network, which is the main benefit of using big data technologies vs. writing applications by hand.
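One common starting point is to cap parallelism at the number of available cores. The snippet below is plain Java that just computes that number; in a Flink program you would typically pass it to StreamExecutionEnvironment#setParallelism.

```java
// Sketch: derive a parallelism hint from the machine's core count.
public class ParallelismHint {
    static int suggestedParallelism() {
        // One subtask per core keeps the cores saturated without the
        // context-switching overhead of oversubscription.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("suggested parallelism = " + suggestedParallelism());
    }
}
```

On a real cluster the upper bound is the total number of task slots across all TaskManagers, not just one machine's cores.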

As you said, you can only exploit parallelism if you partition your data. If you have an algorithm that needs all data, you need to process it on one core eventually. However, you can usually do lots of preprocessing (filtering, transformation) and partial aggregations in parallel before combining the data on a final core. For example, think of simply counting all events. You can count the data of each partition and then simply sum up the partial counts in a final step, which scales almost perfectly.
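The count decomposition can be sketched outside Flink in plain Java: each "partition" counts its own events independently (the parallelizable part), and only the cheap final sum needs to see all partial results.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a decomposable count: partial aggregation per partition,
// followed by a final combine step that runs with parallelism 1.
public class PartialCount {
    static long countAll(List<long[]> partitions) {
        return partitions.stream()
                .mapToLong(p -> p.length) // partial count of one partition (parallelizable)
                .sum();                   // final combine over tiny partial results
    }

    public static void main(String[] args) {
        List<long[]> partitions = Arrays.asList(
                new long[]{1, 2}, new long[]{3}, new long[]{4, 5, 6});
        System.out.println(countAll(partitions)); // prints 6
    }
}
```

Sums, min/max, and many sketches (e.g. approximate distinct counts) decompose the same way; a median, by contrast, does not decompose so cleanly.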

If your algorithm does not allow splitting it up, then your use case may not allow distributed processing. In that case, Flink is not a good fit. However, it's worth exploring whether alternative algorithms (sometimes approximate ones) would suffice for your use case as well. Splitting monolithic algorithms into parallelizable sub-algorithms is the art of data engineering.

Arvid Heise
  • It seems to me that some applications may benefit from using Flink to easily organize the concurrent execution of fault tolerant pipeline stages, even if the parallelism of each task is only one. But that’s hardly a mainstream use case. – David Anderson May 15 '20 at 07:36
  • Thank you for this information. So if I understand right, Flink CEP has some limitations, such as scalability, if the input is an unkeyed stream, right? – Maher Marwani May 15 '20 at 08:20
  • It depends on the pattern. If it uses decomposable functions (see my count example), then it will scale. You can easily check that in the UI when executing if all subtasks get data or not. In general, if your algorithm is unkeyed and cannot be decomposed, don't expect it to process larger data volumes. Unless I have a very specific use case, I always try to find a scalable algorithm, as data volume grows much faster than the power of a single CPU. – Arvid Heise May 15 '20 at 18:50