Partition the whole dataStream in flink at the start of source and maintain the partition till sink

Question

I am consuming trail logs from a queue (Apache Pulsar). I use 5 KeyedProcessFunctions and finally sink the payload to a Postgres DB. I need ordering per customerId for each KeyedProcessFunction. Right now I achieve this with:

Datasource.keyBy(fooKeyFunction).process(processA).keyBy(fooKeyFunction).process(processB).keyBy(fooKeyFunction).process(processC).keyBy(fooKeyFunction).process(processE).keyBy(fooKeyFunction).sink(fooSink).

processFunctionC is very time-consuming and takes up to 30 seconds in the worst case to finish. This leads to backpressure. I tried assigning more slots to processFunctionC, but my throughput never remains constant. It mostly stays below 4 messages per second.

The current slots per processFunction are:

processFunctionA: 3
processFunctionB: 30
processFunctionc: 80
processFunctionD: 10
processFunctionC: 10

In the Flink UI, backpressure shows up starting from processB, meaning C is very slow. Is there a way to apply the partitioning logic at the source itself and assign the same number of slots to each processFunction? For example:

dataSource.magicKeyBy(fooKeyFunction).setParallelism(80).process(processA).process(processB).process(processC).process(processE).sink(fooSink).

This would limit backpressure to only a few of the tasks, instead of skewing it the way the multiple keyBy calls do.

Another approach I can think of is to combine all my processFunctions and the sink into a single processFunction and apply all of that logic in the sink itself.

What happens if you just set the parallelism of the whole pipeline to 80 on the environment? — aljoscha, Jun 11 '20 at 09:28

score 1 · Answer 1 · answered Jun 12 '20 at 23:28

I don't think anything quite like this exists. The closest thing is DataStreamUtils.reinterpretAsKeyedStream, which recreates the KeyedStream without actually sending any data between the operators, since it uses a partitioner that only forwards data locally. This is more or less what you wanted. It still adds a partitioning operator and recreates the KeyedStream under the hood, but it should be simpler and faster, and perhaps it will solve the issue you are facing.
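A minimal sketch of how that could look, assuming Flink's Java DataStream API is on the classpath. The `Tag` function, the `CustomerKey` selector, and the `customerId:payload` record format are hypothetical stand-ins for the real processA/processB and fooKeyFunction. Note that reinterpretAsKeyedStream is marked experimental in Flink, and it is only safe when the stream is still partitioned exactly as keyBy would partition it: the key must not change between operators, and the operators must run at the same parallelism.

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ReinterpretSketch {

    // Hypothetical key function: the customerId is the text before the first ':'.
    static class CustomerKey implements KeySelector<String, String> {
        @Override
        public String getKey(String record) {
            return record.split(":", 2)[0];
        }
    }

    // Hypothetical stand-in for processA / processB: appends a tag and forwards
    // the record. The customerId prefix is left untouched, so the key is stable.
    static class Tag extends KeyedProcessFunction<String, String, String> {
        private final String tag;

        Tag(String tag) {
            this.tag = tag;
        }

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) {
            out.collect(value + "|" + tag);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Same parallelism for every operator, so local forwarding keeps the partitioning.
        env.setParallelism(4);

        DataStream<String> source = env.fromElements("c1:login", "c2:login", "c1:logout");

        // The only real shuffle: records are hash-partitioned by customerId here.
        DataStream<String> afterA = source.keyBy(new CustomerKey()).process(new Tag("A"));

        // No second shuffle: the already-partitioned stream is treated as keyed again.
        KeyedStream<String, String> rekeyed =
                DataStreamUtils.reinterpretAsKeyedStream(afterA, new CustomerKey());

        rekeyed.process(new Tag("B")).print();

        env.execute("reinterpret-sketch");
    }
}
```

The same pattern would repeat before each later process function, replacing every keyBy after the first one.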

If this does not solve the issue, then I think the best solution would be to group the operators as you suggested, i.e. merge all of them into one bigger operator. This should minimize backpressure.
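The merging idea can be sketched without any Flink types. This is only an illustration: `stepA` through `stepE` are hypothetical placeholders for the bodies of the original process functions, and in a real job `processAll` would be called from the processElement method of a single KeyedProcessFunction placed after one keyBy.

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class MergedStages {

    // Hypothetical stand-ins for the logic of processA, processB, processC and processE.
    static String stepA(String record) { return record + "|A"; }
    static String stepB(String record) { return record + "|B"; }
    static String stepC(String record) { return record + "|C"; }
    static String stepE(String record) { return record + "|E"; }

    // The stages in pipeline order.
    private static final List<UnaryOperator<String>> STEPS = List.of(
            MergedStages::stepA,
            MergedStages::stepB,
            MergedStages::stepC,
            MergedStages::stepE);

    // Runs every stage on one record, in order, inside a single call.
    // No data leaves the operator between stages, so per-key order is kept.
    public static String processAll(String record) {
        String current = record;
        for (UnaryOperator<String> step : STEPS) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(processAll("c1:login")); // prints "c1:login|A|B|C|E"
    }
}
```

With this layout there is a single parallelism setting for the whole chain, so the slow stage no longer needs its own slot count.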
