2

I know that Flink comes with custom partitioning APIs. However, the problem is that, after invoking partitionCustom on a DataStream you get a DataStream back and not a KeyedStream.

On the other hand, you cannot override the partitioning strategy for a KeyedStream.

I do want to use KeyedStream, because the API for DataStream does not have reduce and sum operators and because of automatically partitioned internal state.

I mean, if the word count is:

words.map(s -> Tuple2.of(s, 1)).keyBy(0).sum(1)

I wish I could write:

words.map(s -> Tuple2.of(s, 1)).partitionCustom(myPartitioner, 0).sum(1)

Is there any way to accomplish this?

Thank you!

affo
  • 453
  • 3
  • 15

1 Answers1

0

From Flink's documentation (as of Version 1.2.1), what partitioners do is to partition data physically with respect to their keys, only specifying their locations stored in the partition physically in the machine, which actually have not logically grouped the data to keyed stream. To do the summarization, we still need to group them by keys using "keyBy" operator, then we are allowed to do the "sum" operations. Details please refer to "https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning" :)