
I have a question regarding sharding data in a Kinesis stream. I would like to use a random partition key when sending user data to my kinesis stream so that the data in the shards is evenly distributed. For the sake of making this question simpler, I would then like to aggregate the user data by keying off of a userId in my Flink application.

My question is this: if the shards are randomly partitioned so that data for one userId is spread across multiple Kinesis shards, can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task? Or, do I need to shard the kinesis stream by user id before it is consumed by Flink?

ChrisATX

1 Answer


... can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task?

The effect of a keyBy(e -> e.userId), if you use Flink's DataStream API, is to redistribute all of the events so that all events for any particular userId will be streamed to the same downstream aggregator task.
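As a plain-Java sketch of the guarantee keyBy provides (this is a simplified model, not Flink's actual hash-partitioning; the class and method names are illustrative, and the point is only that a key deterministically selects one subtask):

```java
import java.util.*;

public class KeyByModel {
    // Simplified stand-in for keyBy: a given key always maps to the same subtask.
    static int subtaskFor(String userId, int parallelism) {
        return Math.floorMod(userId.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        List<String> events = List.of("user-1", "user-2", "user-1", "user-3", "user-1");

        // Group the incoming events by the subtask that would receive them.
        Map<Integer, List<String>> bySubtask = new TreeMap<>();
        for (String userId : events) {
            bySubtask.computeIfAbsent(subtaskFor(userId, parallelism), k -> new ArrayList<>())
                     .add(userId);
        }
        // Every occurrence of "user-1" lands in the same subtask's list,
        // regardless of which Kinesis shard it was read from.
        System.out.println(bySubtask);
    }
}
```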

Would each host read in data from a subset of the shards in the stream and would Flink then use the keyBy operator to pass messages of the same key to the host that will perform the actual aggregation?

Yes, that's right.

If, for example, you have 8 physical hosts, each providing 8 slots for running the job, then there will be 64 instances of the aggregator task, each of which will be responsible for a disjoint subset of the key space.

Assuming there are at least 64 shards to read from, then in each of the 64 tasks the source will read from one or more shards and distribute the events it reads according to their userIds. Assuming the userIds are evenly spread across the shards, each source instance will find that a few of the events it reads belong to userIds it has been assigned to handle, and those can go to the local aggregator. Each of the remaining events will need to be sent to one of the other 63 aggregators, depending on which worker is responsible for that userId.
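A rough sketch of how Flink decides which of the 64 aggregators owns a given userId: each key hashes into a key group, and the key groups are split evenly across the parallel operator instances. (This is a simplification of Flink's key-group assignment; the real implementation additionally murmur-hashes `key.hashCode()`, and the numbers below are illustrative.)

```java
public class KeyGroupSketch {
    // Simplified key-group assignment: a key hashes to a key group, and key
    // groups are divided among the parallel instances of the operator.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // Flink's default maximum parallelism
        int parallelism = 64;     // 8 hosts x 8 slots, as in the example above

        String userId = "user-12345";
        int keyGroup = keyGroupFor(userId, maxParallelism);
        int operator = operatorIndexFor(keyGroup, maxParallelism, parallelism);
        System.out.println(userId + " -> key group " + keyGroup
                + " -> aggregator task " + operator);
    }
}
```

Because the mapping depends only on the key and the (max) parallelism, every source instance computes the same aggregator index for a given userId, which is what makes the redistribution consistent across hosts.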

David Anderson
  • How does this scale? Eventually, there would need to be multiple physical hosts in a Flink cluster to handle a large throughput of data. How is the work divided across multiple hosts? Would each host read in data from a subset of the shards in the stream and would Flink then use the keyBy operator to pass messages of the same key to the host that will perform the actual aggregation? Or would all the data for a particular key need to be in the same shard so that it can be read and processed by a single host in the cluster? – ChrisATX Feb 17 '20 at 16:55
  • Thank you for updating the original answer. That is exactly what I was looking for. One last follow-up question: do you know anything about the expected differences in performance? I imagine there would be a lot fewer messages being passed over the network between hosts if I shard on userId, but I have read that Flink has a pretty sophisticated credit-based data transfer algorithm. Any idea on the actual impact on performance? – ChrisATX Feb 17 '20 at 21:48
  • The network isn't much of a factor. What matters more is the ser/de overhead, and that's going to happen either way, because of the keyBy. – David Anderson Feb 19 '20 at 14:20
  • BTW, it would be nice if you could accept the answer. – David Anderson Feb 19 '20 at 14:20