
Update:

To give more detail on the problem: put_records calls are charged based on the number of records (partition keys) submitted and the size of those records. Any record smaller than 25KB is charged as one PU (Payload Unit). Our individual records average about 100 bytes and arrive roughly once per second per UID, so if we put them individually we will spend a couple of orders of magnitude more money on PUs than we need to.

Regardless of the solution, we want a given UID to always end up in the same shard to simplify the work on the other end of Kinesis. This happens naturally if the UID is used as the partition key.

One way to deal with this would be to continue to do a put for each UID, but buffer the records in time. To use PUs efficiently, though, we'd have to fill each 25KB payload before sending, which at ~100 bytes per second introduces a delay of about 250 seconds into the stream.

The combination of the answer given here and this question gives me a strategy for mapping multiple user IDs to static (predetermined) partition keys for each shard.

This would allow multiple UIDs to be batched into one Payload Unit (using the shared partition key for the target shard), so records can be written out as they come in each second while still ensuring a given UID ends up in the correct shard.

Then I just need a buffer for each shard; as soon as the buffered records total just under 25KB, or 500 records have accumulated (the maximum per put_records call), the data can be pushed.
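To make that concrete, here is a rough sketch of the per-shard buffering, assuming a predetermined partition key per shard and a shard_for_uid lookup (building both is what the rest of this question is about). The stream name, the key map, and the helper names are placeholders, not working values:

```python
import json

import boto3

STREAM_NAME = "my-stream"           # placeholder stream name
MAX_PACKED_BYTES = 25 * 1024 - 512  # stay just under the 25KB PU boundary

kinesis = boto3.client("kinesis")

# Predetermined partition key known to land on each shard (hypothetical values;
# building this map is what the MD5 / brute-force approaches below are for).
SHARD_PARTITION_KEYS = {
    "shardId-000000000000": "key-a",
    "shardId-000000000001": "key-b",
}

buffers = {shard_id: [] for shard_id in SHARD_PARTITION_KEYS}
buffer_sizes = {shard_id: 0 for shard_id in SHARD_PARTITION_KEYS}


def add_record(uid, payload, shard_for_uid):
    """Buffer one UID's payload; flush the target shard's buffer when it nears 25KB."""
    shard_id = shard_for_uid(uid)  # e.g. the MD5 lookup sketched further down
    entry = json.dumps({"uid": uid, "data": payload})
    if buffer_sizes[shard_id] + len(entry) > MAX_PACKED_BYTES:
        flush(shard_id)
    buffers[shard_id].append(entry)
    buffer_sizes[shard_id] += len(entry)


def flush(shard_id):
    """Pack the buffered entries into a single record (one PU) aimed at the target shard."""
    if not buffers[shard_id]:
        return
    packed = "\n".join(buffers[shard_id]).encode("utf-8")
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=packed,
        PartitionKey=SHARD_PARTITION_KEYS[shard_id],
    )
    buffers[shard_id] = []
    buffer_sizes[shard_id] = 0
```

For simplicity this flushes each packed payload with put_record; batching several packed payloads into a single put_records call (up to 500 records) would be a further refinement.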

That just leaves figuring out ahead of time which shard a given UID would naturally map to if it was used as a partition key.

The AWS Kinesis documentation says this is the method:

Partition keys are Unicode strings with a maximum length limit of 256 bytes. An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards.

Unless someone has done this before, I'll try the method from that question and see if it generates valid mappings. I'm wondering whether I need to convert a regular Python string to a unicode string before taking the MD5.
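For reference, here is a minimal sketch of that approach using boto3 and a placeholder stream name. It takes the MD5 of the UTF-8 bytes of the partition key (my working assumption for the encoding question above, which can be verified with the brute-force check in the answer below) and compares the resulting 128-bit integer against each shard's HashKeyRange:

```python
import hashlib

import boto3

kinesis = boto3.client("kinesis")

# Load the hash key ranges once. For simplicity this ignores pagination and
# any closed shards left over from resharding.
shards = kinesis.describe_stream(StreamName="my-stream")["StreamDescription"]["Shards"]


def shard_for_partition_key(partition_key):
    """Return the ShardId whose HashKeyRange contains MD5(partition_key)."""
    # Hash the UTF-8 bytes of the key and interpret the digest as a 128-bit integer.
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for shard in shards:
        start = int(shard["HashKeyRange"]["StartingHashKey"])
        end = int(shard["HashKeyRange"]["EndingHashKey"])
        if start <= hash_key <= end:
            return shard["ShardId"]
    raise ValueError("no open shard covers hash key %d" % hash_key)
```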

There are probably other solutions, but this should work and I'll accept the existing answer here if no challenger appears.


1 Answer


Excerpt from a previous answer:

  1. Try generating a few random partition keys and send a distinct value with each to the stream.
  2. Run a consumer application and see which shard delivered which value.
  3. Then map each partition key you used to the shard that received its record.

So, now that you know which partition key to use while sending data to a specific shard, you can use this map while sending those special "to be multiplexed" records...

It's hacky and brute force, but it will work.

Also see previous answer regarding partition keys and shards: https://stackoverflow.com/a/31377477/1622134

Hope this helps.

PS: If you use the low-level Kinesis APIs and create a custom PutRecord request, the response tells you which shard the data was placed on; PutRecordResponse contains the ShardId:

http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html
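That response field makes the brute-force mapping above easy to script. A hedged sketch using boto3 (the stream name and shard count are placeholders, and the probe records are throwaway test data):

```python
import uuid

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "my-stream"  # placeholder stream name
SHARD_COUNT = 2            # however many shards the stream actually has

# Keep trying random partition keys until one has landed on each shard.
partition_key_for_shard = {}
while len(partition_key_for_shard) < SHARD_COUNT:
    candidate = uuid.uuid4().hex
    response = kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=b"probe",
        PartitionKey=candidate,
    )
    partition_key_for_shard.setdefault(response["ShardId"], candidate)

print(partition_key_for_shard)
```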

Source: https://stackoverflow.com/a/34901425/1622134

  • If the question is a duplicate, please mark it as such. – Cubic Jan 26 '17 at 10:27
  • I'm not sure, the python tag is a little confusing. – az3 Jan 26 '17 at 11:53
  • That other question isn't quite the same. Interesting solution but not desirable in this case as it would fall to the client to balance the actual uids across the random partition keys and we have 13M of them. We can't randomly generate a partition key for each put like that other answer suggests because we'll have multiple consumers and we want a given uid to always go to the same consumer. – systemjack Jan 26 '17 at 18:17
  • This question and answer is where I'm looking currently: http://stackoverflow.com/questions/33633464/kinesis-partition-key-falls-always-in-the-same-shard?rq=1. The idea would be to load the stream description with its hash key ranges, md5 the uid and compare it to the ranges to find out which shard it will end up in. – systemjack Jan 26 '17 at 18:20