
I'm currently using a Kinesis data stream as a command queue for a high number of instances. The original stream is using the instance-id as the partition key.

The goal is to deliver the events, in order, to a worker pool. Each worker pool can handle one specific "instance origin", e.g. Azure, GCP, DataCenter001 - DataCenter100. Since each worker pool can only process a subset, I cannot read directly from this single stream: one partition read could return commands for instances from several origins. (I could of course simply skip those events, but then each pool would ignore 90% of the commands.) To keep the whole process near real-time, my idea was to split the original "uber stream" by "instance origin" into one Kinesis stream per origin. Each worker pool can then consume its own Kinesis stream.

{
  "instanceID": "9f858f6c-f924-490a-93c7-0b9c037cccc1",
  "origin": "DataCenter007",
  "command": "cd /etc"
}

The original stream partitions on instanceID. Kinesis Analytics filters on origin and forwards to a new stream that is also partitioned on instanceID.
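To state the intended routing precisely, here is a minimal plain-Java model of it, independent of Flink and Kinesis. `RoutedEvent` and `OriginRouter` are hypothetical names introduced only for this illustration (a stand-in for `InstanceEvent`), and the model assumes the input list is already in the order in which the commands must be executed. It only shows that splitting by origin keeps the relative order of each instance's commands; it says nothing about what the sink does afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the InstanceEvent type used in the question.
final class RoutedEvent {
    final String instanceId;
    final String origin;
    final String command;

    RoutedEvent(String instanceId, String origin, String command) {
        this.instanceId = instanceId;
        this.origin = origin;
        this.command = command;
    }
}

final class OriginRouter {
    // Splits one ordered list into one ordered list per origin.
    // Origins are compared case-insensitively, like equalsIgnoreCase in the filters.
    static Map<String, List<RoutedEvent>> splitByOrigin(List<RoutedEvent> ordered) {
        Map<String, List<RoutedEvent>> perOrigin = new LinkedHashMap<>();
        for (RoutedEvent event : ordered) {
            String key = event.origin.toUpperCase(Locale.ROOT);
            // Appending in iteration order keeps each instance's commands in input order.
            perOrigin.computeIfAbsent(key, k -> new ArrayList<>()).add(event);
        }
        return perOrigin;
    }
}
```

This is the property I want the real pipeline to keep end to end: within one origin stream, the commands of a given instanceID arrive in the same relative order as in the original stream.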

However, I couldn't find out whether the KinesisStreamsSink retains the order per partition key.

DataStream<InstanceEvent> streamA = originalStream.filter(value -> value.getOrigin().equalsIgnoreCase("AZURE"));
DataStream<InstanceEvent> streamB = originalStream.filter(value -> value.getOrigin().equalsIgnoreCase("GCP"));

streamA.sinkTo(KinesisStreamsSink.<InstanceEvent>builder()
        .setStreamName("DataCenter001")
        .setSerializationSchema(new InstanceEventSerializationSchema())
        .setKinesisClientProperties(producerProperties)
        .setPartitionKeyGenerator(InstanceEvent::getInstanceId)
        .build());
streamB.sinkTo(KinesisStreamsSink.<InstanceEvent>builder()
        .setStreamName("DataCenter002")
        .setSerializationSchema(new InstanceEventSerializationSchema())
        .setKinesisClientProperties(producerProperties)
        .setPartitionKeyGenerator(InstanceEvent::getInstanceId)
        .build());
...

Will streams DataCenter001 and DataCenter002 keep the ordering guarantee of originalStream if they use the same partition key? Or can individual batch failures lead to a wrong order?
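To make the batch-failure concern concrete, here is a small plain-Java simulation. It does not use or describe the actual KinesisStreamsSink: the batch model, the rule that a failing entry does not stop the rest of its batch, and the policy of re-queuing failed entries at the front of the buffer are all assumptions made for this illustration, and `BatchRetrySimulation` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BatchRetrySimulation {
    // Sends records in batches of batchSize and returns the order in which
    // they were accepted. Records listed in failOnce fail on their first
    // attempt only; the other entries of the same batch are still accepted
    // (assumed partial-failure behaviour).
    static List<String> deliver(List<String> records, int batchSize, Set<String> failOnce) {
        Deque<String> buffer = new ArrayDeque<>(records);
        Set<String> pendingFailures = new HashSet<>(failOnce);
        List<String> accepted = new ArrayList<>();

        while (!buffer.isEmpty()) {
            // Take up to batchSize records from the front of the buffer.
            List<String> batch = new ArrayList<>();
            while (batch.size() < batchSize && !buffer.isEmpty()) {
                batch.add(buffer.pollFirst());
            }

            List<String> failed = new ArrayList<>();
            for (String record : batch) {
                if (pendingFailures.remove(record)) {
                    failed.add(record);   // rejected on this attempt
                } else {
                    accepted.add(record); // accepted even if an earlier entry failed
                }
            }

            // Assumed retry policy: put failed entries back at the front,
            // keeping their relative order.
            for (int i = failed.size() - 1; i >= 0; i--) {
                buffer.addFirst(failed.get(i));
            }
        }
        return accepted;
    }
}
```

In this model, three commands for the same instance (`cmd1`, `cmd2`, `cmd3`) sent with a batch size of 2, where `cmd1` fails once, are accepted as `cmd2`, `cmd1`, `cmd3`. That is exactly the kind of reordering I am worried about, even though all three records share one partition key.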

peterulb