
My use case requires reading messages from Kafka topics and processing them in the order in which they were published to Kafka.

The Kafka producer is responsible for publishing each group of messages, sorted, to a single Kafka topic-partition, and I need to process each group of messages in the same vertex processor, in the same order.

enter image description here

The image above represents the basic idea. There are a few KafkaSource processors reading from Kafka.

An edge connects them to a vertex that decodes the Kafka messages, and so on.

I could use the Kafka message key as the partitioning key, but I think I would end up with unbalanced decode processors.

Given that:

  • How can I create a new Partitioner? I couldn't find any example to work from.
  • In the new Partitioner, how can I identify the KS processor that emitted the message? I would like a 1-to-1 relationship between the processors of the previous vertex and those of the next vertex: for instance, KS#0 always sends its messages to Decode#0, KS#1 to Decode#1, and so on.
  • Do I need a new partitioner for that, or is there some out-of-the-box functionality to achieve it?
Oliv
Kleyson Rios
  • A partitioner is just a https://github.com/hazelcast/hazelcast-jet/blob/master/hazelcast-jet-core/src/main/java/com/hazelcast/jet/function/DistributedFunction.java . For example, `entryKey()` is just https://github.com/hazelcast/hazelcast-jet/blob/master/hazelcast-jet-core/src/main/java/com/hazelcast/jet/function/DistributedFunctions.java#L51 – Neil Stevenson Jan 28 '18 at 14:32
  • Thanks @NeilStevenson, but is it possible in the partitioner to get the "id" of the processor that emitted the data and force the data to be routed to a specific processor id? – Kleyson Rios Jan 28 '18 at 17:36
  • I think this could be an out-of-the-box option, right? https://github.com/hazelcast/hazelcast-jet/blob/4f9cac1ee3424ce9627433c350c7f42265cb57bb/hazelcast-jet-core/src/main/java/com/hazelcast/jet/core/Edge.java#L475 - https://github.com/hazelcast/hazelcast-jet/blob/8a3c946853b632b71c091da05ee11a78cffd2f55/hazelcast-jet-core/src/test/java/com/hazelcast/jet/core/RoutingPolicyTest.java#L130 – Kleyson Rios Jan 28 '18 at 18:13
  • Yip. There will be at least one DAG instance per JVM and multiple JVMs. You control whether a vertex in a DAG sends to other DAGs, and there are various ways this affects performance (e.g. filtering before distributing is faster than distributing before filtering). Pipelines are easier than DAGs if you don't need that low-level control, but the multiplexing on the right of your diagram suggests you'll need the latter. – Neil Stevenson Jan 28 '18 at 22:10
  • Can you make use of the [`projectionFn`](https://github.com/hazelcast/hazelcast-jet/blob/master/hazelcast-jet-kafka/src/main/java/com/hazelcast/jet/KafkaSources.java#L89) on the Kafka source? Could you put all your decode+meta logic in it? That would directly ensure ordering. – Marko Topolnik Jan 29 '18 at 09:17
  • Interesting, @MarkoTopolnik. I think I could put them in the projectionFn. I'm a little afraid that this could become a bottleneck in the system, but I will give it a try. – Kleyson Rios Jan 29 '18 at 10:00
  • Yes, it depends on the throughput and the cost of decoding. Typically you have some CPU headroom left even when consuming from Kafka at full speed. – Marko Topolnik Jan 29 '18 at 10:06

1 Answer


You don't need a partitioner for this. Edge.isolated(), together with equal local parallelism on both vertices, is designed for exactly this:

dag.edge(between(kafkaSource, decode).isolated());

In this case, each instance of the source processor is bound to exactly one instance of the target processor, and the ordering of items is preserved. Keep in mind that a single Kafka source processor can take items from more than one Kafka partition, so you have to track the Kafka partition id. Even if you make the total number of Jet processors equal to the number of Kafka partitions, you can't rely on a one-to-one mapping: if one of the members fails and the job is restarted, the total number of Jet processors will decrease but the number of Kafka partitions won't.
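The point about tracking the Kafka partition id can be sketched in plain Java, independent of any Jet API. This is only an illustration under assumptions: the class and method names are hypothetical, and it assumes each item reaching the decode processor carries the id of the Kafka partition it came from. It shows one way to keep separate state per partition, so that interleaved items from two partitions handled by the same processor instance do not share state:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Jet): keeps independent state for each
 * Kafka partition whose items arrive at a single processor instance.
 */
final class PerPartitionSequencer {
    // Kafka partition id -> number of items seen so far from that partition
    private final Map<Integer, Long> seenByPartition = new HashMap<>();

    /** Records one item and returns its 0-based position within its partition. */
    long nextPosition(int partitionId) {
        long position = seenByPartition.getOrDefault(partitionId, 0L);
        seenByPartition.put(partitionId, position + 1);
        return position;
    }

    /** Number of distinct partitions this instance has seen so far. */
    int trackedPartitions() {
        return seenByPartition.size();
    }
}
```

In a custom processor, an instance of such a helper could live in a field, with the counter replaced by whatever per-group decode state the application needs.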

Also note that the default local parallelism differs between vertices: for the Kafka source it defaults to 2, while for other vertices it is typically equal to the CPU count. You need to set the same value on both vertices manually.

Another limitation is that if you use Processors.mapP for your decode vertex, the mapping function must be stateless. Because you need the items to be ordered, I assume you have some state to keep. For that to work correctly, you have to use a custom processor:

Vertex decode = dag.newVertex("decode", MyDecodeP::new);

Processor implementation:

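private static class MyDecodeP extends AbstractProcessor {
    // State kept by this processor instance across items.
    private Object myStateObject;

    @Override
    protected boolean tryProcess(int ordinal, @Nonnull Object item) {
        // Decode the item here, reading and updating myStateObject as needed.
        Object mappedItem = ...;
        // Returning false (outbox full) makes Jet call tryProcess again with the same item.
        return tryEmit(mappedItem);
    }
}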

The answer was written for Jet 0.5.1.

Oliv