
We are planning to use the JMS Source Connector to stream data into our Kafka cluster. The data from ActiveMQ is in XML format. The JMS Source Connector uses the internal messageID (Message.getJMSMessageID()) as the key.

The field that serves as the key - on the Kafka topic that the Connector streams to - needs to be retrieved from within the (XML) payload.

To achieve this, a few steps are necessary in the Connector.

  • To transform the XML to the internal Kafka Connect Struct we use a custom transformation plugin (https://github.com/jcustenborder/kafka-connect-transform-xml)
  • Then the ValueToKey and ExtractField transformers set the key that was part of the payload.
  • Now this key-value pair is ready to be sent to our Kafka topic.
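As a sketch, the chain of steps above could look like this in the connector configuration. The connector class, the `schema.path` property, and the field name `transactionId` are assumptions for illustration; check the plugin's documentation for the exact property names:

```json
{
  "name": "activemq-source",
  "connector.class": "io.confluent.connect.jms.JmsSourceConnector",
  "tasks.max": "1",
  "transforms": "xml,valueToKey,extractKey",
  "transforms.xml.type": "com.github.jcustenborder.kafka.connect.transform.xml.FromXml$Value",
  "transforms.xml.schema.path": "file:///etc/connect/transaction.xsd",
  "transforms.valueToKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
  "transforms.valueToKey.fields": "transactionId",
  "transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
  "transforms.extractKey.field": "transactionId"
}
```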

We are dealing with financial transactions and need to guarantee the order of the messages. We have high throughput, and from what I understand, configuring tasks.max allows parallelism by distributing tasks among the Kafka Connect Workers.

First question: How does parallelism work in combination with single message transformers? Do the '(Source)Connector - Transformer - Converter' form a pipeline that is together distributed by setting tasks.max, or does the tasks.max setting somehow only apply to the Connector?

The latter seems a bit odd, so assuming the former is correct I have another doubt.

Our Kafka topic key - that will guarantee the order on the Kafka topic - is determined within the Connector's tasks. With tasks.max > 1, the incoming messages are distributed among the running tasks.

Since messages are distributed among multiple tasks, two (or more) messages containing the same key within the payload can arrive in a certain order from ActiveMQ and yet be sent to different Kafka Connect tasks.

Theoretically the order could be reversed when finally streaming into the Kafka topic (on the same partition, as they now have the same key).

Am I reasoning correctly, and is there a way to circumvent this? Or is an ordering guarantee only possible in this use-case with a single task?

OneCricketeer
LeonardoBonacci

2 Answers


does the tasks.max setting somehow only apply to the Connector?

This. The setting determines how many tasks the Connector runs; the transform chain executes within each of those tasks.

Our Kafka topic key - that will guarantee the order on the Kafka topic

No, it doesn't. It only guarantees partitioning, and that's it.
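To illustrate what key-based partitioning does (and does not) guarantee, here is a sketch using a plain deterministic hash rather than Kafka's actual murmur2 partitioner - equal keys always map to the same partition, but nothing constrains the order in which records reach that partition:

```python
def partition_for(key: bytes, num_partitions: int) -> int:
    """Sketch of key-based partitioning: a deterministic hash of the key
    modulo the partition count. (Kafka's default partitioner uses murmur2;
    any deterministic hash demonstrates the same property.)"""
    h = 0
    for b in key:
        h = (h * 31 + b) & 0x7FFFFFFF  # simple stable hash, stays non-negative
    return h % num_partitions

# Same key -> same partition, every time...
assert partition_for(b"A", 6) == partition_for(b"A", 6)
# ...but this says nothing about the ORDER in which two producers'
# records for that partition arrive.
```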

is an ordering guarantee only possible in this use-case using one task only.

It depends on the source. I don't know AMQ, but if reading a message removes it from the queue, there's no chance that multiple tasks would get the same message.

OneCricketeer
  • Perhaps I wasn't clear before. The relative ordering of ActiveMQ messages based on a field within the payload (for example some field has value 'A') needs to be maintained on the Kafka topic. Same key means same partition on the topic. The doubt arises when Kafka Connect runs in parallel (tasks.max>1) and ActiveMQ records entering different Kafka Connect Worker threads can overtake each other. The only ordering guarantee I see is having tasks.max=1 – LeonardoBonacci Feb 20 '20 at 20:40

How does parallelism work in combination with single message transformers?

Your first alternative is correct - the transformers execute in each of the running tasks, in the order defined in your connector configuration. Each SourceRecord generated by a task is processed by all transformers in that same task and then sent to Kafka.

Am I reasoning correctly, and is there a way to circumvent this? Or is an ordering guarantee only possible in this use-case with a single task?

The easiest way to guarantee ordering of messages is to have a single task, but this obviously does not scale. There are several ways to get around this.

  1. Partition the messages that the tasks read, such that each task always reads all messages with the same key. Some message-queue servers have built-in support for this. ActiveMQ, for instance, supports Selectors. In this case you might be able to have each task read only messages where:

<message ID> MOD <Number of tasks> == <Task ID>

This is not trivial to implement (you need to deal with changes in the number of tasks at runtime, for instance), but it is doable.
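As a sketch of the idea (not actual ActiveMQ selector syntax - selectors are SQL92 expressions over message headers), each task claims only the keys whose numeric value falls on its slot, so all messages with the same key stay in one task and keep their relative order:

```python
def task_owns(key: str, task_id: int, num_tasks: int) -> bool:
    """A task processes a message only if the key's numeric value lands
    on its slot. Equal keys always land on the same task, preserving
    their relative order within that task."""
    # Sum of byte values: a stand-in for a stable numeric message ID
    numeric = sum(key.encode())
    return numeric % num_tasks == task_id

# Every key is owned by exactly one of the three tasks
for key in ["A", "B", "C"]:
    owners = [t for t in range(3) if task_owns(key, t, 3)]
    assert len(owners) == 1
```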

  2. Partition the original messages into different ActiveMQ queues based on the ID, so that you get one Connect task per ActiveMQ queue.

  3. Sort the messages using Kafka Streams. Basically, you will read the messages from ActiveMQ and write them into the topic 'unsorted-messages' using Kafka Connect. A separate Kafka Streams application will read from the 'unsorted-messages' topic and write the sorted data to the 'sorted-messages' topic. This is discussed here: Apache Kafka order windowed messages based on their value
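The Streams idea in miniature: buffer records per key and emit them ordered by a sequence number carried in the payload. This is only a sketch - a real Kafka Streams application would do this with a windowed state store rather than an in-memory dict, and the fixed flush size stands in for a time window:

```python
from collections import defaultdict

def reorder(messages, flush_size=3):
    """Buffer (key, seq, value) records and emit each key's records in
    seq order once flush_size of them have arrived. A stand-in for a
    windowed state store in Kafka Streams."""
    buffers = defaultdict(list)
    out = []
    for key, seq, value in messages:
        buffers[key].append((seq, value))
        if len(buffers[key]) == flush_size:
            out.extend((key, s, v) for s, v in sorted(buffers[key]))
            buffers[key].clear()
    return out

# Records for key "A" arrived out of order from parallel tasks
unsorted = [("A", 2, "x"), ("A", 1, "y"), ("A", 3, "z")]
assert reorder(unsorted) == [("A", 1, "y"), ("A", 2, "x"), ("A", 3, "z")]
```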

Barak