We are planning to use the JMS Source Connector to stream data into our Kafka cluster. The data coming from ActiveMQ is in XML format. The JMS Source Connector uses the internal JMS message ID (Message.getJMSMessageID()) as the record key.
The field that serves as the key on the Kafka topic the Connector streams to instead needs to be retrieved from within the XML payload.
To achieve this, a few steps are necessary in the Connector:
- To transform the XML to the internal Kafka Connect Struct we use a custom transformation plugin (https://github.com/jcustenborder/kafka-connect-transform-xml)
- Then the ValueToKey and ExtractField transforms set the record key from a field that was part of the payload.
- Now this key-value pair is ready to be sent to our Kafka topic.
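A sketch of the resulting connector configuration, to make the transform chain concrete. The connector class and the XML transform's property names are assumptions based on the linked plugin's README, and the field name `transactionId` is a placeholder for whatever key field our payload actually carries:

```properties
# Sketch only: connector.class and the FromXml property names are assumed
# from the plugin documentation; transactionId is a placeholder field.
name=activemq-source
connector.class=io.confluent.connect.jms.JmsSourceConnector
tasks.max=1
kafka.topic=transactions

transforms=xml,valueToKey,extractKey
# 1. Parse the XML payload into a Kafka Connect Struct (jcustenborder plugin)
transforms.xml.type=com.github.jcustenborder.kafka.connect.transform.xml.FromXml$Value
transforms.xml.schema.path=file:///path/to/schema.xsd
# 2. Copy the key field out of the value into the record key (as a Struct)
transforms.valueToKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.valueToKey.fields=transactionId
# 3. Reduce the single-field key Struct to the plain field value
transforms.extractKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractKey.field=transactionId
```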
We are dealing with financial transactions and need to guarantee the order of the messages. We have high throughput, and from what I understand, configuring tasks.max allows parallelism by distributing tasks among the Kafka Connect Workers.
First question: how does parallelism work in combination with single message transforms? Do '(Source)Connector - Transformer - Converter' together form a pipeline that is distributed as a whole by setting tasks.max, or does the tasks.max setting somehow apply only to the Connector?
The latter seems a bit odd, so assuming the former is correct, I have another doubt.
Our Kafka topic key - which will guarantee the order on the Kafka topic - is determined within the Connector's tasks. With tasks.max > 1, the incoming messages are distributed among the running tasks.
Two (or more) messages containing the same key within the payload can arrive in a certain order from ActiveMQ and yet be handed to different Kafka Connect tasks.
Since those tasks produce independently, the order could theoretically be reversed by the time the messages are finally written to the Kafka topic (to the same partition, as they now share the same key).
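To illustrate the concern: with key-based partitioning, the partition is a pure function of the key bytes, so same key always means same partition - but the partition log only preserves the order in which records reach the broker, not the order ActiveMQ delivered them. A minimal Python sketch (using `zlib.crc32` as a stand-in hash; Kafka's default partitioner actually uses murmur2, but the principle is the same):

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: bytes) -> int:
    # Stand-in for Kafka's default partitioner: hash(key) % num_partitions.
    return zlib.crc32(key) % NUM_PARTITIONS

# Two updates for the same transaction, delivered in this order by ActiveMQ:
m1 = {"key": b"tx-42", "value": "status=PENDING"}
m2 = {"key": b"tx-42", "value": "status=SETTLED"}

# Same key, so both records always land on the same partition.
assert partition_for(m1["key"]) == partition_for(m2["key"])

# But if m1 and m2 were dispatched to different Connect tasks, and the task
# holding m2 happens to produce first, the partition log ends up reversed:
partition_log = []
partition_log.append(m2)  # task B produces first
partition_log.append(m1)  # task A produces later
print([m["value"] for m in partition_log])
```

So the same-partition guarantee alone does not restore the original delivery order once two tasks hold messages with the same key.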
Am I right reasoning this way, and is there a way to circumvent this? Or is an ordering guarantee in this use case only possible with a single task?