
In my Scala (2.11) streaming application I consume data from a queue in IBM MQ and write it to a Kafka topic that has one partition. After a message is consumed from MQ, its payload is split into 3000 smaller messages that are stored in a sequence of strings. Each of these 3000 messages is then sent to Kafka (version 2.x) using KafkaProducer.

How would you send those 3000 messages?

I can't increase the number of queues in IBM MQ (not under my control) nor the number of partitions in the topic (ordering of messages is required, and writing a custom partitioner will impact too many consumers of the topic).

The Producer settings are currently:

  • acks=1
  • linger.ms=0
  • batch.size=65536

But optimizing them is probably a question of its own and not part of my current problem.

Currently, I am doing

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

private lazy val kafkaProducer: KafkaProducer[String, String] = new KafkaProducer[String, String](someProperties)
val messages: Seq[String] = Seq(String1, …, String3000)
for (msg <- messages) {
    val future = kafkaProducer.send(new ProducerRecord[String, String](someTopic, someKey, msg))
    val recordMetadata = future.get()
}

To me this looks neither elegant nor efficient, since get() blocks on every single send. Is there a programmatic way to increase throughput?


edit after answer from @radai

Thanks to the answer pointing me in the right direction, I had a closer look at the different producer methods. The book Kafka: The Definitive Guide lists these approaches:

Fire-and-forget We send a message to the server and don't really care if it arrives successfully or not. Most of the time, it will arrive successfully, since Kafka is highly available and the producer will retry sending messages automatically. However, some messages will get lost using this method.

Synchronous send We send a message, the send() method returns a Future object, and we use get() to wait on the future and see if the send() was successful or not.

Asynchronous send We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker.

And now my code looks like this (leaving out error handling and the definition of Callback class):

  val asyncProducer = new KafkaProducer[String, String](someProperties)

  for (msg <- messages) {
    val record = new ProducerRecord[String, String](someTopic, someKey, msg)
    asyncProducer.send(record, new compareProducerCallback)
  }
  asyncProducer.flush()
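For completeness, a minimal sketch of what the omitted callback class could look like (the failure counter and the log line are illustrative assumptions, not my actual implementation). With asynchronous sends, a broker-side error only surfaces inside onCompletion, so this is where error handling has to live:

```scala
import org.apache.kafka.clients.producer.{Callback, RecordMetadata}

class compareProducerCallback extends Callback {
  // Track failures so the caller can inspect them after flush() (illustration only).
  @volatile var failures: Int = 0

  override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
    // exception is null on success; non-null means the send ultimately failed.
    if (exception != null) {
      failures += 1
      // In real code: log properly, retry, or route to a dead-letter queue.
      System.err.println(s"send failed: ${exception.getMessage}")
    }
  }
}
```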

I compared all three methods by sending 10000 very small messages. These are my measurements:

  1. Fire-and-forget: 173683464 ns (~174 ms)

  2. Synchronous send: 29195039875 ns (~29.2 s)

  3. Asynchronous send: 44153826 ns (~44 ms)

To be honest, there is probably more potential to optimize all of them by choosing the right properties (batch.size, linger.ms, ...).

Michael Heil
    what API are you using where you can construct a KafkaProducer with a string?! – radai Oct 16 '19 at 20:08
  • Isn't there an IBM MQ Kafka Connect source? Then you don't need to write Scala https://github.com/ibm-messaging/kafka-connect-mq-source – OneCricketeer Oct 16 '19 at 23:50
  • Alright, well your current code is perfectly fine, but maybe not if you want to block on sending each and every message. There's also higher level libraries like Akka, Fs2, or ZIO which could help with more functional Scala patterns – OneCricketeer Oct 17 '19 at 04:34
  • @mike This connector is written by people with deep MQ understanding and I would expect it to handle most scenarios. I suggest you open an issue on the MQ connector describing your issues. – Mickael Maison Oct 17 '19 at 09:49

1 Answer


The biggest reason I can see for your code being slow is that you're waiting on every single send future.

Kafka was designed to send batches. By sending one record at a time you're paying a full round trip for every single record, and you're not getting any benefit from compression.

The idiomatic thing to do would be to send everything first, and then block on all the resulting futures in a second loop.

Also, if you intend to do this I'd bump linger.ms back up (otherwise your first record would result in a batch of size one, slowing you down overall; see https://en.wikipedia.org/wiki/Nagle%27s_algorithm) and call flush() on the producer once your send loop is done.
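In code, that two-loop pattern might look like the following sketch (the sendAll name and signature are hypothetical; it takes the Producer interface so it works with any producer instance):

```scala
import org.apache.kafka.clients.producer.{Producer, ProducerRecord, RecordMetadata}

def sendAll(producer: Producer[String, String],
            topic: String, key: String,
            messages: Seq[String]): Seq[RecordMetadata] = {
  // First loop: hand every record to the producer without blocking,
  // so batching and compression can kick in.
  val futures = messages.map { msg =>
    producer.send(new ProducerRecord[String, String](topic, key, msg))
  }
  // Push out any partially filled batches.
  producer.flush()
  // Second loop: only now block on the futures to confirm delivery.
  futures.map(_.get())
}
```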

radai