
One of our systems has a microservice architecture using Apache Kafka as a service bus. Low latency is a very important factor, but reliability and consistency (exactly-once processing) are even more important.

When we performed some load tests we noticed significant performance degradation, and all investigations pointed to big increases in producer and consumer latencies on the Kafka topics. No matter how much configuration we changed or how many resources we added, we could not get rid of the symptoms.

At the moment we need to process 10 transactions per second (TPS), and the load test exercises 20 TPS, but as the system evolves and gains more functionality we know we will reach a stage where we need 500 TPS, so we started to worry whether we can achieve this with Kafka.

As a proof of concept I tried to switch one of our microservices to use a Chronicle Queue instead of a Kafka topic. It was easy to migrate by following the Avro example from the Chronicle-Queue-Demo GitHub repo:

import lombok.SneakyThrows;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueue;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueueBuilder;

public class MessageAppender {
    private static final String MESSAGES = "/tmp/messages";

    private final AvroHelper avroHelper;            // Avro schema/serialization helper from the demo repo
    private final SingleChronicleQueue queue;       // kept so the queue can be closed when we are done
    private final ExcerptAppender messageAppender;

    public MessageAppender() {
        avroHelper = new AvroHelper();
        queue = SingleChronicleQueueBuilder.binary(MESSAGES).build();
        messageAppender = queue.acquireAppender();
    }

    @SneakyThrows
    public long append(Message message) {
        try (var documentContext = messageAppender.writingDocument()) {
            // Build a fresh Avro record for every message and serialize it
            // straight into the open document's output stream.
            var paymentRecord = avroHelper.getGenericRecord();
            paymentRecord.put("id", message.getId());
            paymentRecord.put("workflow", message.getWorkflow());
            paymentRecord.put("workflowStep", message.getWorkflowStep());
            paymentRecord.put("securityClaims", message.getSecurityClaims());
            paymentRecord.put("payload", message.getPayload());
            paymentRecord.put("headers", message.getHeaders());
            paymentRecord.put("status", message.getStatus());
            avroHelper.writeToOutputStream(paymentRecord, documentContext.wire().bytes().outputStream());
            return messageAppender.lastIndexAppended();
        }
    }
}

After configuring that appender we ran a loop to produce 100,000 messages to a Chronicle Queue. Every message has the same size, and the final size of the queue file was 621 MB. It took 22 minutes, 20 seconds and 613 milliseconds (~1341 seconds) to write all the messages, an average of about 75 messages/second.
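
The test harness had roughly this shape (a minimal sketch; buildTestMessage(i) is a hypothetical stand-in for the factory that creates our ~6.36 KB test messages):

var appender = new MessageAppender();
long start = System.currentTimeMillis();
for (int i = 0; i < 100_000; i++) {
    // buildTestMessage(i) is a placeholder for our real message factory
    appender.append(buildTestMessage(i));
}
long elapsedMs = System.currentTimeMillis() - start;
System.out.printf("Wrote 100,000 messages in %d ms (~%.0f msg/s)%n",
        elapsedMs, 100_000 * 1000.0 / elapsedMs);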

This was definitely not what we hoped for, and so far from the latencies advertised in the Chronicle documentation that it made me believe my approach was not the correct one. I admit that our messages are not small at about 6.36 KB/message, but I have no doubt that storing them in a database would be faster, so I still think I am not doing it right.

It is important that our messages are processed one by one.

Thank you in advance for your input and/or suggestions.

Julian
  • Can I suggest you challenge whether you need a distributed micro-service architecture? What problem is it meant to be solving for you? Is it actually solving that problem? Remember a method call takes nanoseconds but a network call takes milliseconds (x1000 slower). You can get most of the advantages of micro-services by having many DLL projects and dependency injection. Each project should have its own individual isolated data store if needed. Maybe add a pub/sub backbone, e.g. Guava, or use Akka (Akka will let you pretty seamlessly scale to a distributed system if you outgrow your hardware). My 2 cents, hope it helps – DarcyThomas Apr 17 '21 at 22:28
  • In my POC I used just my MacBook, so there was no networking. And yes, our microservice architecture is done in such a way that every microservice instance has its own data store. There will of course be networking when passing messages between microservices running on different machines, but in this POC that was not the case. – Julian Apr 18 '21 at 00:06
  • Maybe attach a profiler and see if you are making an excessive amount of object allocations, which is particularly bad if they are getting promoted to higher generations. See if they are your objects or Chronicle Queue ones. Are you maxing out your RAM or CPU (or network)? – DarcyThomas Apr 19 '21 at 02:57
  • Hand-building the Avro object each time seems a bit of a code smell to me. Can you create a predefined Message -> Avro serializer and use that to feed the queue? Or, just for testing, create one Avro object outside the loop and feed that one object into the queue many times. That way you can see if it is the building or the queuing which is the bottleneck. – DarcyThomas Apr 19 '21 at 03:21
  • Please post your last suggestion as an answer. I put some log entries in and found out that when creating the test message, a digital signature and a marshalling operation took around 12 ms to complete, which, multiplied by 100,000, of course adds up. Reusing the same signature reduced the time to only three seconds, which is a totally different result. Very happy with the outcome. Still to see what will happen when the queue lives somewhere in AWS on some EBS storage, but so far it looks promising. – Julian Apr 19 '21 at 06:59

1 Answer


Hand-building the Avro object each time seems a bit of a code smell to me.

Can you create a predefined Message -> Avro serializer and use that to feed the queue?
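
Something like this minimal sketch could work, reusing the AvroHelper and Message types from your appender (I'm assuming getGenericRecord() hands back an org.apache.avro.generic.GenericRecord, and that writeToOutputStream can throw a checked exception, hence the @SneakyThrows):

import java.io.OutputStream;
import lombok.SneakyThrows;
import org.apache.avro.generic.GenericRecord;

public class MessageAvroSerializer {
    private final AvroHelper avroHelper = new AvroHelper();

    // One shared Message -> Avro mapping, instead of hand-building
    // the record at every call site.
    public GenericRecord toRecord(Message message) {
        var avroRecord = avroHelper.getGenericRecord();
        avroRecord.put("id", message.getId());
        avroRecord.put("workflow", message.getWorkflow());
        avroRecord.put("workflowStep", message.getWorkflowStep());
        avroRecord.put("securityClaims", message.getSecurityClaims());
        avroRecord.put("payload", message.getPayload());
        avroRecord.put("headers", message.getHeaders());
        avroRecord.put("status", message.getStatus());
        return avroRecord;
    }

    @SneakyThrows
    public void writeTo(Message message, OutputStream out) {
        avroHelper.writeToOutputStream(toRecord(message), out);
    }
}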

Or, just for testing, create one Avro object outside the loop and feed that one object into the queue many times. That way you can see if it is the building or the queuing which is the bottleneck.
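
In code, that second experiment might look something like this (a minimal sketch reusing the MessageAppender from the question; buildTestMessage() is a placeholder for however the test currently creates a message):

// Build (sign, marshal, etc.) one message once, outside the loop...
Message message = buildTestMessage();
var appender = new MessageAppender();

long start = System.nanoTime();
for (int i = 0; i < 100_000; i++) {
    appender.append(message); // ...so the loop times only the queue writes
}
System.out.printf("Queue-only time for 100,000 appends: %d ms%n",
        (System.nanoTime() - start) / 1_000_000);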


More general advice:

Maybe attach a profiler and see if you are making an excessive number of object allocations, which is particularly bad if they are getting promoted to higher generations.

See if they are your objects or Chronicle Queue ones.

Is your code maxing out your RAM or CPU (or network)?

DarcyThomas