
The usual way of implementing the outbox pattern is to store the message payload in an outbox table and have a separate process (the Message Relay) query for pending messages and publish them into a message broker, Kafka in my case.

The state of the outbox table could be as shown below.

 OUTBOX TABLE
 ------------------------------------------
| ID | STATE     | TOPIC   | PAYLOAD      |
 ------------------------------------------
| 1  | PROCESSED | user    | ...          |
| 2  | PENDING   | user    | ...          |
| 3  | PENDING   | billing | ...          |
 ------------------------------------------

My Message Relay is a Spring Boot/Cloud Stream application that periodically (via @Scheduled) looks for PENDING records, publishes them to Kafka and updates each record to the PROCESSED state.
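The poll-publish-mark cycle can be sketched as follows. This is a simplified stand-in, not the actual application: an in-memory list plays the role of the outbox table and a `published` list plays the role of the Kafka topic; in the real relay the same steps would run against JdbcTemplate/JPA and a Kafka producer inside the @Scheduled method.

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch of the relay's poll-publish-mark cycle. The in-memory outbox
// list and the published list are stand-ins for the database table and Kafka.
public class MessageRelaySketch {
    enum State { PENDING, PROCESSED }

    static class OutboxRecord {
        final long id;
        final String topic;
        final String payload;
        State state = State.PENDING;
        OutboxRecord(long id, String topic, String payload) {
            this.id = id; this.topic = topic; this.payload = payload;
        }
    }

    final List<OutboxRecord> outbox = new ArrayList<>();
    final List<String> published = new ArrayList<>(); // stand-in for the Kafka topic

    // Body of the @Scheduled method: fetch PENDING rows, publish, mark PROCESSED.
    void relayOnce() {
        List<OutboxRecord> pending = outbox.stream()
                .filter(r -> r.state == State.PENDING)
                .collect(Collectors.toList());
        for (OutboxRecord r : pending) {
            published.add(r.topic + ":" + r.payload); // kafkaTemplate.send(...) in the real app
            r.state = State.PROCESSED;                // UPDATE outbox SET state = 'PROCESSED' ...
        }
    }
}
```

Both failure modes in the question live inside `relayOnce`: two instances running it concurrently can read the same PENDING rows, and a crash between the publish line and the state update republishes on restart.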

The first problem is: if I start multiple instances of the Message Relay, all of them would query the outbox table, and at some point different instances could fetch the same PENDING records and publish them to Kafka, generating duplicate messages. How can I prevent this?

Another situation: suppose there is only one Message Relay. It gets one PENDING record and publishes it to the topic, but crashes before updating the record to PROCESSED. When it starts up again it will find the same PENDING record and publish it again. Is there a way to avoid this duplication, or is the only way to design an idempotent system?

codependent

4 Answers


To prevent the first problem you have to use row-level database locking. For example:

SELECT * FROM outbox WHERE state = 'PENDING' FOR UPDATE

This prevents other transactions from selecting the same rows for update until the lock is released. If your database supports it (PostgreSQL, Oracle, MySQL 8.0+), adding SKIP LOCKED lets each relay instance skip rows another instance has already locked instead of blocking on them.

The second problem you cannot fully solve, because there is no distributed transaction spanning the database and Kafka.

One mitigation is to set the record to an intermediate state such as PROCESSING before sending it to Kafka. If the application crashes, on startup it should check for records stuck in the PROCESSING state and run a cleanup task to find out whether they were already sent to Kafka.
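The claim step can be sketched like this. The compare-and-set below mirrors a conditional SQL update such as `UPDATE outbox SET state = 'PROCESSING' WHERE id = ? AND state = 'PENDING'`, where only the worker whose update reports one affected row goes on to publish; the schema and method names are illustrative assumptions, not the answer's literal code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the claim-before-publish step: move the record to PROCESSING
// atomically, so that at most one relay instance wins the claim.
public class ClaimSketch {
    enum State { PENDING, PROCESSING, PROCESSED }

    static class OutboxRecord {
        final AtomicReference<State> state = new AtomicReference<>(State.PENDING);
    }

    // Returns true for exactly one caller, like "1 row affected" in SQL.
    static boolean claim(OutboxRecord r) {
        return r.state.compareAndSet(State.PENDING, State.PROCESSING);
    }

    static void publishAndFinish(OutboxRecord r) {
        // kafkaTemplate.send(...) would go here. A crash at this point leaves
        // the record in PROCESSING, which the startup cleanup task looks for.
        r.state.set(State.PROCESSED);
    }
}
```

The PROCESSING state does not eliminate duplicates by itself; it narrows the problem to records found in PROCESSING after a crash, which the cleanup task must reconcile.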

But the best solution would be to have an idempotent system that can handle duplicates.

Simon Martinelli
The consumer can keep a log of messages and check by messageId whether the same message has arrived before (in case it can't be made idempotent). – Vikash Sep 21 '20 at 05:57

You can use Debezium (https://debezium.io/) to read the database's transaction log (e.g. MySQL's binlog) and write the change events to Kafka. It addresses both of your problems: there is a single connector reading the log, and it tracks its position in the log so it can resume after a crash.

Roee Gavirel

For the 1st problem, you can use the ShedLock library. It ensures that, at any given time, only one instance of your service executes the scheduled task.

For the 2nd problem, yes, you will have to develop an idempotent consumer. You can do that by attaching a message id to each message and maintaining a table of processed ids on the consumer side: if a message arrives whose id is already in the table, simply ignore it.
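The dedup check described above can be sketched as follows. A `Set` stands in for the processed-ids table; in a real service that table would be written in the same database transaction as the business change, so the check and the side effect commit together (an assumption of this sketch).

```java
import java.util.*;

// Sketch of an idempotent consumer: look up each message's id in a
// processed-ids store and skip redelivered duplicates.
public class IdempotentConsumer {
    private final Set<String> processedIds = new HashSet<>();
    private final List<String> handled = new ArrayList<>();

    // Returns true if the message was processed, false if it was a duplicate.
    public boolean onMessage(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // id already seen: ignore the redelivery
        }
        handled.add(payload); // business logic runs once per message id
        return true;
    }

    public List<String> handled() { return handled; }
}
```

With this in place, the relay is free to deliver at-least-once: the crash-and-republish scenario from the question becomes harmless, because the second delivery carries the same id and is dropped.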

AmitB10

You can also split your table into partitions using a partition key and assign each Message Relay instance its own key, so each instance only queries the records that match it. It's a similar approach to sharding, but simpler.

  • This avoids waiting for a lock, which happens when FOR UPDATE is used.
  • It also avoids instances querying the same records, which cuts down processing time.
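The idea above can be sketched as follows: each instance filters on its assigned key, equivalent to `SELECT * FROM outbox WHERE state = 'PENDING' AND partition_key = ?`. The `partition_key` column and the two-instance assignment are assumptions for illustration.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of partition-key filtering: each relay instance only sees its slice
// of the outbox, so no two instances ever pick up the same record.
public class PartitionedRelay {
    static class Row {
        final long id;
        final int partitionKey;
        Row(long id, int partitionKey) { this.id = id; this.partitionKey = partitionKey; }
    }

    // Rows the relay instance assigned to myKey would pick up.
    static List<Row> pendingFor(List<Row> table, int myKey) {
        return table.stream()
                .filter(r -> r.partitionKey == myKey)
                .collect(Collectors.toList());
    }
}
```

The trade-off is static assignment: if an instance dies, its partition's records sit unprocessed until it comes back or its key is reassigned.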