
I have a Kafka topic with millions of sale events. I have a consumer which, on every message, will insert the data into 4 tables:

  • 1 for the raw sales
  • 1 for the sales sum by date by product category (date, product_category, sale_sum)
  • 1 for the sales sum by date by customer (date, customer_id, sale_sum)
  • 1 for the sales sum by date by location (date, location_id, sale_sum)

I use a SQL database for storing my data, so the operations above are insert or update operations.
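
For context, each of the aggregate writes is an insert-or-update along these lines (a rough sketch assuming a PostgreSQL-style ON CONFLICT clause and illustrative table/column names; other databases would use MERGE or ON DUPLICATE KEY UPDATE):

  import java.math.BigDecimal;
  import java.sql.Connection;
  import java.sql.Date;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;

  class CategorySumUpsert {
      // Hypothetical upsert for the per-category aggregate table.
      // Assumes a unique constraint on (sale_date, product_category).
      static void upsert(Connection conn, Date saleDate, String category, BigDecimal amount)
              throws SQLException {
          String sql = "INSERT INTO sales_by_category (sale_date, product_category, sale_sum) "
                     + "VALUES (?, ?, ?) "
                     + "ON CONFLICT (sale_date, product_category) "
                     + "DO UPDATE SET sale_sum = sales_by_category.sale_sum + EXCLUDED.sale_sum";
          try (PreparedStatement ps = conn.prepareStatement(sql)) {
              ps.setDate(1, saleDate);
              ps.setString(2, category);
              ps.setBigDecimal(3, amount);
              ps.executeUpdate();
          }
      }
  }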

I am wondering, would it be better to have (i) 1 consumer insert into these 4 tables, or (ii) 4 consumers, each responsible for inserting into one of the tables?

What is best practice here?

Thanks


1 Answer


From my point of view, you have three different alternatives. To be honest, I'd personally choose the third one.



1 - One [consumer-producer] thread

In this scenario, you just have one thread that is responsible for:

1-Reading from Kafka
2-Process/Store in I
3-Process/Store in II
4-Process/Store in III
5-Process/Store in IV

All of that happens in sequential order, as you just have one thread that both consumes and processes the messages.

  kafka-->(read)-->(process 1)-->(process 2)-->(process 3)-->(process 4)
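
In code, this single-threaded variant might look roughly like the following sketch (assumptions: a plain Java KafkaConsumer, a topic named "sales", a local broker, and hypothetical storeXxx methods standing in for the four SQL writes):

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class SingleThreadSalesConsumer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
          props.put("group.id", "sales-all-tables");          // one consumer group for everything
          props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(List.of("sales"));            // 1 - read from Kafka
              while (true) {
                  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                  for (ConsumerRecord<String, String> record : records) {
                      storeRawSale(record.value());            // 2 - table I
                      storeByCategory(record.value());         // 3 - table II
                      storeByCustomer(record.value());         // 4 - table III
                      storeByLocation(record.value());         // 5 - table IV
                  }
                  consumer.commitSync();                       // commit only after all four stores
              }
          }
      }

      // Hypothetical placeholders for the actual SQL upserts.
      static void storeRawSale(String v)    { /* INSERT into raw sales table */ }
      static void storeByCategory(String v) { /* upsert per-category sum */ }
      static void storeByCustomer(String v) { /* upsert per-customer sum */ }
      static void storeByLocation(String v) { /* upsert per-location sum */ }
  }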

In this case, if any of steps 2 to 5 gets "damaged" and its processing speed decreases at some point, your entire pipeline slows down. And with that, the Kafka topic's lag grows, and it will keep growing as long as the thread can't finish step 5 before the next message arrives in Kafka.

For me, this is a no-no regarding performance and fault-tolerance.



2 - Four [consumer-producer]s

This uses the same paradigm as the first scenario: the thread that reads is also responsible for the processing.

But, thanks to consumer groups, you can parallelize the whole process. Create 4 different consumer groups and assign one consumer to each.

For simplicity, let's just create one thread per consumer group.

In this scenario, you have something like:

CONSUMER CG1
1-Reading from Kafka
2-Process/Store in I

CONSUMER CG2
1-Reading from Kafka
2-Process/Store in II

CONSUMER CG3
1-Reading from Kafka
2-Process/Store in III

CONSUMER CG4
1-Reading from Kafka
2-Process/Store in IV

       |-->consumer 1-->(process1)-->T1
  kafka|-->consumer 2-->(process2)-->T2
       |-->consumer 3-->(process3)-->T3
       |-->consumer 4-->(process4)-->T4

Advantages: each thread is responsible for a limited number of tasks. This will help with the lag of each consumer group.

Furthermore, if one of the storing tasks fails or its performance degrades, that won't affect the other three threads: they will continue reading and processing from Kafka on their own.
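
A hedged sketch of this variant, with one thread per consumer group (the group names, topic name, broker address and storeXxx handlers are illustrative, not from the question):

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class PerTableConsumers {
      public static void main(String[] args) {
          start("sales-raw-cg",      PerTableConsumers::storeRawSale);     // CG1 -> table I
          start("sales-category-cg", PerTableConsumers::storeByCategory);  // CG2 -> table II
          start("sales-customer-cg", PerTableConsumers::storeByCustomer);  // CG3 -> table III
          start("sales-location-cg", PerTableConsumers::storeByLocation);  // CG4 -> table IV
      }

      // Each group re-reads the whole topic with its own offsets, so a slow or
      // failing store only delays its own group.
      static void start(String groupId, java.util.function.Consumer<String> store) {
          new Thread(() -> {
              Properties props = new Properties();
              props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
              props.put("group.id", groupId);                     // distinct group => independent progress
              props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
              props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
              try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                  consumer.subscribe(List.of("sales"));
                  while (true) {
                      for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                          store.accept(record.value());           // one table per group
                      }
                      consumer.commitSync();
                  }
              }
          }, groupId).start();
      }

      // Hypothetical placeholders for the four SQL upserts.
      static void storeRawSale(String v)    { }
      static void storeByCategory(String v) { }
      static void storeByCustomer(String v) { }
      static void storeByLocation(String v) { }
  }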



3 - Decouple consuming and processing

This is, by far, in my opinion, the best possible solution.

You separate the task of reading from the tasks of processing. This way, you could, for example, launch:

  • One consumer thread

    This just reads the messages from Kafka and puts them into in-memory queues, or similar structures that are accessible from the worker threads, and that's all. It just keeps reading and putting the messages into the queues.

  • X worker threads (in this case, 4)

    These threads are responsible for taking the messages that the consumer put into the queue (or queues, depending on how you want to code it), and processing/storing them in each table.

Something like:

                            |--> queue1 -----> worker 1 --> T1
  kafka--->consumer--(msg)--|--> queue2 -----> worker 2 --> T2
                            |--> queue3 -----> worker 3 --> T3
                            |--> queue4 -----> worker 4 --> T4

What you get here is parallelization and a decoupling of processing from consuming. Here, Kafka's lag will be 0 for 99% of the time.

In this approach, the queues act as buffers if some of the workers get stuck. The rest of the system (mainly Kafka) will not be affected by the processing logic.

Note that even though Kafka won't start lagging and possibly losing messages due to retention, the internal queues must be monitored, or configured to send lagging messages to a dead-letter queue, in order to avoid the consumer getting stuck.
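
Roughly, a sketch of this decoupled layout might look like this (again, topic/group names, the broker address and the storeXxx handlers are assumptions; the "queues" here are plain Java BlockingQueues):

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class DecoupledSalesPipeline {
      public static void main(String[] args) {
          // Bounded queues: they buffer messages when a worker slows down, and
          // put() blocks when full, applying back-pressure instead of exhausting memory.
          List<BlockingQueue<String>> queues = List.of(
                  new ArrayBlockingQueue<String>(10_000),   // -> table I
                  new ArrayBlockingQueue<String>(10_000),   // -> table II
                  new ArrayBlockingQueue<String>(10_000),   // -> table III
                  new ArrayBlockingQueue<String>(10_000));  // -> table IV

          // Worker threads: each drains one queue and writes to one table.
          startWorker(queues.get(0), DecoupledSalesPipeline::storeRawSale);
          startWorker(queues.get(1), DecoupledSalesPipeline::storeByCategory);
          startWorker(queues.get(2), DecoupledSalesPipeline::storeByCustomer);
          startWorker(queues.get(3), DecoupledSalesPipeline::storeByLocation);

          // Consumer thread (here, the main thread): only reads from Kafka and
          // fans records out to the queues.
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
          props.put("group.id", "sales-decoupled-cg");
          props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(List.of("sales"));
              while (true) {
                  for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                      for (BlockingQueue<String> q : queues) {
                          q.put(record.value());             // blocks if that queue is full
                      }
                  }
                  consumer.commitSync();                     // commits once records are queued
              }
          } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
          }
      }

      static void startWorker(BlockingQueue<String> queue, java.util.function.Consumer<String> store) {
          new Thread(() -> {
              try {
                  while (true) {
                      store.accept(queue.take());            // blocks until a message is available
                  }
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
          }).start();
      }

      // Hypothetical placeholders for the four SQL upserts.
      static void storeRawSale(String v)    { }
      static void storeByCategory(String v) { }
      static void storeByCustomer(String v) { }
      static void storeByLocation(String v) { }
  }

One design point to watch in this sketch: offsets are committed once records have been handed to the queues, not after the workers finish, so a crash could drop queued-but-unprocessed messages unless you commit after processing or handle de-duplication downstream.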




This is from the KafkaConsumer javadoc, which better explains the pros and cons of each paradigm:

[screenshots of the KafkaConsumer javadoc section on consumer threading options]


A simple diagram showing the advantages of the third scenario:

[diagram: one consumer thread feeding per-table worker threads through in-memory queues]

The consumer thread just consumes. This avoids Kafka lagging, delays in the data that must be processed (remember, this should be near real-time), and loss of messages because of retention kicking in.

The X worker threads are responsible for the actual processing logic. If something fails in one of them, no other consumer or worker thread gets affected.

aran
  • I'm not seeing the benefit of approach 3. Additional queues for what? Additionally, if the in-memory queues go down you lose your messages. I have 16 partitions, and the ordering matters (I need to consume events in the order they are inserted into Kafka). You talk of worker threads, but if you have fewer consumers than partitions your consumers will act like worker threads as they service different partitions. Also, in approaches 2 and 3, how would you de-duplicate your data? Have a fact store for each consumer (likely with some timeout so you don't keep the message ids indefinitely)? – friartuck Oct 06 '22 at 06:24
  • @friartuck additional queues for what? Well, how are you supposed to decouple consuming and processing then? Take a simple look at the docs, point 2, Decouple consumption and processing: "one or more consumer threads that do all the data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads" – aran Oct 06 '22 at 08:04
  • Tables I, II, III and IV will all have their messages in order. Table I could already have stored 6 times more registries than Table II, but the order you need to honour is the order in EACH table, and any of these approaches will get the job done, in order. At the end of the day, every single table will have its messages processed in the consuming order. And no, there's no need for state stores. – aran Oct 06 '22 at 08:09
  • Again, yes... You must monitor the queues, as said in the answer: "Note that even though Kafka won't start lagging and possibly losing messages due to retention, the internal queues must be monitored, or configured to send lagging messages to a dead-letter queue, in order to avoid the consumer getting stuck." I'm not reinventing the wheel here, this is pure Kafka architecture and best practices, by Confluent itself. Please do take a look at the javadoc linked in the answer – aran Oct 06 '22 at 08:11
  • I'm not sure I see the 'decoupling'. You take messages from a Kafka topic which may or may not have multiple partitions and put them into other new queues. Do you only commit offsets after the worker thread has processed? – friartuck Oct 06 '22 at 15:03
  • Now you have problems like out-of-order messaging, the possibility of these in-memory queues going down, and offset committing, like the link says. Right now I use the kafkajs library (using express.js) to create 16 consumers. Maybe this library is doing more for me than I realise, but I don't think it's using additional queues. It will fetch a bunch of messages, and will only commit the offsets once all the messages in that batch have been processed. If the batch of messages needs to be restarted (server restart), I have to handle de-duplication myself. – friartuck Oct 06 '22 at 15:06
  • If I have 16 partitions I create 16 consumer objects and the Kafka broker handles which consumers will process which partition at which time. I don't think there are any in-memory queues involved, and I don't think that if I have my consumers push into new queues I would get any further advantage... or am I missing something obvious? Feels like I am. – friartuck Oct 06 '22 at 15:07
  • Trust me, decoupling consuming and processing does make a difference. Of course those queues don't exist in Kafka: those are your own Deques, Queues, or whatever you want to create to achieve the division between the consumer and the workers. If you create 16 consumers in the same consumer group, one thing is clear: you won't achieve any ordering. If you create 4 consumers, each with its own consumer group, all 4 consumers will read the whole Kafka topic, so your order is guaranteed. So for the 2nd scenario: 4 processes that are parallelizing the job, making it 4x faster. – aran Oct 06 '22 at 18:19
  • If you don't decouple the consumer, you have a thread that sequentially needs to: process I, process II, process III and process IV. And only after all of those will you read again from Kafka. Just imagine if process II starts lagging because of an exception: you have a complete shutdown of your system: Kafka will start deleting messages because you can't read fast enough from the topic. This is a real-time streaming paradigm: what happens if, because of just one task, ALL the system collapses? – aran Oct 06 '22 at 18:21
  • If you don't really need that decoupling, I'd recommend going for the 2nd option: 4 consumers, each one with a different consumer group. Each consumer will read ALL the data and process it in order. The difference is that these consumers don't need to fulfill 4 tasks, but just 1 each. – aran Oct 06 '22 at 18:25
  • The easiest example to realize the use cases of decoupling is this: imagine that you have a consumer that reads some data from a topic and must send this message to external endpoints (foreign SFTPs, etc.). In this scenario, if you are not able to decouple the consumption from the processing, your whole system is prone to fail because of just one external failure, such as errors in an SFTP server. Just imagine that by contract you must send these messages to 50 different companies. How would you justify not being able to send the data to 49 of them because of a failure from one external endpoint? – aran Oct 06 '22 at 18:29
  • If you decouple it, you can keep reading from Kafka so that the lag doesn't increase and Kafka doesn't start deleting old messages. Also, you can identify which endpoint/worker is giving you problems, so you can store the failed messages in another dead-letter queue. – aran Oct 06 '22 at 18:31
  • I believe what you are missing here is the pressure Kafka puts on the speed of processing in order to stay up to date all the time. If you can't read fast enough, that means delays in the results, and ultimately losing data because of Kafka's retention time. – aran Oct 06 '22 at 18:36