
My producer generates n tasks from a single input message and publishes them to a topic.

The requirement is that, across all the individual consumers in the topic's consumer group, no single consumer should process more than 3 of these n tasks within 1 hour.

This means that if I want to process all these messages immediately, I need at least ceil(n/3) consumers. If there are fewer than ceil(n/3) consumers, then I need some way of deferring a message until the consumer has processed fewer than 3 of the n tasks in the last hour.
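The minimum-consumer arithmetic can be sketched as a one-liner (the function name here is just for illustration):

```python
import math

def min_consumers(n: int, limit: int = 3) -> int:
    # ceil(n / limit): the smallest consumer count where no consumer
    # has to take more than `limit` of the n tasks
    return math.ceil(n / limit)
```

For example, n = 7 tasks with a limit of 3 per consumer needs ceil(7/3) = 3 consumers.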

In terms of practicalities to implement this solution, I am hoping to use Kafka with Faust [1] but I also have access to Redis if necessary.

My idea so far has been to ensure that there are at least ceil(n/3) consumers at produce time, and then have the producer assign tasks to partitions round-robin. This is the optimal solution anyway, because it avoids ever having to wait up to 1 hour to process a message. However, it only works until enough consumers die, at which point more than 3 of the n tasks could end up with the same consumer, most likely within 1 hour. This is unacceptable.
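The producer-side round-robin idea could look something like this, assuming kafka-python (`pip install kafka-python`); the topic name `"tasks"` and broker address are placeholders, not from the question:

```python
import math

def assign_partitions(num_tasks: int, num_partitions: int) -> list:
    # task i goes to partition i % num_partitions, so with
    # ceil(n/3) partitions no partition receives more than 3 tasks
    return [i % num_partitions for i in range(num_tasks)]

def publish_round_robin(tasks: list) -> None:
    from kafka import KafkaProducer  # imported lazily; needs a running broker
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    num_partitions = math.ceil(len(tasks) / 3)
    for task, partition in zip(tasks, assign_partitions(len(tasks), num_partitions)):
        producer.send("tasks", value=task, partition=partition)
    producer.flush()
```

Note this guarantees the per-partition cap only while each partition has its own live consumer; a rebalance after consumer deaths breaks the guarantee, which is exactly the failure mode described above.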

Another idea might be to have each consumer check, every time it takes a message, whether it has already executed 3 of the n tasks and, if so, somehow hand the message off to another consumer - but I could not find any suitable mechanism in Kafka to enable this.

[1] https://faust.readthedocs.io/

deed02392
  • Not sure about Faust, but you can set `max.poll.records=3`, then poll indefinitely until you actually get 3 records, then process them, commit those offsets, and repeat. – OneCricketeer Jan 23 '20 at 18:14
  • Thank you, but I'm not sure that would prevent the processing of more than 3 of the n related messages by the same consumer? – deed02392 Jan 23 '20 at 18:26
  • If you limit the consumer to only poll three messages, and break the loop if not, there is no way it can accept more than that – OneCricketeer Jan 23 '20 at 18:36
  • But after processing those 3 messages, that same consumer may fetch another 3 of the same set of n messages, within 1 hour. – deed02392 Jan 23 '20 at 18:39
  • Only if you continue the poll loop – OneCricketeer Jan 23 '20 at 18:50
  • Can you elaborate in an answer please? – deed02392 Jan 23 '20 at 18:57
  • Again, I don't know faust. In almost all Kafka consumer code, you have `while (true) records = consumer.poll(time)`. Remove/edit the loop condition. Check if the `len(records) >= 3` – OneCricketeer Jan 23 '20 at 19:04
  • I don't *have* to use Faust. I'm just struggling to understand your suggestion in terms of when not polling the loop, you're not sitting idle. – deed02392 Jan 23 '20 at 20:13
  • You don't have to use a loop, basically. Hourly cron works as well – OneCricketeer Jan 23 '20 at 22:25
  • Maybe I did not describe the problem well enough. But these n tasks come in regularly. I don't want a consumer to sit idle when it could start working on 3 more of the tasks from the same topic but which are related to a different set of n tasks. – deed02392 Jan 23 '20 at 22:27
  • Alright, then you need some infinite loop to keep polling. And you can check the amount of records that have been returned since the last time around. If you don't get enough records, then discard that batch and let Kafka internals tell the application to re-fetch data – OneCricketeer Jan 23 '20 at 23:24
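The poll-at-most-3 pattern from the comments might be sketched as follows, assuming kafka-python; the topic/group names and `process_task` are illustrative placeholders:

```python
def flatten_poll(by_partition: dict) -> list:
    # KafkaConsumer.poll() returns {TopicPartition: [records]};
    # flatten that into a single list of records
    return [rec for recs in by_partition.values() for rec in recs]

def consume_in_threes(consumer, process_task, batch_size: int = 3):
    while True:
        records = flatten_poll(consumer.poll(timeout_ms=1000, max_records=batch_size))
        if not records:
            continue  # nothing fetched this round; poll again
        for record in records:
            process_task(record.value)
        consumer.commit()  # advance offsets only after the batch is done

def build_consumer():
    from kafka import KafkaConsumer  # pip install kafka-python
    return KafkaConsumer(
        "tasks",
        group_id="task-workers",
        bootstrap_servers="localhost:9092",
        enable_auto_commit=False,
        max_poll_records=3,  # cap each fetch at 3 records, as suggested above
    )
```

On its own this caps records per poll, not per hour, which is the gap the discussion above keeps circling.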

1 Answer


Something that would require a bit more effort, but would make the actual processing a lot easier, is to have a pre-consumer that simply waits until there are 3 messages to consume, packages them up in a "meta-message", and sends that to a "ready-for-processing" topic. As @cricket_007 mentioned, it shouldn't commit until it has actually consumed 3 messages and produced them to the outbound topic.

This way, the final consumer is an extremely simple one. It would simply consume from the "ready-for-processing" topic, and any time it gets a message, you know it contains the 3 events you need. You simply process them, then wait an hour until you can poll again. There would be no need to coordinate with any other consumers.
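A sketch of that pre-consumer, assuming kafka-python; the topic names follow the answer, while the broker address and UTF-8 encoding are assumptions:

```python
import json

def make_meta_message(payloads: list) -> bytes:
    # package exactly 3 task payloads into one JSON "meta-message"
    assert len(payloads) == 3
    return json.dumps({"tasks": payloads}).encode("utf-8")

def run_pre_consumer():
    from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python
    consumer = KafkaConsumer("tasks", group_id="pre-consumer",
                             bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    buffered = []
    for record in consumer:
        buffered.append(record.value.decode("utf-8"))
        if len(buffered) == 3:
            producer.send("ready-for-processing", make_meta_message(buffered))
            producer.flush()
            consumer.commit()  # commit only once the bundle is safely produced
            buffered.clear()
```

The downstream consumer then never has to count: one meta-message always carries exactly 3 tasks.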

mjuarez
  • Many thanks for your answer. Unfortunately the issue is not that one consumer needs to process 3 events together, rather it needs to only process a maximum of 3 events per hour out of a larger set of n events. If there are fewer than 3 events for example, they still need to be (and can be) processed immediately – deed02392 Jan 23 '20 at 18:37
  • @deed02392 Sounds like you first want to window events into hourly segments – OneCricketeer Jan 23 '20 at 18:51
  • I think that would force a delay of 1 hour though, even if there were actually enough consumers to distribute all the tasks to be processed immediately? – deed02392 Jan 23 '20 at 18:59