
We are looking for a message queuing system that ensures the sequential processing of messages based on their row numbers.

This seems like a foundational part of computing - but we seem unable to find it.

Here's what we mean.

PUSH/POP: Adding and removing items from a list is common in code (push/pop). These methods typically ensure every item is processed once-and-only-once ("exactly once processing") and in-order (ordinality).

Finding this as a cloud service has proven elusive.

GOOGLE: We were told by Google reps that Pub/Sub can guarantee items from a list are emitted in order and once-and-only-once but cannot guarantee that the items you added to the list were added in the order you expected. (The internet can be slow - or drop an item - and the order can be lost.)

APACHE: Kafka, as we were also told, cannot guarantee exactly once processing with ordinality.

The result is that both seem to offer only a partially implemented push/pop capability.

Does any such Cloud Message Queuing service exist? Or is this simply something we will have to write ourselves?

This seems so basic and fundamental to queue processing that we are surprised if it doesn't exist.


Example Background

Let’s say I have five messages labeled with row numbers one through five, but they may arrive at the message queue out of order. For example, a message with row number 3 might arrive before the one with row number 2. The message queue should forward the messages to the service in ascending order based on their row numbers. Additionally, if there is a delay or error causing a row number to be missing, we want the queue to pause processing until the missing message arrives before sending the subsequent messages to the service. Is there really no suitable message queuing system for this requirement?

Praxiteles

3 Answers


APACHE: Kafka, as we were also told, cannot guarantee exactly once processing with ordinality

Told by whom?

Within a single partition, there is order. But data is persisted, so it could be read more than once if needed, or if you don't handle offsets carefully. The offset is what determines ordering, not data within a record (i.e. a row number).

If you don't enable retries on the producer, data within a record batch should also not be re-ordered in faulty network situations.
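
For reference, a minimal sketch (assuming the confluent-kafka Python client and a broker at localhost:9092; the topic "rows" and key "table-1" are made up) of one standard way to keep per-partition order stable even with retries, by enabling idempotence rather than disabling retries:

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",  # assumed broker address
        "enable.idempotence": True,             # prevents retry-induced duplicates/reordering
        "acks": "all",
    })

    # Every message shares one key, so all rows hash to the same partition
    # and are appended in the order they were produced.
    for row in range(1, 6):
        producer.produce("rows", key="table-1", value=str(row).encode())

    producer.flush()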


But Pulsar, RabbitMQ, JMS, ActiveMQ, NATS, and Mosquitto all exist. Surely one of them fits your requirement.

But ultimately, you need a distributed lock / global counter that tracks which incremental value needs to be processed next. Then you can backfill out-of-order data in some regular RDBMS - SELECT data_to_process FROM queue WHERE status='WAITING' ORDER BY row ASC LIMIT 1; - and do a simple lookup for row N once row N-1 has completed processing (you could use Kafka or another queue to trigger that completion event/lookup).
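
A minimal sketch of that idea, using sqlite3 as a stand-in for "some regular RDBMS" and a plain function argument in place of a real distributed lock / global counter; the table and column names follow the query above:

    import sqlite3

    conn = sqlite3.connect("queue.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS queue (
            "row" INTEGER PRIMARY KEY,
            data_to_process TEXT,
            status TEXT DEFAULT 'WAITING'
        )
    """)

    def process_next(expected_row):
        """Process row N only if it has arrived; otherwise keep waiting for the backfill."""
        cur = conn.execute(
            """SELECT data_to_process FROM queue
               WHERE status = 'WAITING' AND "row" = ?""",
            (expected_row,),
        )
        item = cur.fetchone()
        if item is None:
            return False  # row N is missing: pause until it shows up
        # ... hand item[0] to the downstream service here ...
        conn.execute(
            """UPDATE queue SET status = 'DONE' WHERE "row" = ?""",
            (expected_row,),
        )
        conn.commit()
        return True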

OneCricketeer

This seems so basic and fundamental to queue processing that we are surprised if it doesn't exist.

Normally the queue itself determines order and not some external system.


The first idea would be to use an SQL table as a queue and always request the next item that has not been processed (we can also look up exactly the right item via an index, because we always know what comes next).


Another idea: have 2 queues:

  • queue 1 has the items unordered
  • queue 2 has the items ordered
  • in between is a message sorter that keeps the items that are not yet up for grabs in memory or a persistence store and releases them when the earlier messages appear (see the sketch after this list).
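
A minimal sketch of that sorter, using Python's standard queue.Queue objects as stand-ins for the two queues; items are (row number, payload) pairs, and out-of-order items are held in a heap until the gap is filled:

    import heapq
    import queue

    unordered = queue.Queue()  # queue 1: items arrive in any order
    ordered = queue.Queue()    # queue 2: items leave strictly by row number

    def sorter(first_row=1):
        pending = []           # min-heap of (row, payload) items not yet releasable
        next_row = first_row
        while True:
            item = unordered.get()      # blocks until queue 1 has something
            heapq.heappush(pending, item)
            # Release every item that is next in sequence; stop (i.e. pause)
            # as soon as a row number is missing.
            while pending and pending[0][0] == next_row:
                ordered.put(heapq.heappop(pending))
                next_row += 1
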
Loading

I'm not sure there is a messaging system that is going to understand your notion of order inherently and endeavor to deliver messages in exactly that order, even if they arrive at the system out of order. To place messages in an order like that would require it to understand the meaning of your messages: that they represent rows that need to be processed in a sequential order. Assuming the system could understand that, it would drastically reduce any parallelism that can be achieved. In Cloud Pub/Sub, you'd have to use a single ordering key for the entire table, which would be an anti-pattern. For Kafka, you'd have to use a single partition, or you'd have to carefully manage your consumption from different partitions in a single consumer in order to maintain the desired order.
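
For illustration, a hedged sketch of what the single-ordering-key approach looks like with the google-cloud-pubsub client (the project, topic, and key names are made up); every row shares one key, which is exactly the serialization bottleneck described above:

    from google.cloud import pubsub_v1

    # Ordering must be enabled on the publisher (and on the subscription).
    publisher = pubsub_v1.PublisherClient(
        publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
    )
    topic_path = publisher.topic_path("my-project", "rows")  # hypothetical names

    for row in range(1, 6):
        # One ordering key for the whole table: ordered, but fully serialized.
        publisher.publish(topic_path, data=str(row).encode(), ordering_key="whole-table")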

If throughput is not a concern, you could probably ensure that your rows are received by the messaging system in order: you would only publish one row at a time and you would not start the send of row n + 1 until you got a successful response on the send of row n. You'd probably need to do this from a single publisher as well or you'd have to deal with coordination across them. There also still may be the possibility of duplicate messages in some systems.
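
A minimal sketch of that one-in-flight publishing pattern, again with the google-cloud-pubsub client and hypothetical names: block on each publish future before sending the next row, trading throughput for arrival order.

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "rows")  # hypothetical names

    for row in range(1, 6):
        future = publisher.publish(topic_path, data=str(row).encode())
        # Wait for the broker to acknowledge row n before sending row n+1.
        future.result()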

If you have a single subscriber/consumer, you could also build this logic into that layer. On the publish side, you'd attach a strictly monotonically increasing sequence number (row number) with no gaps to each message. The subscriber could then buffer messages it receives out of order. It could also throw away duplicates as it knows that once it's moved beyond a certain sequence number, it must have processed that message. Even messaging systems that don't guarantee order will likely be relatively well ordered in terms of delivery, so you probably won't have to buffer that often.
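
A minimal sketch of that subscriber-side buffer (the class and method names are invented for illustration): anything below the watermark is dropped as a duplicate, anything ahead of it is held until the gap closes.

    import heapq

    class Resequencer:
        def __init__(self, first_row=1):
            self.next_row = first_row   # watermark: lowest row not yet processed
            self.buffer = []            # min-heap of (row, message) pairs held back

        def accept(self, row, message):
            """Return the messages that are now safe to process, in row order."""
            if row < self.next_row or any(r == row for r, _ in self.buffer):
                return []               # duplicate: already processed or already held
            heapq.heappush(self.buffer, (row, message))
            ready = []
            while self.buffer and self.buffer[0][0] == self.next_row:
                ready.append(heapq.heappop(self.buffer)[1])
                self.next_row += 1
            return ready

    # 2 is delayed, so 3 is held back; a re-delivered 1 is discarded as a duplicate.
    r = Resequencer()
    assert r.accept(1, "row 1") == ["row 1"]
    assert r.accept(3, "row 3") == []
    assert r.accept(1, "row 1") == []
    assert r.accept(2, "row 2") == ["row 2", "row 3"]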

Some messaging systems do offer deduping on the publish side, e.g., Pulsar and NATS. These guarantees typically apply if the duplicates are published within a defined window of time.
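
As a hedged sketch of that publish-side dedup with NATS JetStream (using the nats-py client; the server address, stream, and subject names are made up), duplicates are detected via the Nats-Msg-Id header within the stream's duplicate window:

    import asyncio
    import nats

    async def main():
        nc = await nats.connect("nats://localhost:4222")   # assumed local server
        js = nc.jetstream()
        await js.add_stream(name="ROWS", subjects=["rows.*"])

        # Two publishes with the same Nats-Msg-Id inside the stream's duplicate
        # window: the server drops the second one as a duplicate.
        await js.publish("rows.1", b"row 1 payload", headers={"Nats-Msg-Id": "row-1"})
        await js.publish("rows.1", b"row 1 payload", headers={"Nats-Msg-Id": "row-1"})

        await nc.close()

    asyncio.run(main())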

Alternatively, you could use Google Pub/Sub with Dataflow, which could dedupe messages and only emit them for processing once they are guaranteed to be in order by your definition of order, assuming you can attach a timestamp to each row as an attribute in the message you publish. See "Stream messages from Pub/Sub by using Dataflow."

Kamal Aboul-Hosn