
Note: this is about an adaptive-scaling SQS poller/subscriber, but I have tried to present it in slightly more abstract terms: just backpressure management and a mediator.

In the code below I skipped all locks, critical sections, semaphores, and try/finally blocks for clarity, but I am fully aware of them.

Imagine I have an internal queue, a consumer, and a poller. From the perspective of this queue the poller is a producer, but it is really a mediator between the consumer and some external source (the SQS queue mentioned above).

# a smaller internal queue is better,
# as all the items in it are in a kind of limbo
queue = Queue(32)

def consume():
    while True:
        # this may block if the queue is empty
        item = queue.dequeue()
        process(item)  # do the actual work on the item

def poll():
    running += 1
    while True:
        if running > expected:
            break
        items = poll_batch()
        for item in items:
            # this may block if queue is full
            queue.enqueue(item) 
    running -= 1
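To make the shape of this concrete, here is a runnable version of the two loops above using only the Python stdlib. `poll_batch` is a stand-in for the external source (not a real SQS call), and the stop-event/sentinel plumbing exists only so the demo can terminate cleanly:

```python
import queue
import threading
import time

# Small bounded buffer: items sitting here are "in limbo".
buf = queue.Queue(maxsize=32)
processed = []  # stands in for real work, just to observe the flow


def consume():
    while True:
        item = buf.get()        # blocks while the buffer is empty
        if item is None:        # sentinel: shut the consumer down
            break
        processed.append(item)


def poll_batch():
    # Stand-in for the external source (e.g. an SQS receive call):
    # one round trip has latency and returns at most 10 items.
    time.sleep(0.01)
    return list(range(10))


def poll(stop):
    while not stop.is_set():
        for item in poll_batch():
            buf.put(item)       # blocks while the buffer is full


stop = threading.Event()
consumer = threading.Thread(target=consume)
poller = threading.Thread(target=poll, args=(stop,))
consumer.start()
poller.start()
time.sleep(0.1)
stop.set()
poller.join()
buf.put(None)                   # wake the consumer so it can exit
consumer.join()
```

With a fast consumer, the poller's round-trip latency is the bottleneck, which is exactly the problem described next.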

Now, poll_batch has a natural limitation: it returns at most 10 items and carries network-related overhead. So if the request round trip takes 100 ms, there is a natural ceiling of 100 items per second per loop, and there is no way around it. To address that, I can start additional poll loops (when polling is too slow) or stop existing ones (when it is too fast).
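The arithmetic behind that ceiling, and the number of loops a given target rate implies (the 100 ms round trip is the same illustrative figure as above, not a measured value):

```python
import math

BATCH_SIZE = 10        # poll_batch returns at most 10 items
ROUND_TRIP_S = 0.100   # assumed network round trip per poll

# Each poll loop can deliver at most this many items per second.
MAX_RATE_PER_POLLER = BATCH_SIZE / ROUND_TRIP_S  # 100 items/s


def pollers_needed(target_rate):
    # Concurrent poll loops needed to sustain target_rate, assuming
    # each loop is limited only by its own round-trip latency.
    return math.ceil(target_rate / MAX_RATE_PER_POLLER)
```

So a consumer that can absorb 250 items/s would need three concurrent loops, which is what the supervisor below is meant to discover dynamically.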

For example, I can imagine a supervisor:

def supervisor():
    while True:
        sleep(1)
        decision = do_i_need_more_or_less() # -1, 0, 1
        expected = running + decision
        if expected > running:
            new_thread(poll)
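For illustration, the naive decision function based on instantaneous buffer occupancy (the thresholds are arbitrary, and this is roughly the approach I already know does not work well with a small queue):

```python
import queue


def do_i_need_more_or_less(q, low=0.25, high=0.75):
    # Naive heuristic on instantaneous occupancy of a bounded queue:
    # nearly empty -> the consumer may be starving -> add a poller (+1)
    # nearly full  -> pollers outrun the consumer  -> drop one    (-1)
    fill = q.qsize() / q.maxsize  # assumes maxsize > 0
    if fill < low:
        return 1
    if fill > high:
        return -1
    return 0
```

With a 32-slot buffer this instantaneous reading is very noisy, which is the core of the problem below.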

So now, the question is:

how to implement do_i_need_more_or_less()? What data does it have, and what data does it need?

Theoretically it is simple: we need to balance the consumption rate with the production rate. But when I get into the actual implementation, there are a few problems:

The actual consumption rate is never higher than the production rate, as the consumer cannot consume more than was produced (ok, there is a buffer between them, so briefly it can, but this internal queue is intentionally small; I would even say a queue of length 1 would be ideal). So we need the "potential consumption rate" (not the "actual" one), which we can measure only by ramping production above that potential limit, but then the internal queue fills up and all pollers block. How do we detect such blocking? How do we balance things if 3 pollers were blocked and 2 pollers were not? What does that mean? Does it matter whether one was blocked with 1 item waiting and another with 7 items waiting? Probably. What about the other way around (when one had 1 item of headroom and another had 5)?
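One proxy for the "potential consumption rate" that can actually be measured is the time each side spends blocked on the queue: if pollers wait longer on enqueue than the consumer waits on dequeue, production is outrunning potential consumption, and vice versa. A sketch of an instrumented wrapper (the naming is mine, and the counters are deliberately lock-free, like the rest of the pseudocode here):

```python
import queue
import time


class InstrumentedQueue:
    # Wraps a bounded queue and accumulates time spent blocked
    # on each side, as a signal for the supervisor.
    def __init__(self, maxsize):
        self.q = queue.Queue(maxsize=maxsize)
        self.producer_blocked_s = 0.0  # pollers waiting for free space
        self.consumer_blocked_s = 0.0  # consumer waiting for items

    def enqueue(self, item):
        t0 = time.monotonic()
        self.q.put(item)               # blocks while full
        self.producer_blocked_s += time.monotonic() - t0

    def dequeue(self):
        t0 = time.monotonic()
        item = self.q.get()            # blocks while empty
        self.consumer_blocked_s += time.monotonic() - t0
        return item

    def pressure(self):
        # > 0: producers wait more -> polling too fast, scale down
        # < 0: consumer waits more -> polling too slow, scale up
        return self.producer_blocked_s - self.consumer_blocked_s
```

This sidesteps the "how many pollers were blocked" counting question by summing wait time across all of them, though the counters would need to be reset or decayed between supervisor ticks.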

I can say that with a quite large internal queue it might be easy: with the limit set to 100,000 items, I can easily spot the level going down (so I can increase the polling rate) or going up (the polling rate is too high). But with a queue of size 32 that does not work: the level changes too quickly and jumps up and down.
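One way to tame that jitter without enlarging the queue is to smooth the occupancy signal before comparing it against thresholds, e.g. with an exponentially weighted moving average (a sketch; `alpha` is an arbitrary smoothing factor):

```python
class SmoothedOccupancy:
    # Exponentially weighted moving average of queue occupancy,
    # so a 32-slot buffer's jumpy fill level becomes a usable trend.
    def __init__(self, alpha=0.2):
        self.alpha = alpha   # higher = reacts faster, smooths less
        self.value = None    # no samples seen yet

    def update(self, fill_ratio):
        # fill_ratio: current qsize / maxsize, sampled each tick.
        if self.value is None:
            self.value = fill_ratio
        else:
            self.value += self.alpha * (fill_ratio - self.value)
        return self.value
```

The supervisor would then sample the queue every tick, feed `update()`, and make its +1/0/-1 decision on the smoothed value instead of the raw one.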

  • Why have an 'internal queue'? Can't they just pull from the SQS queue when they want some input? And what's the use of having an Internal Queue if it can become 'full'? Should a queue have effectively infinite capacity? – John Rotenstein Jun 03 '23 at 20:33
  • @JohnRotenstein it is a valid question, and I actually started with a design without an internal queue, and it allowed me to spot the initial problem. When you need more work and you poll, it is already too late: you wait for the poll to return, let's say, for 100ms, and you effectively do nothing. So you need to poll before. Now, maybe even two pollers cannot deliver items fast enough, maybe three, etc. How to find out how many you need? That's the real question. The internal queue is just to make sure the next poll can start before the consumer runs out of work, but if it were removed the problem would remain the same. – Milosz Krajewski Jun 04 '23 at 11:49
  • If your goal is to process messages faster, then simply have more simultaneous workers. Rather than being worried about a worker waiting for a `GetMessage()` response, simply run more workers and take advantage of that 'quiet' time to allow more CPU or I/O to go to other workers. Does your system have a particular 'bottleneck' such as CPU, I/O or throughput? If so, manage the number of **parallel workers** to stay within that limitation. Concentrate on _total throughput_ rather than the throughput of an individual worker. – John Rotenstein Jun 05 '23 at 07:43
  • @JohnRotenstein this is not really an answer to my question. The number of workers is determined by different heuristics and is not this component's decision (not even this team's), but I need to make sure those workers are busy. I need to adjust the polling speed to the potential consumption rate. – Milosz Krajewski Jun 05 '23 at 09:09
