1

I don't expect trio to run in any particular order. It is async, after all. But I noticed something strange and wanted to ask if anyone else could explain what might have happened:

  1. I wanted to test the rate of data ingestion from Google's Pub Sub if I send a small message one at a time. In order to focus on the I/O of pushing to Pub Sub, I sent messages async, and I use trio because, well, I want to keep my head from exploding.
  2. I specifically wanted to look at how fast Pub Sub would be if I turned on it's ordering capability. I really just wanted to test throughput, and since I was using an async process, I didn't expect any ordering of messages, but I tagged the messages just out of curiosity.
  3. I noticed that the messages were processed in pub sub (and therefore sent to pub sub) at exactly the opposite order that is written in the imperative code.

Here is the important snippet (I can provide more if it is helpful):

async with open_nursery() as nursery:
    for num in range(num_messages):
        logger.info(f"===Creating data entry # {num}===")
        raw_data = gen_sample(DATASET, fake_generators=GENERATOR) # you can ignore this, it is just a toy data generator. It is synchronous code, but _very_ fast.
        raw_data["message number"] = num  # <== This is the CRITICAL LINE, adding the message number so that I can observe the ordering.
        data = dumps(raw_data).encode("utf-8")
        nursery.start_soon(publish, publisher, topic_path, data, key)

and here is the publish function:

async def publish(
    publisher: PublisherClient, topic: str, data: bytes, ordering_key: str
):
    future = publisher.publish(topic, data=data, ordering_key=ordering_key)
    result = future.result()
    logger.info(
        f"Published {loads(data)} on {topic} with ordering key {ordering_key}   "
        f"Result: {result}"
    )

And when I look at the logs in Pub/Sub, they are 100% consistently in reverse order, such that I see "message number" 50_000 first, then 49_999, 49_998, ..., 3, 2, 1. Pub Sub is maintaining ordering. This means somehow, the async code above is "first" starting the very last task to reach nursery.start_soon.

I'm not sure why that is. I don't understand exactly how Pub Sub's Future works, because the documentation is sparse (at least what I found), so it is possible that the "problem" lies with Google's PublisherClient.publish() method, or Google's result() method that the returned future uses.

But it seems to me that it is actually due to the nursery.start_soon. Any ideas why it would be exactly in the opposite order of how things are written imperatively?

Mike Williamson
  • 4,915
  • 14
  • 67
  • 104

1 Answers1

2

Heh, nicely spotted.

So what's going on here is that Trio intentionally randomizes the order that tasks get scheduled, to help you catch places where you're accidentally relying on scheduler details to coordinate between tasks (since this is fragile, hard to reason about, and constrains trio's ability to improve the scheduler in the future).

Originally, the way this worked is that when tasks became runnable, they'd get appended to a list, and then each scheduler tick would use random.shuffle to randomize the list, then loop over it and run each task.

However! It turns out that randomly shuffling a list is actually kind of expensive, esp for something you're doing on every scheduler tick. So as an optimization, we switched to a much simpler randomization scheme: flip a coin, and if it's heads do nothing, if it's tails then reverse the list. The intuition here is that this is way cheaper, but it still guarantees that any two tasks will be scheduled both as A->B and B->A on some fraction of runs, and this should be enough to catch most cases where you're accidentally relying on scheduling order.

So if you do enough runs, you should find that your log lines are actually 50% in ascending order, and 50% in descending order. (But you shouldn't rely on this :-))

Nathaniel J. Smith
  • 11,613
  • 4
  • 41
  • 49
  • Cool, thanks! That's clever! I just wanted to make sure that in fact this weird occurrence was due to `trio`, and not due to something in the rest of my interactions with GCP Pub/Sub. In fact, knowing this, I'm more confident with everything else in the pipeline! (Oh, and it's a daily pipeline for now, so I'll keep my eye out for the perfectly in order possibility.) – Mike Williamson Jul 26 '23 at 07:29
  • Can I also assume that this means all 50,000 of my nursery tasks are being added to the queue before anything becomes runnable (before trio's next check for anything waiting in the queue)? Otherwise, I'd think that at least some of the tasks are running out of order. Do I understand this correctly? – Mike Williamson Jul 26 '23 at 07:33
  • Yeah, `nursery.start_soon` creates a new task object and immediately marks that task as runnable. But the scheduler doesn't run until you `await` at a "checkpoint". So if you have a loop calling `nursery.start_soon` and there's no `await` in the loop, then all your new tasks will be queued up to start running in the next scheduler tick. – Nathaniel J. Smith Jul 27 '23 at 08:09
  • Interesting design! Out of curiosity, is guaranteeing random execution considered a must for concurrency? – jupiterbjy Jul 28 '23 at 06:49
  • I am not speaking for Nathaniel, but no. The point of concurrency is that you _must_ design _your_ code to have no presumption of ordering. This is just a cheap way to emulate that. _Most_ async I/O-related libraries, including trio, have a "queue" and a "checker". Tasks get loaded to the queue, then every so often, the checker sees what it can take from the queue. Depending upon what is put on the queue in what order and when the check is made, tasks _can_ be grabbed in a different order. But if, like in my case, they stack super-quickly, then they are grabbed in a deterministic order. – Mike Williamson Jul 28 '23 at 08:26