I'm running an asyncio application that needs more than one event loop to service a large number of IO operations (upwards of a thousand simultaneously). Each event loop runs in its own thread and loops forever, servicing coroutines as they are submitted.
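For context, each loop/thread pair is set up roughly like this (a simplified sketch; the class name and `submit` helper are just illustrative):

```python
import asyncio
import threading

class LoopThread:
    """One event loop running forever in its own daemon thread."""
    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()

    def submit(self, coro):
        # Thread-safe handoff of a coroutine to this loop; returns a
        # concurrent.futures.Future the caller can wait on.
        return asyncio.run_coroutine_threadsafe(coro, self.loop)
```

Callers pick one of these instances and hand it coroutines via `submit(...)`.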

I'm looking for a way to identify when an existing event loop is near full capacity so I can fire up a new event loop thread on demand, rather than pre-specifying how many event loops I want to run.

Near capacity would mean the event loop is busy, say, 80%+ of the time. If an event loop is spending less than 20% of its time in a wait state, it's time to add another event loop thread.

It doesn't seem like this is easy to measure per-thread: Profile Python CPU Usage By Thread
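The closest I've come to a workable heuristic is indirect (just a sketch; the interval and threshold below are arbitrary): run a small watchdog coroutine on each loop and measure how late a periodic sleep wakes up. Large, sustained lag suggests the loop rarely gets back to its selector on time, i.e. it is close to saturated.

```python
import asyncio

async def monitor_lag(interval=0.1, threshold=0.05, on_overload=None):
    """Rough proxy for loop busyness: measure how late asyncio.sleep()
    wakes up. interval/threshold are placeholder values."""
    loop = asyncio.get_event_loop()
    while True:
        start = loop.time()
        await asyncio.sleep(interval)
        lag = loop.time() - start - interval
        if on_overload is not None and lag > threshold:
            on_overload(lag)  # e.g. flag this loop as "near capacity"
```

The idea would be to submit this to every loop with `run_coroutine_threadsafe` and treat sustained lag as "full", though it measures scheduling delay rather than a true wait-time percentage.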

David Parks
  • In light of the GIL, would multiple event loops in different threads give much of a performance benefit? Performance could even go down. Would running them in different processes be better? – Michal Charemza Sep 20 '19 at 07:15
  • I have already run multiple event loops within a single Python process and it scaled well. I've been able to initiate 1000 parallel reads against our local S3 cluster endpoint and sustain 1.5 GB/sec random-access reads of large files with 5MB contiguous chunks. That required 4 event loops, each running in its own thread. – David Parks Sep 20 '19 at 21:35

2 Answers


I'm looking for a way to identify when an existing event loop is near full capacity so I can fire up a new event loop thread

I don't think this approach can work because of the GIL. The use case you seem to be describing is that of event loops stalling due to CPU overload. If that is the case, adding more threads won't help, simply because CPU-bound work is, with rare exceptions, not parallelized in Python.

If your event loops are doing too much CPU-related work (e.g. calculations), you should move those individual units of work to separate threads using run_in_executor. If that is not enough, you can try switching to uvloop, a high-performance drop-in replacement for asyncio's default event loop on CPython. You can also try asyncio with PyPy.
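For example, offloading a blocking or CPU-heavy call while keeping the loop responsive might look roughly like this (a sketch; `crunch` is a placeholder for your own function, and the uvloop lines are optional and assume the package is installed):

```python
import asyncio

# Optional: swap in uvloop as the event loop implementation.
# import uvloop
# asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

def crunch(n):
    # Placeholder for CPU-heavy or blocking work.
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_event_loop()
    # Runs crunch() in the loop's default thread pool so the loop
    # can keep servicing other coroutines in the meantime.
    result = await loop.run_in_executor(None, crunch, 1_000_000)
    print(result)

asyncio.get_event_loop().run_until_complete(main())
```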

If none of those options work, the next thing to try is some variant of multiprocessing. (Or a more low-level/performance-oriented language.)
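A minimal sketch of the multiprocessing variant, driving a process pool from the event loop (this assumes the work function and its arguments are picklable):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # CPU-bound work runs in worker processes, outside the GIL.
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_event_loop()
    with ProcessPoolExecutor() as pool:
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, crunch, 1_000_000) for _ in range(4))
        )
    print(results)

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())
```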

user4815162342
  • I've had many event loops running in separate threads, and they don't seem affected by the GIL. I've performed reads from S3 with 1000 parallel S3 read operations across multiple event loops and seen read rates of 1.5 GB/sec against a local S3 endpoint in a large cluster using this technique. – David Parks Sep 20 '19 at 15:34
  • The coroutines aren't CPU intensive, they are purely `aiobotocore/boto3` IO only. But that much IO requires more than 1 CPU core of compute to service. Multiple event loops do allow it to scale up. – David Parks Sep 20 '19 at 15:35
  • @DavidParks That's really interesting. Did you measure the read rates with a single event loop? – user4815162342 Sep 20 '19 at 20:16
  • In my previous experimentation I had everything packaged in an object with a single event loop in its own thread, to achieve higher than a few hundred MB/s I simply created multiple such objects, each explicitly defining their own event loop with `new_event_loop()` and running it in a thread. I distributed requests evenly and it scaled up well. I've since re-written that earlier work and am now encapsulating the multiple event loops under one object. In the original case I simply specified a number of objects (and hence event loops) to spawn, so I'm trying to one-up that in this iteration. – David Parks Sep 20 '19 at 20:48
  • @DavidParks It would be interesting to also try with a single event loop, as asyncio is meant to be used. There is a chance that you will get the same, or possibly even higher, rates. – user4815162342 Sep 21 '19 at 06:13
  • With a single event loop you see a single core max out at 100% and no more IO throughput is possible no matter how many coroutines you throw at it. This isn't a limitation on S3 or the network bandwidth. As soon as I add another event loop I can achieve higher IO throughput. There just is a point at which a single core can't service all the IO requests at these scales. `asyncio` is built to allow multiple event loops by design. – David Parks Sep 22 '19 at 16:24
  • @DavidParks I was referring to the limitation of Python's GIL, not of S3 or the network stack. All Python code is effectively serialized by the GIL, though I suppose in your case it helps that the OS gets to do more polls in parallel, so even the serialized Python responses get done sooner/faster as a result. If the technique of spawning multiple event loops to share the load can be generalized to other domains, you might want to share it in a blog post, as it's not well-known in asyncio. I would encourage you to try uvloop nonetheless, because you might get better performance yet. – user4815162342 Sep 22 '19 at 17:14
  • I'll give `uvloop` a careful eval, thanks for the pointer on that, I wasn't aware of that package before. – David Parks Sep 22 '19 at 17:16

If you want to utilize more of the machine's available resources, it's easier to delegate this job to an outer supervisor that manages multiple Python processes.

And spawning more processes when a capacity limit is reached sounds like a job for a load balancer.

Delegating these jobs to time-proven, de facto standard solutions seems a better choice than writing your own in Python. I'm also skeptical of the idea of mixing application business logic with deployment-related details that may change depending on the concrete server infrastructure.

aiohttp has a nice manual on the basic deployment process.

Mikhail Gerasimov
  • The purpose of this code is to provide a consistent and scalable way to perform random-access reads against S3 for ML workloads. A single GPU is processing the data, and to keep a GPU busy we need a stream of 300-700 MB/s of data off S3 (this is generally raw, uncompressible data). These are all needed within a single Python process. Our S3 endpoint provides a nice scalable, sharable platform for us, capable of feeding well above these rates; I'm tasked with simply getting the data out of S3 and onto a GPU given all the memory and CPU I need to do so. – David Parks Sep 20 '19 at 21:26