
The application I deploy is FastAPI with Uvicorn under K8s. While figuring out how to Dockerize the application, I decided to run Uvicorn without Gunicorn and to scale up/down based on the request load the application receives. I did a lot of load testing and discovered that with the default of 1 Uvicorn worker I get 3.5 RPS, while with 8 workers I can easily get 22 RPS (I didn't test for more since that's a great result for me).

What I expected regarding resources was that I would have to provide a CPU limit of 8 (I assume every worker runs as one process with one thread), but I only saw an increase in memory usage, and barely any in CPU. Maybe that's because the app doesn't use much CPU, but is it even possible for it to use more than 1 CPU? So far it hasn't used more than one.

How do Uvicorn workers work? How should I calculate how many workers I need for the app? I didn't find any useful information.

Again, my goal is to keep a slim machine of 1 CPU, with an autoscaling system.

[Locust screenshot showing 20 RPS]

[Grafana screenshot of CPU usage]

DorZ

2 Answers


When you run uvicorn with a --workers value greater than 1, it spawns subprocesses internally using multiprocessing.
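As a minimal sketch (the module name main and the /ping route are my own illustrative choices, not from the question), passing workers=8 to uvicorn.run is equivalent to --workers 8 on the command line:

    # main.py - hypothetical minimal app; module and route names are illustrative
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/ping")
    async def ping():
        return {"status": "ok"}

    if __name__ == "__main__":
        import uvicorn
        # Equivalent to: uvicorn main:app --workers 8
        # The parent process spawns 8 subprocesses, each running its own
        # event loop, sharing the same listening socket.
        uvicorn.run("main:app", host="0.0.0.0", port=8000, workers=8)

Note that uvicorn requires the application as an import string ("main:app") when workers > 1, since each subprocess must import it independently.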

You have to remember that uvicorn is asynchronous and that HTTP servers are generally bottlenecked by network latency rather than computation. So it could be that your workload isn't particularly CPU-bound and is instead I/O-bound.
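As a rough sketch of that distinction (endpoint names are mine, not from the question), the async handler below releases the event loop while it waits, while the computational one keeps a CPU busy for the whole call:

    import asyncio
    import time

    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/io-bound")
    async def io_bound():
        # Simulates a DB/network round trip; during the await, the
        # event loop is free to serve other requests in this worker.
        await asyncio.sleep(0.25)
        return {"waited_s": 0.25}

    @app.get("/cpu-bound")
    def cpu_bound():
        # Pure Python computation holds the GIL for its duration, so
        # within one process this work is effectively serialized; only
        # more worker processes add CPU parallelism.
        t0 = time.perf_counter()
        sum(i * i for i in range(10_000_000))
        return {"elapsed_s": time.perf_counter() - t0}

If an endpoint looks like the first one, adding workers buys concurrency cheaply; if it looks like the second, you need roughly one CPU per worker.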

Without knowing more about the type of work the server does on each request, the best way to determine how many workers you need is empirical experimentation. In other words, just test it until you hit a limit.
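Since you are already using Locust, a minimal locustfile makes that experiment repeatable; rerun it while sweeping --workers (the /ping path is an assumption on my part):

    # locustfile.py - sketch for sweeping worker counts under load
    from locust import HttpUser, between, task

    class ApiUser(HttpUser):
        wait_time = between(0.1, 0.5)

        @task
        def ping(self):
            # Point this at a representative route of your app.
            self.client.get("/ping")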

Though the FastAPI documentation does include some guidance for your use case:

If you have a cluster of machines with Kubernetes, Docker Swarm Mode, Nomad, or another similar complex system to manage distributed containers on multiple machines, then you will probably want to handle replication at the cluster level instead of using a process manager (like Gunicorn with workers) in each container.

One of those distributed container management systems like Kubernetes normally has some integrated way of handling replication of containers while still supporting load balancing for the incoming requests. All at the cluster level.

In those cases, you would probably want to build a Docker image from scratch as explained above, installing your dependencies, and running a single Uvicorn process instead of running something like Gunicorn with Uvicorn workers. - FastAPI Docs

Emphasis mine.

plunker

In concert with @plunker's answer: if we were instead using synchronous workers with gunicorn (or indeed Apache with mod_perl, or myriad others), the processes timeshare the CPU(s) between them, and each request is handled one after another as the OS is able to schedule them. The individual process handling a single request blocks the CPU until the handler and all of its pending I/O have finished. In this scenario you need precisely as many CPUs as the number of requests you want handled simultaneously. With one CPU and any number of workers, your case would stay at about 3.5 requests per second: at that rate each request occupies the CPU for roughly 285 ms, and adding synchronous workers does not shorten that. Any excess requests are buffered by the master process up to some limit (e.g. 1000 pending requests).

If we have asynchronous workers, then as soon as an await call is made the worker can put the request to sleep and allow the CPU to take up another task. When the awaited event occurs (e.g. the DB responds with data), the task is requeued. An async worker and its CPU are therefore unblocked whenever await is executed, rather than only when the worker completes the whole request.
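A toy asyncio example (not uvicorn itself, and with made-up timings) shows why this matters:

    import asyncio
    import time

    async def handle_request(i: int) -> int:
        # Each simulated request spends 100 ms waiting on "I/O".
        await asyncio.sleep(0.1)
        return i

    async def main() -> None:
        t0 = time.perf_counter()
        # 100 concurrent requests complete in roughly 0.1 s in total on
        # one CPU, because the event loop interleaves them at each await.
        await asyncio.gather(*(handle_request(i) for i in range(100)))
        print(f"elapsed: {time.perf_counter() - t0:.2f}s")

    asyncio.run(main())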

Network requests occur in the domain of milliseconds, whereas the CPU operates in the domain of nanoseconds, so a single request to a DB or disk can block a CPU for potentially millions of operations (1 ms / 1 ns = 10^6).

Unless substantial processing happens in your worker (generally a bad idea for availability anyway), a single CPU might satisfy all workers' processing demands before the first DB request is answered. That may explain your roughly 6x throughput increase (3.5 to 22 RPS) when moving from one worker to eight.

How many workers can you run on one CPU?

A contemporary virtualised CPU may have 4-8 GB of memory available to it, and memory usage scales roughly linearly with the number of workers after the first. Allowing for growth of a worker over its lifespan, as well as leaving some memory for disk caching, leads me to recommend not allocating more than 50% of the available memory to workers. This is application-specific.
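As a back-of-envelope sketch, with every number assumed rather than measured:

    # Illustrative sizing only; substitute your own measurements.
    node_memory_gb = 4.0     # memory available to the node/container
    reserve_fraction = 0.5   # leave half for the OS, disk cache, growth
    worker_rss_gb = 0.15     # assumed RSS of one warmed-up worker

    budget_gb = node_memory_gb * reserve_fraction
    max_workers = int(budget_gb // worker_rss_gb)
    print(max_workers)       # -> 13 with these assumed numbers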

There are overheads associated with the master process dispatching traffic and expiring and respawning workers. In the worst case you might budget for it as if it were another worker.

Finally, we must consider the weakest part of the system: it might be a database shared with other apps, or it might be network bandwidth. Overloading a database can be much more harmful to service quality than capping throughput with a suboptimal number of workers.

These combined unknowns make it hard to name a number, as it varies so widely by application and environment. Tools like ApacheBench (ab) can be useful for smoking out performance limitations under parallel requests.

You may wish to run a fixed number of async workers per container in order to squeeze bang-for-buck out of one CPU, but I cannot comment on the relative efficiency of context switching between containers versus between async workers.

nerdstrike