
I am trying to build an API that serves a PyTorch model. However, as soon as I increase WEB_CONCURRENCY above 1, far more worker processes are spawned than expected and the server slows down dramatically, even for a single request.

Example code:

api.sh

export WEB_CONCURRENCY=2

python api.py

api.py

import os

import uvicorn
from starlette.applications import Starlette
from starlette.responses import UJSONResponse
from starlette.middleware.gzip import GZipMiddleware

from mymodel import Model


model = Model()
app = Starlette(debug=False)
app.add_middleware(GZipMiddleware, minimum_size=1000)


@app.route('/process', methods=['GET', 'POST', 'HEAD'])
async def add_styles(request):
    if request.method == 'GET':
        params = request.query_params
    elif request.method == 'POST':
        params = await request.json()
    elif request.method == 'HEAD':
        return UJSONResponse([])

    print('===Request body===')
    print(params)

    # Greatly simplified: internally there is a lot going on, including
    # file reading/writing and spawning subprocesses with `popen` that do
    # further processing, but I don't think that should matter here.
    model_output = model(params.get('data', []))

    return UJSONResponse(model_output)


if __name__ == '__main__':
    uvicorn.run('api:app', host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
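
For context: as far as I can tell, uvicorn reads WEB_CONCURRENCY to decide how many worker processes to start, so (if I understand its config correctly) the setup above should be equivalent to passing workers explicitly:

# Assumed equivalent of WEB_CONCURRENCY=2, based on my reading of uvicorn's config
uvicorn.run('api:app', host='0.0.0.0', port=8080, workers=2)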

With WEB_CONCURRENCY=1 in api.sh, only one Python process shows up when nvidia-smi is run, and the model uses 1.2GB of VRAM. A request takes ~0.7s.

With WEB_CONCURRENCY=2 in api.sh, upwards of 8 Python processes can show up in nvidia-smi, together using around 8GB of VRAM. A single request can then take up to 3s, if you're lucky and don't hit an out-of-memory error.

I am using Python 3.8.

Why isn't PyTorch using the expected 2.4GB of VRAM when WEB_CONCURRENCY=2? And why does everything slow down so much?

Answer:

If anyone else stumbles upon this issue: just use gunicorn to manage the workers instead of relying on WEB_CONCURRENCY. Gunicorn runs each worker as a fully separate process, so there is no internal conflict between them.

So instead of running it with python api.py, run it with: gunicorn -w 2 api:app -k uvicorn.workers.UvicornWorker
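
For completeness, here is the full invocation with the same host/port binding as the original uvicorn.run call (the -b value is an assumption carried over from that setup):

gunicorn api:app \
    -w 2 \
    -k uvicorn.workers.UvicornWorker \
    -b 0.0.0.0:8080

Each gunicorn worker is a separate process that imports api.py and loads its own copy of the model, so with -w 2 you should see two Python processes in nvidia-smi using roughly 2 × 1.2GB of VRAM.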