I have a Linux instance with a single NVIDIA A30 GPU (24 GB), and I plan to host 3 similar APIs on it. The APIs are containers built from the same Docker image. I have given all 3 containers access to the GPU using the NVIDIA Container Toolkit, and, as expected, each container's API returns the desired outputs.
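For reference, the containers are started roughly like this (the image name and ports are placeholders, not my actual values):

```shell
# Each container is given full access to the single GPU via the
# NVIDIA Container Toolkit's --gpus flag (image/ports illustrative)
docker run -d --gpus all -p 8001:8000 my-api-image
docker run -d --gpus all -p 8002:8000 my-api-image
docker run -d --gpus all -p 8003:8000 my-api-image
```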
The problem is this: when only one container is up and it receives a request, the GPU performs at its maximum capacity. But once I bring up the remaining 2 containers and they also start receiving simultaneous requests, each container's throughput drops to roughly a third of what a single container achieves alone.
I have tried setting --shm-size and --memory in the docker run command to various values, but to no avail.
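One of the variants I tried looked like this (the specific sizes shown are just examples of the range I experimented with):

```shell
# Raising shared memory and the container RAM limit — made no
# difference to per-container GPU throughput (values illustrative)
docker run -d --gpus all --shm-size=8g --memory=16g -p 8001:8000 my-api-image
```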
How can I share the GPU across the 3 containers without this per-container slowdown? Any help would be appreciated.