
I am running ML inference for image recognition on the GPU using onnxruntime, and I am seeing an upper limit on how much performance improvement batching of images gives me: the per-image inference time drops up to a batch_size of around 8, and beyond that it stays roughly constant. I assume this must be because some GPU resource is maxed out, as I don't see any such limitation mentioned in the onnxruntime documentation. I tried using the pynvml.smi module to get nvidia_smi and printed some utilization figures during inference, like this:

from pynvml.smi import nvidia_smi

gpu_util, mem_util = [], []  # sampled once per inference iteration
utilization_percent = nvidia_smi.getInstance().DeviceQuery()['gpu'][0]['utilization']
gpu_util.append(utilization_percent['gpu_util'])
mem_util.append(utilization_percent['memory_util'])
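
For context, this is roughly how I am measuring the inference time per batch (the model path and input shape below are placeholders, not my actual model or preprocessing):

import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name

for batch_size in [1, 2, 4, 8, 16, 32, 64]:
    # dummy batch of images, NCHW layout
    batch = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)
    start = time.perf_counter()
    sess.run(None, {input_name: batch})
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed * 1000:.1f} ms total, "
          f"{elapsed * 1000 / batch_size:.2f} ms per image")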

What I do see is that both gpu_util and memory_util stay below 25% for the entire inference run, even at batch sizes of 32 or 64, so they are unlikely to be the bottleneck. My guess, then, is that a bus load limitation might be causing this. I did not find any option within nvidia-smi to print the GPU bus load. How can I find the bus load during inference?
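
In case it matters, this is the kind of thing I was planning to try next with the lower-level pynvml bindings; I am assuming here that nvmlDeviceGetPcieThroughput reports the PCIe traffic I am after (values in KB/s, per the NVML docs):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# sample these inside the inference loop, alongside gpu_util / mem_util
tx_kb = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
rx_kb = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
print(f"PCIe TX: {tx_kb} KB/s, RX: {rx_kb} KB/s")

pynvml.nvmlShutdown()

Is that the right way to see the bus load during inference, or is there something in nvidia-smi / DeviceQuery itself that exposes it?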

