
I have trained YOLO-v3 tiny on my custom dataset using PyTorch. To compare inference times, I tried onnxruntime on CPU along with PyTorch GPU and PyTorch CPU. The average running times are around:

onnxruntime CPU: 110 ms (CPU usage: 60%)
PyTorch GPU: 50 ms
PyTorch CPU: 165 ms (CPU usage: 40%)

All models run with batch size 1.

However, I don't understand how onnxruntime can be faster than PyTorch on CPU, since I have not used any of onnxruntime's optimization options. I just used this:

import onnxruntime

onnx_model = onnxruntime.InferenceSession('model.onnx')
onnx_model.run(None, {onnx_model.get_inputs()[0].name: input_imgs})

Can someone explain why it is faster without any optimization? Also, why is the CPU usage higher when using onnxruntime, and is there any way to keep it down?

Thanks in advance.

politinsa

2 Answers


ONNX Runtime works on a static ONNX graph, so it has a full view of the graph and can do a lot of optimizations that are impossible or harder to do with PyTorch, which executes eagerly, op by op. In a sense, it's similar to compiled vs. interpreted programming language implementations.
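For illustration, here is a minimal sketch of where that static graph comes from: the PyTorch model is exported once with torch.onnx.export, and ONNX Runtime then loads the fixed graph and can optimize it as a whole before it ever sees an input. The tiny Sequential model and the 416x416 input below are just stand-ins, not the actual YOLOv3-tiny network.

import torch
import onnxruntime

# Stand-in for the trained model; any nn.Module exports the same way
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 416, 416)  # batch size 1

# Export once: the resulting .onnx file is a fixed, static graph
torch.onnx.export(model, dummy_input, 'model.onnx',
                  input_names=['input'], output_names=['output'], opset_version=11)

# ONNX Runtime loads the whole graph up front and can plan/fuse it before the first run
session = onnxruntime.InferenceSession('model.onnx')
outputs = session.run(None, {'input': dummy_input.numpy()})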

Sergii Dymchenko

If you observe high CPU usage with onnxruntime-GPU:

The most common reason is that some layers are not supported by the CUDA EP (or whichever other EP you are trying to run). For each unsupported layer, onnxruntime transfers the whole tensor to the CPU, executes the operation there, and transfers the resulting tensor back to the GPU. This increases CPU usage and reduces speed compared to PyTorch GPU execution. You can easily observe which layer is causing the problem by profiling the inference and visualising the logs.
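A minimal sketch of how to check this, assuming a GPU build of onnxruntime (the file name and input shape are placeholders): list which execution providers were actually assigned, and enable the built-in profiler to get per-node timings in a JSON trace.

import numpy as np
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.enable_profiling = True  # write a JSON trace of per-node execution times

session = onnxruntime.InferenceSession(
    'model.onnx',
    sess_options,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)
print(session.get_providers())  # which providers the session is actually using

dummy = np.random.rand(1, 3, 416, 416).astype(np.float32)
session.run(None, {session.get_inputs()[0].name: dummy})

# Returns the path of the profile file; nodes that fall back to CPU show up in the trace
print(session.end_profiling())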

If you observe high CPU usage with onnxruntime-CPU:

The main reason here is the highly optimised inference engine and graph structure. You can configure how many CPU threads onnxruntime is allowed to use. If left unset, it will try to use most of the available cores to keep inference latency down, which results in higher CPU usage. By setting the thread-count options, you can trade inference speed against resource usage.
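As a rough sketch, the thread counts can be limited through SessionOptions; the specific numbers below are only examples to illustrate the trade-off.

import numpy as np
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = 2  # threads used inside a single operator
sess_options.inter_op_num_threads = 1  # threads used to run independent operators in parallel

session = onnxruntime.InferenceSession('model.onnx', sess_options)
dummy = np.random.rand(1, 3, 416, 416).astype(np.float32)
session.run(None, {session.get_inputs()[0].name: dummy})

Fewer threads usually means lower CPU usage but higher latency, so it is worth benchmarking a couple of values on your own machine.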

Why is it faster without any optimisation?

Actually, it does apply some optimisation during model load and on the first run: the graph is optimised and an execution plan is laid out over the available resources to minimise latency.
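To make that visible, a small sketch (file names are placeholders): the graph optimisation level defaults to ORT_ENABLE_ALL, and you can ask the session to dump the optimised graph it actually executes.

import onnxruntime

sess_options = onnxruntime.SessionOptions()
# ORT_ENABLE_ALL is already the default; set it explicitly here only for clarity
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
# Save the graph after optimisation so you can inspect the fused/folded nodes
sess_options.optimized_model_filepath = 'model_optimized.onnx'

session = onnxruntime.InferenceSession('model.onnx', sess_options)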

Deniz Beker