
I have code that runs three inference sessions one after the other. The problem I am having is that it only runs at top performance on my Mac and on the Windows VM (VMware) that runs on my Mac. There it takes between 58 and 68s to run my test set.

When I ask someone else using Windows (with similar hardware: Intel i7, 6-8 cores) to test, it runs in 150s. If I ask the same person to run the inference through an equivalent Python script, it runs 2-3x faster than that, on par with my original Mac machine.

I have no idea what else to try. Here is the relevant part of the code:

#include "onnxruntime-osx-universal2-1.13.1/include/onnxruntime_cxx_api.h"
// ...
Ort::Env OrtEnv;
Ort::Session objectNet{OrtEnv, objectModelBuffer.constData(), (size_t) objectModelBuffer.size(), Ort::SessionOptions{}}; // x3, one for each model

std::vector<uint16_t> inputTensorValues;
normalize(img, {aiPanoWidth, aiPanoHeight}, inputTensorValues); // convert the cv::Mat img into std::vector<uint16_t>

std::array<int64_t, 4> input_shape_{ 1, 3, aiPanoHeight, aiPanoWidth };

auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
Ort::Value input_tensor_ = Ort::Value::CreateTensor(
    allocator_info, inputTensorValues.data(), sizeof(uint16_t) * inputTensorValues.size(),
    input_shape_.data(), input_shape_.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16);

const char* input_names[] = { "images" };
const char* output_names[] = { "output" };
std::vector<Ort::Value> ort_outputs = objectNet.Run(Ort::RunOptions{ nullptr }, input_names, &input_tensor_, 1, output_names, 1);

//... after this I read the output, but the step above is already 2-3x slower in C++ than in Python
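For reference, reading that output afterwards looks roughly like this; a minimal sketch assuming the output tensor is also FP16 (the shape is model-specific and just illustrative):

auto outInfo  = ort_outputs[0].GetTensorTypeAndShapeInfo();
auto outShape = outInfo.GetShape();  // model-specific, e.g. {1, N, M}
Ort::Float16_t* outData = ort_outputs[0].GetTensorMutableData<Ort::Float16_t>();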

Some more details:

  • The code above runs in the background in a worker thread (needed since the GUI runs in the main thread)
  • I am using float16 to reduce the footprint of the AI models (a sketch of the conversion follows this list)
  • I am using the vanilla onnxruntime DLLs provided by Microsoft (v1.13.1)
  • I compiled my code with both MinGW GCC and VC++ 2022. The result is similar in both, with a small advantage to VC++. I believe it is other parts of my code that run faster, not necessarily the inference.
  • I don't want to run it on the GPU.
  • I'm compiling with /arch:AVX /openmp -O2 and linking with -lonnxruntime
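
For reference, normalize() does roughly the following. This is a minimal sketch assuming OpenCV >= 4.x (for CV_16F) and a model that expects RGB scaled to [0,1]; the real preprocessing (mean/std, etc.) is model-specific:

#include <opencv2/opencv.hpp>
#include <cstdint>
#include <vector>

// Resize, BGR->RGB, scale to [0,1], convert to FP16, then repack HWC -> CHW
// straight into the uint16_t buffer (FP16 values are bit-compatible with
// uint16_t, which is what the CreateTensor call above expects).
void normalize(const cv::Mat& img, cv::Size size, std::vector<uint16_t>& out)
{
    cv::Mat resized, rgb, f16;
    cv::resize(img, resized, size);
    cv::cvtColor(resized, rgb, cv::COLOR_BGR2RGB);
    rgb.convertTo(f16, CV_16F, 1.0 / 255.0);           // 8UC3 -> 16FC3, scaled to [0,1]

    out.resize(3 * size.height * size.width);
    cv::Mat planes[3];
    for (int c = 0; c < 3; ++c)                         // each plane wraps a slice of `out`
        planes[c] = cv::Mat(size, CV_16F, out.data() + c * size.height * size.width);
    cv::split(f16, planes);                             // writes the CHW planes into `out`
}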
  • How are you compiling it? A pre-existing Python extension is probably compiled with optimizations enabled, but most C/C++ code is, by default, compiled without optimizations enabled, and that will murder the performance. If the Mac happens to (by default or manual configuration) compile with optimizations, the Windows box without, this wouldn't be unexpected. – ShadowRanger Jan 26 '23 at 00:18
  • @ShadowRanger I did not compile the onnxruntime lib files. I downloaded them straight from Microsoft's GitHub repo. I assumed that they were all compiled with the same configuration. – Adriel Jr Jan 26 '23 at 00:22
  • @ShadowRanger I compiled again and the performance was about the same, so that did not help. I was later able to get massive improvements by tweaking the session options, as explained in my answer below. – Adriel Jr Jan 27 '23 at 23:14

1 Answer


After almost a week of restless profiling, I was able to significantly improve the performance on Windows PCs (up to 2x) by tweaking the thread options for the session.

Ort::SessionOptions s;  // needs <thread> and <algorithm>
s.SetInterOpNumThreads(1);  // no parallelism across graph nodes
s.SetIntraOpNumThreads(std::min(6, (int) std::thread::hardware_concurrency()));  // cap per-op threads at 6
s.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

Ort::Session objectNet{OrtEnv, objectModelBuffer.constData(), (size_t) objectModelBuffer.size(), s};

What I think was happening is that ONNX Runtime was allocating an excessive number of threads, and the inter-thread communication/synchronization overhead became significant.

Since hardcoding values is not good practice, I pull the number of CPUs from the standard thread library and cap ONNX Runtime at that value (or at 6, whichever is lower). I'm afraid of increasing it beyond 6 and getting poor results again. I tested this setup on my Mac (6-core i7) and the performance was the same as before. On my Windows VM it got 22% faster than before, and on my friend's Windows PC (8-core i7) it got 2x faster.

I was really hoping that ONNX Runtime would do a better job of optimizing for the available resources.

Another thing I should mention: reverting the model from FP16 back to FP32 also helped a little with this result, especially on the Windows PC. On my Mac and the Windows VM the difference was negligible.
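
For comparison, the FP32 input only differs in the buffer type; normalizeF32 here stands in for a hypothetical float variant of the normalize() helper from the question:

std::vector<float> inputTensorValues;
normalizeF32(img, {aiPanoWidth, aiPanoHeight}, inputTensorValues); // hypothetical FP32 variant of normalize()

// The templated overload deduces ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT from float
// and takes the element count instead of the byte size.
Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(
    allocator_info, inputTensorValues.data(), inputTensorValues.size(),
    input_shape_.data(), input_shape_.size());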

  • @ShadowRanger I'm compiling my application with -O2. The onnxruntime library was compiled with its defaults - I assume that enables optimizations. – soumeng78 Aug 25 '23 at 02:09
  • I tried your suggestions but it slowed down further. I tried various other values for the inter- and intra-op thread counts, and also tried SetExecutionMode(ORT_PARALLEL), but nothing worked. – soumeng78 Aug 25 '23 at 02:11
  • I'm assuming the onnxruntime Python version is also using threads when doing batch prediction, whereas in my C++ code I'm not using batch prediction as of now. Is there a way to disable threading in Python while doing batch prediction? – soumeng78 Aug 25 '23 at 02:14
  • @soumeng78 What certainly will help is quantizing your model to INT8. In this case it should be at least 2x faster, but with a small penalty in quality. – Adriel Jr Aug 25 '23 at 13:42
  • Thank you, I'll try that. Do you have a pointer to reference code for how to do this with ONNX? I'm trying ONNX for the first time. – soumeng78 Aug 25 '23 at 17:25