I have code that runs 3 inference sessions one after the other. The problem I am having is that it only runs at top performance on my Mac and in the Windows VM (VMware) that runs on my Mac: there it takes 58-68 s to run my test set.
When I ask someone else running Windows natively (with similar hardware: an Intel i7 with 6-8 cores) to test, it takes about 150 s. If I ask the same person to run the inference through an equivalent Python script, it runs 2-3x faster than that, on par with my original Mac.
I have no idea what else to try. Here is the relevant part of the code:
#include "onnxruntime-osx-universal2-1.13.1/include/onnxruntime_cxx_api.h"
// ...
Ort::Env OrtEnv;
Ort::Session objectNet{OrtEnv, objectModelBuffer.constData(), (size_t) objectModelBuffer.size(), Ort::SessionOptions{}}; // x3, one for each model
std::vector<uint16_t> inputTensorValues;
normalize(img, {aiPanoWidth, aiPanoHeight}, inputTensorValues); // converts the cv::Mat img into a std::vector<uint16_t> of float16 values
std::array<int64_t, 4> input_shape_{ 1, 3, aiPanoHeight, aiPanoWidth };
auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
Ort::Value input_tensor_ = Ort::Value::CreateTensor(allocator_info, inputTensorValues.data(), sizeof(uint16_t) * inputTensorValues.size(), input_shape_.data(), input_shape_.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16);
const char* input_names[] = { "images" };
const char* output_names[] = { "output" };
std::vector<Ort::Value> ort_outputs = objectNet.Run(Ort::RunOptions{ nullptr }, input_names, &input_tensor_, 1, output_names, 1);
// ... after this I read the output, but the Run() call above is already 2-3x slower in C++ than in Python
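For context, this is roughly what the normalize() step does: resize, scale to [0, 1], reorder HWC to CHW, and store each value as an IEEE half whose raw 16 bits go into the uint16_t vector that the FLOAT16 tensor points at. This is a simplified sketch, not my exact code (the real preprocessing constants differ), and it assumes OpenCV 4.x for CV_16F support:

#include <opencv2/opencv.hpp>
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified stand-in for my real normalize(): the shipping version differs in details.
void normalize(const cv::Mat& img, cv::Size size, std::vector<uint16_t>& out)
{
    cv::Mat resized, f32;
    cv::resize(img, resized, size);
    resized.convertTo(f32, CV_32FC3, 1.0 / 255.0);   // float32 pixels in [0, 1]

    const size_t planeSize = (size_t)f32.rows * f32.cols;
    out.resize(3 * planeSize);

    std::vector<cv::Mat> channels(3);
    cv::split(f32, channels);                        // HWC -> three planes (CHW order)
    for (int c = 0; c < 3; ++c) {
        cv::Mat half;
        channels[c].convertTo(half, CV_16F);         // float32 -> float16
        std::memcpy(out.data() + c * planeSize, half.ptr(), planeSize * sizeof(uint16_t));
    }
}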
Some more details:
- The code above runs in the background in a worker thread (needed since the GUI runs in the main thread); see the sketch after this list.
- I am using float16 to reduce the footprint of the AI models
- I use the vanilla onnxruntime DLLs provided by Microsoft (v1.13.1)
- I compiled my code with both MinGW GCC and VC++ 2022. The result is similar in both, with a small advantage to VC++; I believe it is other parts of my code that run faster there, not necessarily the inference.
- I don't want to run it on the GPU.
- I'm compiling with /arch:AVX, /openmp, -O2 and linking with -lonnxruntime
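As mentioned in the worker-thread bullet, here is a condensed sketch of how the session and the Run() call are driven. The explicit Ort::SessionOptions values are illustrative only, to show what the default-constructed Ort::SessionOptions{} I currently pass leaves unspecified; my real code uses the defaults as shown above, and the thread counts here are not what the application ships with:

#include <onnxruntime_cxx_api.h>
#include <array>
#include <cstdint>
#include <thread>
#include <vector>

void runInWorker(const std::vector<char>& modelBytes,
                 std::vector<uint16_t>& inputFp16,    // packed float16 pixels (CHW)
                 int64_t h, int64_t w)
{
    std::thread worker([&] {
        Ort::Env env;
        Ort::SessionOptions opts;
        opts.SetIntraOpNumThreads(6);                                  // illustrative: e.g. number of physical cores
        opts.SetInterOpNumThreads(1);
        opts.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
        opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

        Ort::Session session(env, modelBytes.data(), modelBytes.size(), opts);

        auto memInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
        std::array<int64_t, 4> shape{1, 3, h, w};
        Ort::Value input = Ort::Value::CreateTensor(
            memInfo, inputFp16.data(), sizeof(uint16_t) * inputFp16.size(),
            shape.data(), shape.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16);

        const char* inNames[]  = {"images"};
        const char* outNames[] = {"output"};
        auto outputs = session.Run(Ort::RunOptions{nullptr},
                                   inNames, &input, 1, outNames, 1);
        // ... hand the outputs back to the GUI thread ...
    });
    worker.join();   // the real app keeps the GUI responsive and signals completion instead
}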