I am using ONNX Runtime to run inference on a UNet model, and as part of the preprocessing I have to convert an Emgu CV matrix to an OnnxRuntime DenseTensor.

I achieved this with two nested for loops, which is unfortunately quite slow:

            // NCHW tensor; note WIDTH and HEIGHT are swapped here and in the
            // indexing below (discussed in the comments under the answer)
            var data = new DenseTensor<float>(new[] { 1, 3, WIDTH, HEIGHT });

            for (int y = 0; y < HEIGHT; y++)
            {
                for (int x = 0; x < WIDTH; x++)
                {
                    // reorder BGR -> RGB and scale 0-255 bytes to 0-1 floats
                    data[0, 0, x, y] = image.GetValue(2, y, x) / 255.0f;
                    data[0, 1, x, y] = image.GetValue(1, y, x) / 255.0f;
                    data[0, 2, x, y] = image.GetValue(0, y, x) / 255.0f;
                }
            }

Then I found out that there is a method which converts an Array to a DenseTensor. I wanted to use it as follows:

        var imgToPredictFloat = new Mat(image.Height, image.Width, DepthType.Cv32F, 3);
        image.ConvertTo(imgToPredictFloat, DepthType.Cv32F, 1 / 255.0);
        CvInvoke.CvtColor(imgToPredictFloat, imgToPredictFloat, ColorConversion.Bgra2Rgb);

        var data = imgToPredictFloat.GetData().ToTensor<float>();
        var reshaped = data.Reshape(new int[] { 1, 3, WIDTH, HEIGHT });

This would greatly improve the performance; however, the layout of the output tensor is not correct (not the same as the one produced by the for loops), so the model obviously won't work. Any suggestions on how to get the array into the correct layout?

The code also performs the conversion from 0-255 ints to 0-1 floats and from BGR to RGB channel order.
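
For clarity, the nested loops above are essentially an HWC-to-CHW transpose. A minimal sketch of that transpose over a flat float buffer (the HwcToNchw helper, the flattened hwc input and the standard { 1, 3, H, W } shape are illustrative assumptions, not my actual code):

        // Illustrative only: repack an interleaved HWC float buffer (already
        // scaled to 0-1 and channel-swapped to RGB) into a planar NCHW tensor.
        static DenseTensor<float> HwcToNchw(float[] hwc, int height, int width)
        {
            var tensor = new DenseTensor<float>(new[] { 1, 3, height, width });
            var dst = tensor.Buffer.Span;            // flat NCHW backing store
            int plane = height * width;
            for (int i = 0; i < plane; i++)          // i == y * width + x
            {
                dst[i]             = hwc[3 * i];     // channel 0
                dst[plane + i]     = hwc[3 * i + 1]; // channel 1
                dst[2 * plane + i] = hwc[3 * i + 2]; // channel 2
            }
            return tensor;
        }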

  • Just a suggestion in case reshaping won't work: there is a good chance `image.GetValue()` causes the performance drop because it probably does a boundary check. Maybe there are faster ways to iterate over `image`? – Good Night Nerd Pride Jul 08 '21 at 14:58
  • Just from looking at your code, I would convert the Mat to CV_32F and scale, then [split](https://docs.opencv.org/3.4/d2/de8/group__core__array.html#ga0547c7fed86152d7e9d0096029c8518a) the channels (a sketch of this approach follows these comments). [`Mat.data`](https://docs.opencv.org/4.5.2/d3/d63/classcv_1_1Mat.html#a4d33bed1c850265370d2af0ff02e1564) gives an array pointer. – beaker Jul 08 '21 at 15:04
  • Yeah, those optimizations sound good! I will explore these as well. – Michal Cicatka Jul 08 '21 at 15:48
  • try cv::dnn::blobFromImage function – Micka Jul 08 '21 at 21:50
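
A sketch of beaker's split-channels suggestion (Emgu API usage assumed, untested): CvInvoke.Split yields one contiguous float plane per channel, so each plane can be block-copied into the tensor instead of indexed per pixel:

        // Assumed Emgu API, untested sketch of the split-channels idea.
        // using Emgu.CV; using Emgu.CV.Util;
        // using System.Runtime.InteropServices;
        // using Microsoft.ML.OnnxRuntime.Tensors;
        var planes = new VectorOfMat();
        CvInvoke.Split(imgToPredictFloat, planes);   // three CV_32FC1 planes
        var tensor = new DenseTensor<float>(new[] { 1, 3, height, width });
        int plane = height * width;
        var chan = new float[plane];
        for (int c = 0; c < 3; c++)
        {
            // planes[2 - c] reverses BGR to RGB; drop the reversal if
            // CvtColor already produced RGB order
            Marshal.Copy(planes[2 - c].DataPointer, chan, 0, plane);
            chan.AsSpan().CopyTo(tensor.Buffer.Span.Slice(c * plane, plane));
        }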

1 Answer


This is how I have used cv::Mat with ONNX Runtime (C++):

const wchar_t* model_path = L"C:/data/DNN/ONNX/ResNet/resnet152v2/resnet152-v2-7.onnx";

printf("Using Onnxruntime C++ API\n");
Ort::Session session(env, model_path, session_options);


//*************************************************************************
// print model input layer (node names, types, shape etc.)
Ort::AllocatorWithDefaultOptions allocator;

size_t num_output_nodes = session.GetOutputCount();
std::vector<char*> outputNames;
for (size_t i = 0; i < num_output_nodes; ++i)
{
    char* name = session.GetOutputName(i, allocator);
    std::cout << "output: " << name << std::endl;
    outputNames.push_back(name);
}


// print number of model input nodes
size_t num_input_nodes = session.GetInputCount();
std::vector<const char*> input_node_names(num_input_nodes);
std::vector<int64_t> input_node_dims;  // simplify... this model has only 1 input node {1, 3, 224, 224}.
                                       // Otherwise need vector<vector<>>

printf("Number of inputs = %zu\n", num_input_nodes);

// iterate over all input nodes
for (int i = 0; i < num_input_nodes; i++) {
    // print input node names
    char* input_name = session.GetInputName(i, allocator);
    printf("Input %d : name=%s\n", i, input_name);
    input_node_names[i] = input_name;

    // print input node types
    Ort::TypeInfo type_info = session.GetInputTypeInfo(i);
    auto tensor_info = type_info.GetTensorTypeAndShapeInfo();

    ONNXTensorElementDataType type = tensor_info.GetElementType();
    printf("Input %d : type=%d\n", i, type);

    // print input shapes/dims
    input_node_dims = tensor_info.GetShape();
    printf("Input %d : num_dims=%zu\n", i, input_node_dims.size());
    for (int j = 0; j < input_node_dims.size(); j++)
        printf("Input %d : dim %d=%jd\n", i, j, input_node_dims[j]);
}


cv::Size dnnInputSize;
cv::Scalar mean;
cv::Scalar std;
bool rgb = true;

//cv::Mat inputImage = cv::imread("C:/TestImages/kitten_01.jpg");
cv::Mat inputImage = cv::imread("C:/TestImages/slug_01.jpg");

rgb = true;
dnnInputSize = cv::Size(224, 224);
mean[0] = 0.485;
mean[1] = 0.456;
mean[2] = 0.406;
std[0] = 0.229;
std[1] = 0.224;
std[2] = 0.225;

cv::Mat blob;
// ONNX: (N x 3 x H x W); note that blobFromImage only scales and subtracts
// the mean; the std values above are declared but not applied by it
cv::dnn::blobFromImage(inputImage, blob, 1.0 / 255.0, dnnInputSize, mean, rgb, false);

size_t input_tensor_size = blob.total();

std::vector<float> input_tensor_values(input_tensor_size);
// blob is a continuous 4-D Mat, so read it through the raw float pointer
// (Mat::at<float>(i) on a 4-D Mat trips a debug-build assertion)
const float* blob_data = reinterpret_cast<const float*>(blob.data);
std::copy(blob_data, blob_data + input_tensor_size, input_tensor_values.begin());
std::vector<const char*> output_node_names = { outputNames.front() };

// create input tensor object from data values
auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
Ort::Value input_tensor = Ort::Value::CreateTensor<float>(memory_info, input_tensor_values.data(), input_tensor_size, input_node_dims.data(), 4);
assert(input_tensor.IsTensor());

// score model & input tensor, get back output tensor
auto output_tensors = session.Run(Ort::RunOptions{ nullptr }, input_node_names.data(), &input_tensor, 1, output_node_names.data(), 1);
assert(output_tensors.size() == 1 && output_tensors.front().IsTensor());

// Get pointer to output tensor float values
float* floatarr = output_tensors.front().GetTensorMutableData<float>();
// model-specific sanity check; adjust or remove for your own model
assert(abs(floatarr[0] - 0.000045) < 1e-6);

cv::Mat1f result = cv::Mat1f(1000, 1, floatarr);

cv::Point classIdPoint;
double confidence = 0;
minMaxLoc(result, 0, &confidence, 0, &classIdPoint);
int classId = classIdPoint.y;
std::cout << "confidence: " << confidence << std::endl;
std::cout << "class: " << classId << std::endl;

The actual conversion part that you need is, imho, the following (adjust the size and mean/std according to your network):

cv::Mat inputImage = cv::imread("C:/TestImages/slug_01.jpg");

rgb = true;
dnnInputSize = cv::Size(224, 224);
mean[0] = 0.485;
mean[1] = 0.456;
mean[2] = 0.406;
std[0] = 0.229;
std[1] = 0.224;
std[2] = 0.225;

cv::Mat blob;
// ONNX: (N x 3 x H x W)
cv::dnn::blobFromImage(inputImage, blob, 1.0 / 255.0, dnnInputSize, mean, rgb, false);
– Micka
  • Unfortunately, this does not work: the layout of the output array is different from the wanted one. The biggest obstacle is the "ToTensor()" method, which I think rearranges the array. Otherwise the layout of the raw output from blobFromImage() would work (see the sketch after this thread). – Michal Cicatka Jul 12 '21 at 07:25
  • @MichalCicatka can you tell me what format the tensor should have? Some data format/arrangement definition? – Micka Jul 12 '21 at 07:29
  • That should be clear from the nested for loop in the original post: a 4D dense tensor where the first dimension is always 1 (batch/number of images), the second the channel, the third the width/x-coordinate and the fourth the height/y-coordinate. I used this example: https://www.onnxruntime.ai/docs/tutorials/tutorials/resnet50_csharp.html – Michal Cicatka Jul 12 '21 at 10:49
  • So it should be the same as what my code sample expects and blobFromImage creates, (N x 3 x H x W) with N = 1, except that W and H are switched? – Micka Jul 12 '21 at 12:44
  • You're right! I swapped the H and W dimensions, but I swapped them in the reconstruction part of my code as well, so it was functioning correctly. The layout, however, was not the same as the one from blobFromImage(), which confused me. I tested the blobFromImage() function and unfortunately it's slightly slower than two nested for loops where one of them is replaced with Parallel.For... – Michal Cicatka Jul 13 '21 at 06:39
  • OK, what does "is quite slow" mean exactly? It should be very fast compared to the actual processing of the DNN. – Micka Jul 13 '21 at 07:26
  • I am processing a 1900x1800 image and the CNN was trained on sliced 512x512 images. This means that to run inference on one large image I have to crop it into sixteen smaller images and then reconstruct the result back together. So even small performance lags affect the overall performance quite heavily. Overall performance with BlobFromImage() was around 3000 ms, nested loops around 3600 ms and a nested parallel for loop around 2500 ms. Is my explanation understandable? – Michal Cicatka Jul 13 '21 at 08:08
  • So about 190 ms per blobFromImage call on a 512x512 subimage. That sounds way too slow to me. I would expect something like < 5 ms on an i7 desktop PC, but I didn't measure it. – Micka Jul 13 '21 at 08:52
  • Yes, but the 190 ms includes inference (110 ms) and post-processing. For the parallel for loop, the conversion itself takes around 30 ms. Still not the 5 ms you are talking about, though. I would guess that it could be faster, hence the question. – Michal Cicatka Jul 13 '21 at 09:58
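
Following up on the thread above, a sketch (assumed Emgu API and parameter names, untested) of feeding the NCHW blob produced by Emgu's DnnInvoke.BlobFromImage straight into a DenseTensor, bypassing ToTensor() and the reshape entirely:

        // Assumed Emgu API, untested: BlobFromImage already emits the
        // contiguous NCHW float layout, so its data can be bulk-copied.
        // using Emgu.CV; using Emgu.CV.Dnn; using Emgu.CV.Structure;
        // using System.Runtime.InteropServices;
        // using Microsoft.ML.OnnxRuntime.Tensors;
        Mat blob = DnnInvoke.BlobFromImage(image, 1.0 / 255.0,
            new System.Drawing.Size(width, height), new MCvScalar(),
            swapRB: true, crop: false);              // NCHW float, BGR -> RGB

        var flat = new float[3 * height * width];
        Marshal.Copy(blob.DataPointer, flat, 0, flat.Length);
        var tensor = new DenseTensor<float>(flat, new[] { 1, 3, height, width });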