
I'm using the ONNX Runtime library to run inference on my deep neural network model in C++. Inference on the CPU takes about 10 ms. When I use the GPU (NVIDIA GeForce GTX 1050 Ti), inference takes about 4 ms for roughly the first minute of processing, but after that the time suddenly increases to over 25 ms. What is the problem? I'm using CUDA 11.8 with the following ONNX Runtime options enabled:

// Single intra-op thread, extended graph optimizations
sessionOptions.SetIntraOpNumThreads(1);
sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);

// Attach the CUDA execution provider on GPU 0
OrtCUDAProviderOptions cuda_options;
cuda_options.device_id = 0;
sessionOptions.AppendExecutionProvider_CUDA(cuda_options);
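For completeness, the session itself is created roughly like this (a sketch; "model.onnx" is a placeholder path, and env/my_session are names from my code):

#include <onnxruntime_cxx_api.h>
#include <memory>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "inference");
// sessionOptions configured as above, then:
// "model.onnx" is a placeholder; on Windows the path argument is a wide string (ORTCHAR_T)
auto my_session = std::make_unique<Ort::Session>(env, "model.onnx", sessionOptions);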

And I'm measuring the inference time like this:

auto start = std::chrono::high_resolution_clock::now();
my_session->Run(Ort::RunOptions{ nullptr }, inputNames.data(), inputTensors.data(), 1,
                outputNames.data(), outputTensors.data(), 2);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Time taken by function: "
          << duration.count() << " microseconds" << std::endl;
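In case a single-call measurement is misleading, here is a sketch of a more robust version of the same measurement, reusing the session and tensors above: a short warm-up (the first GPU calls include CUDA initialization) followed by an average over many runs.

// Warm-up: exclude one-time CUDA context/kernel initialization from the measurement
for (int i = 0; i < 10; ++i) {
    my_session->Run(Ort::RunOptions{ nullptr }, inputNames.data(), inputTensors.data(), 1,
                    outputNames.data(), outputTensors.data(), 2);
}

// Average over many runs to smooth out per-call jitter
constexpr int kRuns = 100;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < kRuns; ++i) {
    my_session->Run(Ort::RunOptions{ nullptr }, inputNames.data(), inputTensors.data(), 1,
                    outputNames.data(), outputTensors.data(), 2);
}
auto stop = std::chrono::high_resolution_clock::now();
auto total_us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
std::cout << "Average time per Run: " << total_us / kRuns << " microseconds" << std::endl;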

And the result: [screenshot of the measured inference times]

