
There is a program with about 30 kernels, which operate on common data (cl_mem buffers) and are called in a fixed sequence inside a loop a large number of times (for example, 100000 iterations). The amount of data transferred to the video card is about x*10 MB, but the application's memory allocation (RAM usage) reaches several GB (both in Task Manager and in the memory usage view in Visual Studio).

The algorithm is as follows:

  • initialization of OpenCL (device, context, queue, program)
  • loading source data into RAM
  • allocation of a large number of buffers for intermediate data: cl_mem Xi = clCreateBuffer (...
  • copying the source data from RAM into the buffers with clEnqueueWriteBuffer (...
  • kernel creation: cl_kernel k_i = clCreateKernel (...
  • setting arguments for each kernel: clSetKernelArg (...
  • choosing global and local sizes for each kernel
  • starting a loop with a sequence of kernels executed via clEnqueueNDRangeKernel (...

for (long int i = 0; i < 100000; i++) {
    err = clEnqueueNDRangeKernel (...
    err |= clEnqueueNDRangeKernel (...
    ...
}

  • reading back the results with clEnqueueReadBuffer(...
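
The steps above, condensed into a sketch (every identifier here is a placeholder, not the real name from my code):

```c
/* Condensed sketch of the setup described above; all names are placeholders. */
cl_int err;
cl_context       context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue   = clCreateCommandQueue(context, device, 0, &err);
cl_program       program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

/* intermediate buffers, allocated once, in device memory only */
cl_mem X1 = clCreateBuffer(context, CL_MEM_READ_WRITE, X * Y * sizeof(float), NULL, &err);

/* upload source data once */
err = clEnqueueWriteBuffer(queue, X1, CL_TRUE, 0, X * Y * sizeof(float),
                           host_data, 0, NULL, NULL);

/* kernels and arguments, created and set once */
cl_kernel k1 = clCreateKernel(program, "kernel_1", &err);
err = clSetKernelArg(k1, 0, sizeof(cl_mem), &X1);

/* main loop */
for (long int i = 0; i < 100000; i++) {
    err = clEnqueueNDRangeKernel(queue, k1, 2, NULL, global, local, 0, NULL, NULL);
    /* ... the other ~30 kernels in sequence ... */
}

/* read back the results */
err = clEnqueueReadBuffer(queue, X1, CL_TRUE, 0, X * Y * sizeof(float),
                          host_out, 0, NULL, NULL);
```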

OpenCL computes everything and the expected result is obtained. But CPU RAM usage grows abnormally, disproportionately to the amount of processed data.

I don't understand it. I call cl_mem Xi = clCreateBuffer (... only once, I don't create buffers in each iteration, and I create these buffers only in GPU memory:

cl_mem buf_input = clCreateBuffer(context, CL_MEM_READ_WRITE, X * Y * sizeof(float), NULL, &err);

Why is the memory usage on the host (CPU RAM) so high?

I tried various clRelease... calls (clReleaseMemObject, clReleaseKernel, etc.). I also tried making the program fully re-enter the loop (i.e. with reloading and re-copying of the CL code). Everything works, but the memory... keeps growing! And only the application's memory: GPU RAM usage stays at about x*10 megabytes.
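
For reference, the release calls I tried look roughly like this minimal sketch (placeholder names, one call per object of each kind):

```c
/* Cleanup attempts; all names are placeholders. */
clReleaseMemObject(buf_input);   /* repeated for every cl_mem Xi */
clReleaseKernel(k1);             /* repeated for every cl_kernel k_i */
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
```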

  • Call `clFinish` within the `for` loop. Otherwise the queue is filled faster than it can be emptied, which may cause the high RAM usage. – ProjectPhysX May 08 '20 at 06:53
  • If I use ```clFinish(queue)```, calculation speed decreases significantly (from 480 us per iteration to 680 us per iteration). GPU load also drops from 100% to 50-70%. But with ```clFinish``` the behaviour of RAM usage changes: without this command RAM grows quickly and then stays flat until the ```for``` loop finishes; with ```clFinish``` it grows at a fixed rate (linearly). – Almazra May 08 '20 at 16:59
  • I tried using ```clReleaseCommandQueue``` after each macro-loop and ```clCreateCommandQueue``` before it: `for (int macro=0; macro < 100; macro++) { clCreateCommandQueue... for (long int micro=0; micro < 100000; micro++) { err = clEnqueueNDRangeKernel (... } clEnqueueReadBuffer(... clReleaseCommandQueue(queue); }` I hoped that ```clRelease``` would delete the queue and free CPU RAM like ```free(...)```. But RAM usage does not decrease after ```clRelease``` – Almazra May 08 '20 at 17:15
  • Help! Help! Help! – Almazra May 10 '20 at 09:48

1 Answer


Bingo! The &event argument was the main problem!

If I use

err |= clEnqueueNDRangeKernel(queue, kernel17, 2, NULL, tGlobal_k17, tLocal_k17, 0, NULL, &event);

host memory grows, but if I use

err |= clEnqueueNDRangeKernel(queue, kernel17, 2, NULL, tGlobal_k17, tLocal_k17, 0, NULL, NULL);

host memory is OK!
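
If the event is actually needed (e.g. for profiling or synchronization), a sketch of the fix, as I understand it, is to release the event every iteration; otherwise each enqueue allocates a new host-side event object that is never freed:

```c
cl_event event;
for (long int i = 0; i < 100000; i++) {
    err = clEnqueueNDRangeKernel(queue, kernel17, 2, NULL,
                                 tGlobal_k17, tLocal_k17, 0, NULL, &event);
    /* ... use the event here, e.g. clWaitForEvents(1, &event); ... */
    clReleaseEvent(event);  /* without this, every iteration leaks one event on the host */
}
```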