
There is a program with about 30 kernels, which operate on common data (cl_mem buffers) and are called in a fixed sequence inside a loop a large number of times (for example, 100000 iterations). The amount of data transferred to the video card is about x*10 MB, but the application's memory allocation (RAM usage) reaches several GB (both in Task Manager and in the memory usage view in Visual Studio).

The algorithm is as follows:

  • initialization of OpenCL (device, context, queue, program)
  • loading source data into RAM
  • allocation of a large number of buffers for intermediate data: cl_mem Xi = clCreateBuffer (...
  • copying the source data from RAM into the buffers with clEnqueueWriteBuffer (...
  • kernel creation: cl_kernel k_i = clCreateKernel (...
  • setting arguments for each kernel: clSetKernelArg (...
  • choosing global and local sizes for each kernel
  • starting a loop with a sequence of kernels executed via clEnqueueNDRangeKernel (...

for (long int i = 0; i < 100000; i++) {
    err = clEnqueueNDRangeKernel (...
    err |= clEnqueueNDRangeKernel (...
    ...
}

  • reading back the results with clEnqueueReadBuffer(...
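
The steps above, condensed into a sketch (every identifier here is a placeholder, not the real name from my code):

```c
/* Condensed sketch of the setup described above; all names are placeholders. */
cl_int err;
cl_context       context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue   = clCreateCommandQueue(context, device, 0, &err);
cl_program       program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

/* intermediate buffers, allocated once, in device memory only */
cl_mem X1 = clCreateBuffer(context, CL_MEM_READ_WRITE, X * Y * sizeof(float), NULL, &err);

/* upload source data once */
err = clEnqueueWriteBuffer(queue, X1, CL_TRUE, 0, X * Y * sizeof(float),
                           host_data, 0, NULL, NULL);

/* kernels and arguments, created and set once */
cl_kernel k1 = clCreateKernel(program, "kernel_1", &err);
err = clSetKernelArg(k1, 0, sizeof(cl_mem), &X1);

/* main loop */
for (long int i = 0; i < 100000; i++) {
    err = clEnqueueNDRangeKernel(queue, k1, 2, NULL, global, local, 0, NULL, NULL);
    /* ... the other ~30 kernels in sequence ... */
}

/* read back the results */
err = clEnqueueReadBuffer(queue, X1, CL_TRUE, 0, X * Y * sizeof(float),
                          host_out, 0, NULL, NULL);
```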

OpenCL computes everything and the expected result is obtained. But CPU RAM usage grows abnormally, disproportionately to the amount of processed data.

I don't understand it. I call cl_mem Xi = clCreateBuffer (... only once, I don't create buffers in each iteration, and I create these buffers only in GPU memory:

cl_mem buf_input = clCreateBuffer(context, CL_MEM_READ_WRITE, X * Y * sizeof(float), NULL, &err);

Why is the memory usage on the host (CPU RAM) so high?

I tried various clRelease... calls (clReleaseMemObject, clReleaseKernel, etc.). I also tried making the program fully re-enter the loop (i.e. with reloading and re-copying of the CL code). Everything works, but the memory... keeps growing! And only the application's memory: GPU RAM usage stays at about x*10 megabytes.
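
For reference, the release calls I tried look roughly like this minimal sketch (placeholder names, one call per object of each kind):

```c
/* Cleanup attempts; all names are placeholders. */
clReleaseMemObject(buf_input);   /* repeated for every cl_mem Xi */
clReleaseKernel(k1);             /* repeated for every cl_kernel k_i */
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
```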

  • Call `clFinish` within the `for` loop. Otherwise the queue is filled faster than it can be emptied, which may cause the high RAM usage. – ProjectPhysX May 08 '20 at 06:53
  • If I use ```clFinish(queue)```, calculation speed decreases significantly (from 480 us per iteration to 680 us per iteration). GPU load also drops from 100% to 50-70%. But with ```clFinish``` the behaviour of RAM usage changes: without this command RAM grows quickly and then stays flat until the ```for``` loop finishes; with ```clFinish``` it grows at a fixed rate (linearly). – Almazra May 08 '20 at 16:59
  • I tried using ```clReleaseCommandQueue``` after each macro-loop and ```clCreateCommandQueue``` before it: `for (int macro=0; macro < 100; macro++) { clCreateCommandQueue... for (long int micro=0; micro < 100000; micro++) { err = clEnqueueNDRangeKernel (... } clEnqueueReadBuffer(... clReleaseCommandQueue(queue); }` I hoped that ```clRelease``` would delete the queue and free CPU RAM like ```free(...)```. But RAM usage does not decrease after ```clRelease``` – Almazra May 08 '20 at 17:15
  • Help! Help! Help! – Almazra May 10 '20 at 09:48

1 Answer


Bingo! The &event argument was the main problem!

If I use

err |= clEnqueueNDRangeKernel(queue, kernel17, 2, NULL, tGlobal_k17, tLocal_k17, 0, NULL, &event);

host memory grows, but if I use

err |= clEnqueueNDRangeKernel(queue, kernel17, 2, NULL, tGlobal_k17, tLocal_k17, 0, NULL, NULL);

host memory is OK!
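
If the event is actually needed (e.g. for profiling or synchronization), a sketch of the fix, as I understand it, is to release the event every iteration; otherwise each enqueue allocates a new host-side event object that is never freed:

```c
cl_event event;
for (long int i = 0; i < 100000; i++) {
    err = clEnqueueNDRangeKernel(queue, kernel17, 2, NULL,
                                 tGlobal_k17, tLocal_k17, 0, NULL, &event);
    /* ... use the event here, e.g. clWaitForEvents(1, &event); ... */
    clReleaseEvent(event);  /* without this, every iteration leaks one event on the host */
}
```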