There is a program with about 30 kernels that operate on shared data (cl_mem buffers) and are called in a fixed sequence inside a loop a large number of times (for example, 100,000). The amount of data transferred to the video card is about x*10 MB, but the application's memory allocation (host RAM usage) reaches several GB (both in Task Manager and in the Visual Studio memory profiler).
The algorithm is as follows:
- OpenCL initialization (device, context, queue, program)
- loading the source data into RAM
- allocation of a large number of buffers for intermediate data:
  cl_mem Xi = clCreateBuffer (...
- copying the source data from RAM into the buffers with
  clEnqueueWriteBuffer (...
- kernel creation:
  cl_kernel k_i = clCreateKernel (...
- setting the arguments for each kernel:
  clSetKernelArg (...
- choosing global and local work sizes for each kernel
- running a loop with the sequence of kernels via clEnqueueNDRangeKernel (...
  for (long int i=0; i<100000; i++) {
      err = clEnqueueNDRangeKernel (...
      err |= clEnqueueNDRangeKernel (...
      ...
  }
- reading back the results:
  clEnqueueReadBuffer(
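For reference, the flow above condenses to host code roughly like this. This is a minimal sketch, not the real program: the kernel name "k0", the buffer names, and the work sizes are placeholders standing in for the ~30 actual kernels and buffers.

```c
#include <CL/cl.h>
#include <stdlib.h>

/* Sketch of the setup + loop described above.
   All names (k0, src, X, Y) are placeholders. */
void run(cl_context ctx, cl_command_queue q, cl_program prog,
         const float *src, size_t X, size_t Y)
{
    cl_int err;
    size_t bytes = X * Y * sizeof(float);

    /* Intermediate buffers are created once, in device memory only
       (no host pointer, no CL_MEM_USE_HOST_PTR). */
    cl_mem buf_in  = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    cl_mem buf_out = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    /* One-time upload of the source data. */
    err = clEnqueueWriteBuffer(q, buf_in, CL_TRUE, 0, bytes, src,
                               0, NULL, NULL);

    /* Kernels and their arguments are also set once, outside the loop. */
    cl_kernel k0 = clCreateKernel(prog, "k0", &err);
    clSetKernelArg(k0, 0, sizeof(cl_mem), &buf_in);
    clSetKernelArg(k0, 1, sizeof(cl_mem), &buf_out);

    size_t gws[1] = { X * Y };  /* global work size */
    for (long int i = 0; i < 100000; i++) {
        err = clEnqueueNDRangeKernel(q, k0, 1, NULL, gws, NULL,
                                     0, NULL, NULL);
        /* ... the remaining ~30 kernels enqueued in sequence ... */
    }

    /* Read the result back once the loop is done (blocking read). */
    float *result = (float *)malloc(bytes);
    err = clEnqueueReadBuffer(q, buf_out, CL_TRUE, 0, bytes, result,
                              0, NULL, NULL);
    free(result);

    clReleaseKernel(k0);
    clReleaseMemObject(buf_in);
    clReleaseMemObject(buf_out);
}
```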
OpenCL computes everything correctly and the expected result is obtained. But host (CPU) RAM usage grows abnormally, out of all proportion to the amount of data being processed.
I don't understand it. I call cl_mem Xi = clCreateBuffer (...
only once; I don't create buffers inside the loop, and I create these buffers in GPU memory only:
cl_mem buf_input = clCreateBuffer(context, CL_MEM_READ_WRITE, X * Y * sizeof(float), NULL, &err);
Why is memory usage on the host (CPU RAM) so high?
I tried various clRelease...(...) calls (clReleaseMemObject, clReleaseKernel, etc.). I also tried restructuring so that the program re-enters the full setup on each pass of the loop (i.e. re-loading and re-copying the CL code). Everything still works, but the memory keeps growing! And only the application's (host) memory: GPU RAM usage stays around x*10 megabytes.
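The cleanup I tried was along these lines. This is a sketch only: the array names and counts are placeholders for the ~30 actual kernels and intermediate buffers, and the clFinish call is my assumption about draining the queue before releasing, not something the OpenCL spec requires here.

```c
#include <CL/cl.h>

/* Sketch of the release calls tried after the loop; names are placeholders. */
void cleanup(cl_command_queue q, cl_context ctx, cl_program prog,
             cl_kernel *kernels, int n_kernels,
             cl_mem *bufs, int n_bufs)
{
    clFinish(q);  /* assumption: wait for all enqueued work before releasing */

    for (int i = 0; i < n_kernels; i++)
        clReleaseKernel(kernels[i]);     /* clReleaseKernel(...) */
    for (int i = 0; i < n_bufs; i++)
        clReleaseMemObject(bufs[i]);     /* clReleaseMemObject(...) */

    clReleaseProgram(prog);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
}
```

Even with all of these in place, the host-side memory growth during the loop itself is unchanged.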