I'm experimenting with OpenCL (through Cloo's C# interface), using the customary matrix multiplication on the GPU as a test case. The problem is that the application crashes during my speed tests. I'm trying to be efficient about re-allocating the various OpenCL objects, and I'm wondering if I'm botching something in doing so.

I'll put the code in this question, but for a bigger picture, you can get the code from github here: https://github.com/kwende/ClooMatrixMultiply

My main program does this:

        Stopwatch gpuSw = new Stopwatch();
        gpuSw.Start();
        for (int c = 0; c < NumberOfIterations; c++)
        {
            float[] result = gpu.MultiplyMatrices(matrix1, matrix2, MatrixHeight, MatrixHeight, MatrixWidth);
        }
        gpuSw.Stop();

So I'm basically making the call NumberOfIterations times and timing the whole loop to get the average execution time.

Within the MultiplyMatrices call, the first time through, I call Initialize to set up all the objects I'm going to reuse:

    private void Initialize()
    {
        // get the intel integrated GPU
        _integratedIntelGPUPlatform = ComputePlatform.Platforms.Where(n => n.Name.Contains("Intel")).First();

        // create the compute context. 
        _context = new ComputeContext(
            ComputeDeviceTypes.Gpu, // use the gpu
            new ComputeContextPropertyList(_integratedIntelGPUPlatform), // use the intel openCL platform
            null,
            IntPtr.Zero);

        // the command queue is the, well, queue of commands sent to the "device" (GPU)
        _commandQueue = new ComputeCommandQueue(
            _context, // the compute context
            _context.Devices[0], // first device matching the context specifications
            ComputeCommandQueueFlags.None); // no special flags

        string kernelSource = null;
        using (StreamReader sr = new StreamReader("kernel.cl"))
        {
            kernelSource = sr.ReadToEnd();
        }

        // create the "program"
        _program = new ComputeProgram(_context, new string[] { kernelSource });

        // compile. 
        _program.Build(null, null, null, IntPtr.Zero);
        _kernel = _program.CreateKernel("ComputeMatrix");
    }

I then enter the main body of my function (the part that will be executed NumberOfIterations times).

        ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix1);
        _kernel.SetMemoryArgument(0, matrix1Buffer);

        ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix2);
        _kernel.SetMemoryArgument(1, matrix2Buffer);

        float[] ret = new float[matrix1Height * matrix2Width];
        ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.CopyHostPointer,
            ret);
        _kernel.SetMemoryArgument(2, retBuffer);

        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);

        _commandQueue.Execute(_kernel,
            new long[] { 0 },
            new long[] { matrix2Width ,matrix1Height },
            null, null);

        unsafe
        {
            fixed (float* retPtr = ret)
            {
                _commandQueue.Read(retBuffer,
                    false, 0,
                    ret.Length,
                    new IntPtr(retPtr),
                    null);

                _commandQueue.Finish();
            }
        }

The third or fourth time through (it's somewhat random, which hints at memory access issues), the program crashes. Here is my kernel (I'm sure there are faster implementations, but right now my goal is just to get something working without blowing up):

    kernel void ComputeMatrix(
        global read_only float* matrix1,
        global read_only float* matrix2,
        global write_only float* output,
        int matrix1WidthMatrix2Height,
        int matrix2Width)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int i = y * matrix2Width + x;

        float value = 0.0f;
        // row y of matrix1 * column x of matrix2
        for (int c = 0; c < matrix1WidthMatrix2Height; c++)
        {
            int m1Index = y * matrix1WidthMatrix2Height + c;
            int m2Index = c * matrix2Width + x;

            value += matrix1[m1Index] * matrix2[m2Index];
        }
        output[i] = value;
    }
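Once the kernel runs without crashing, a plain CPU version that uses the same row-major indexing is a handy way to check its output. The following is only a minimal sketch: the `CpuReference` class and its `Multiply` method are hypothetical names, not part of the linked repository, and the parameter order simply mirrors the `gpu.MultiplyMatrices` call above.

```csharp
static class CpuReference
{
    // Hypothetical helper: same indexing as the ComputeMatrix kernel,
    // with the two outer loops standing in for get_global_id(1) and (0).
    public static float[] Multiply(
        float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        float[] output = new float[matrix1Height * matrix2Width];

        for (int y = 0; y < matrix1Height; y++)
        {
            for (int x = 0; x < matrix2Width; x++)
            {
                float value = 0.0f;
                // row y of matrix1 * column x of matrix2
                for (int c = 0; c < matrix1WidthMatrix2Height; c++)
                {
                    value += matrix1[y * matrix1WidthMatrix2Height + c]
                           * matrix2[c * matrix2Width + x];
                }
                output[y * matrix2Width + x] = value;
            }
        }
        return output;
    }
}
```

When comparing against the GPU result, a small tolerance is safer than exact equality, since the device may round intermediate sums differently.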

Ultimately the goal here is to better understand the zero-copy features of OpenCL (since I'm using Intel's integrated GPU). I have been having trouble getting it to work and so wanted to step back a bit to see if I understood even more basic things...apparently I don't as I can't get even this to work without blowing up.

The only other thing I can think of is that it's how I'm pinning the pointer to pass it to the .Read() function, but I don't know of an alternative.

Edit:

For what it's worth, I updated the last part of code (the read code) to this, and it still crashes:

    _commandQueue.ReadFromBuffer(retBuffer, ref ret, false, null);
    _commandQueue.Finish();

Edit #2

Solution found by huseyin tugrul buyukisik (see comment below).

Upon placing

    matrix1Buffer.Dispose();
    matrix2Buffer.Dispose();
    retBuffer.Dispose();

at the end of the function, it all worked fine.
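The same release can be made automatic with `using` blocks, so the buffers are freed even if an exception is thrown partway through. This is a rough sketch rather than the code from the repository: `GpuMultiply.MultiplyOnce` is a hypothetical helper, it needs the Cloo package and an OpenCL device, and it assumes the context, queue and kernel were created as in Initialize. Only Cloo calls that already appear in the question are used.

```csharp
using System;
using Cloo;

static class GpuMultiply
{
    // Hypothetical helper: one multiplication, with every per-call buffer
    // disposed when the using blocks exit (normally or via an exception).
    public static float[] MultiplyOnce(
        ComputeContext context, ComputeCommandQueue queue, ComputeKernel kernel,
        float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        float[] ret = new float[matrix1Height * matrix2Width];

        using (var matrix1Buffer = new ComputeBuffer<float>(context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, matrix1))
        using (var matrix2Buffer = new ComputeBuffer<float>(context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, matrix2))
        using (var retBuffer = new ComputeBuffer<float>(context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.CopyHostPointer, ret))
        {
            kernel.SetMemoryArgument(0, matrix1Buffer);
            kernel.SetMemoryArgument(1, matrix2Buffer);
            kernel.SetMemoryArgument(2, retBuffer);
            kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
            kernel.SetValueArgument<int>(4, matrix2Width);

            // A null offset starts at 0 in both dimensions. If an offset array
            // is passed instead, OpenCL expects one entry per dimension.
            queue.Execute(kernel, null,
                new long[] { matrix2Width, matrix1Height }, null, null);

            // Blocking read: ret is filled by the time this call returns,
            // so no separate Finish() is needed before disposing the buffers.
            queue.ReadFromBuffer(retBuffer, ref ret, true, null);
        }
        return ret;
    }
}
```

The main loop would then call `GpuMultiply.MultiplyOnce(_context, _commandQueue, _kernel, ...)` once per iteration, and no explicit Dispose calls are needed.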

  • One possibility is that you are re-creating buffer resources without disposing of them, which exceeds OpenCL limits and crashes. Another possibility: ret.Length may need to be in bytes, i.e. multiplied by sizeof(float) or better sizeof(cl_float). OpenCL resources need to be disposed. If you re-use them instead of destroying them (when out of scope), you don't have to set the arguments again and again. If their scope is only `gpu.MultiplyMatrices(`, you should move the creation of the buffers to the init part. Is it OpenCL 2.0 or 1.2? – huseyin tugrul buyukisik Feb 16 '17 at 19:54
  • You nailed it, my friend. I put Dispose calls at the end of the function to release the buffers and it fixed it. It turns out the GC isn't keeping up with the load on the GPU memory-wise. –  Feb 16 '17 at 19:59
  • GC should not be trusted. There must be some `using(){}` implementation or explicit de-allocation. Maybe C# is more reliable, but Java had issues. – huseyin tugrul buyukisik Feb 16 '17 at 20:00
  • Great. I'm seeing that now. I gave you credit in the answer above :) Thanks again, my friend. –  Feb 16 '17 at 20:02
  • Also, making the buffer read/write blocking (by replacing the false parameter with true) may be faster than adding a Finish command after it. – huseyin tugrul buyukisik Feb 16 '17 at 20:11

1 Answer

OpenCL resources such as buffers, kernels and command queues should be released after the other resources they are bound to have been released. Re-creating them without releasing the old ones quickly depletes the available slots.

You were re-creating the buffers inside a method of gpu, and that method was the scope of the OpenCL buffers. When it finishes, the GC cannot track OpenCL's unmanaged memory areas, so that memory leaks, which leads to the crashes.

Many OpenCL implementations use C++ bindings, which need explicit release commands from C#, Java and other environments.

Also, the set-argument calls do not need to be repeated when repeated kernel executions use exactly the same buffers, in the same order, as the kernel's parameters.
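As a rough illustration of both points (create once, set arguments once, release explicitly), the buffers can live in a small disposable wrapper. This is a sketch under assumptions, not a definitive implementation: `ReusableMultiplier` is a hypothetical class, the matrix shape is assumed fixed across calls, and it relies on Cloo's `ComputeBuffer<T>(context, flags, count)` constructor and `WriteToBuffer`, which I believe exist but which do not appear in the question, so check them against your Cloo version.

```csharp
using System;
using Cloo;

// Hypothetical wrapper: buffers are created once for a fixed matrix shape,
// kernel arguments are set once, and everything is released in Dispose().
sealed class ReusableMultiplier : IDisposable
{
    private readonly ComputeCommandQueue _queue;
    private readonly ComputeKernel _kernel;
    private readonly ComputeBuffer<float> _matrix1Buffer, _matrix2Buffer, _retBuffer;
    private readonly long[] _globalSize;
    private readonly int _resultLength;

    public ReusableMultiplier(
        ComputeContext context, ComputeCommandQueue queue, ComputeKernel kernel,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        _queue = queue;
        _kernel = kernel;
        _resultLength = matrix1Height * matrix2Width;
        _globalSize = new long[] { matrix2Width, matrix1Height };

        // Assumption: this constructor allocates 'count' elements on the
        // device without copying any host data.
        _matrix1Buffer = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly,
            (long)matrix1Height * matrix1WidthMatrix2Height);
        _matrix2Buffer = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly,
            (long)matrix1WidthMatrix2Height * matrix2Width);
        _retBuffer = new ComputeBuffer<float>(context, ComputeMemoryFlags.WriteOnly,
            _resultLength);

        // The same buffers are bound for every execution, so set them once.
        _kernel.SetMemoryArgument(0, _matrix1Buffer);
        _kernel.SetMemoryArgument(1, _matrix2Buffer);
        _kernel.SetMemoryArgument(2, _retBuffer);
        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);
    }

    public float[] Multiply(float[] matrix1, float[] matrix2)
    {
        // Blocking transfers keep the managed arrays valid for the whole copy.
        _queue.WriteToBuffer(matrix1, _matrix1Buffer, true, null);
        _queue.WriteToBuffer(matrix2, _matrix2Buffer, true, null);
        _queue.Execute(_kernel, null, _globalSize, null, null);

        float[] ret = new float[_resultLength];
        _queue.ReadFromBuffer(_retBuffer, ref ret, true, null);
        return ret;
    }

    public void Dispose()
    {
        _matrix1Buffer.Dispose();
        _matrix2Buffer.Dispose();
        _retBuffer.Dispose();
    }
}
```

Wrapping the timing loop in `using (var m = new ReusableMultiplier(...)) { ... }` then releases the three buffers exactly once, after the last iteration. This still copies data in both directions on every call; it only removes the per-call allocation.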

huseyin tugrul buyukisik
  • How familiar are you with the notion of zero-copy? Ultimately what I'm trying to do is prevent copying buffers around (I'm using the Intel Integrated GPU, and so the GPU and "host" CPU share the same address space). What needs to be changed above to support that? Will I still need to release the resources just the same? –  Feb 16 '17 at 20:10
  • map/unmap is zero-copy (with use_host_ptr) as far as I know. You are making buffer copies. Also, you should query the device parameters to find out whether it has its own memory or really shares CPU memory. – huseyin tugrul buyukisik Feb 16 '17 at 20:12
  • Okay. I will experiment with that and perhaps make another StackOverflow question if I cannot get it working. I've found a number of examples, but they're quite difficult to read. I'll most likely just make a new post so I keep this one on track. You've been very helpful with my last couple StackOverflow questions. I really appreciate it. –  Feb 16 '17 at 20:13
  • Mapping is usually harder than copying, but it is faster for streaming scenarios, especially with use_host_ptr and a host pointer aligned to the device's alignment value, such as a multiple of 4096. – huseyin tugrul buyukisik Feb 16 '17 at 20:14