I'm using OpenCL through the C# Cloo interface and I'm running into some very frustrating issues when trying to get it running well in our product.
Without giving too much away, our product is a computer vision product which, thirty times a second, gets a 512x424 grid of pixel values from our camera. We want to do computations on those pixels to generate point clouds relative to certain objects in the scene.
What I'm currently doing, every time a new frame comes in, is the following:

1) Create a CommandQueue.
2) Create a read-only buffer for the input pixel values.
3) Create a zero-copy, write-only buffer for the output point values.
4) Pass in the matrices for doing the computation on the GPU.
5) Execute the kernel and wait for the result.
An example of the per-frame work is this:
// the command queue is the, well, queue of commands sent to the "device" (GPU)
ComputeCommandQueue commandQueue = new ComputeCommandQueue(
    _context,                       // the compute context
    _context.Devices[0],            // first device matching the context specifications
    ComputeCommandQueueFlags.None); // no special flags

// input: this frame's real-world points, read-only on the device and backed
// by the host array via UseHostPointer
Point3D[] realWorldPoints = points.Get(Perspective.RealWorld).Points;
ComputeBuffer<Point3D> realPointsBuffer = new ComputeBuffer<Point3D>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
    realWorldPoints);
_kernel.SetMemoryArgument(0, realPointsBuffer);

// output: the transformed points, zero-copy and backed by the host array
Point3D[] toPopulate = new Point3D[realWorldPoints.Length];
PointSet pointSet = points.Get(perspective);
ComputeBuffer<Point3D> resultBuffer = new ComputeBuffer<Point3D>(_context,
    ComputeMemoryFlags.UseHostPointer,
    toPopulate);
_kernel.SetMemoryArgument(1, resultBuffer);
// flatten the 3x3 part of the affine transform into a row-major float array
float[] M = new float[3 * 3];
ReferenceFrame referenceFrame =
    perspectives.ReferenceFrames[(int)Perspective.Floor];
AffineTransformation transform = referenceFrame.ToReferenceFrame;
M[0] = transform.M00;
M[1] = transform.M01;
M[2] = transform.M02;
M[3] = transform.M10;
M[4] = transform.M11;
M[5] = transform.M12;
M[6] = transform.M20;
M[7] = transform.M21;
M[8] = transform.M22;
ComputeBuffer<float> Mbuffer = new ComputeBuffer<float>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
    M);
_kernel.SetMemoryArgument(2, Mbuffer);
// the translation vector b of the affine transform
float[] b = new float[3];
b[0] = transform.b0;
b[1] = transform.b1;
b[2] = transform.b2;
ComputeBuffer<float> Bbuffer = new ComputeBuffer<float>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
    b);
_kernel.SetMemoryArgument(3, Bbuffer);
_kernel.SetValueArgument<int>(4, (int)Perspective.Floor);
//sw.Start();
// run the kernel: one work item per output point
commandQueue.Execute(_kernel,
    new long[] { 0 }, new long[] { toPopulate.Length }, null, null);

// blocking map/unmap of the zero-copy result buffer so the host array is
// guaranteed to hold the finished results before we use them
IntPtr retPtr = commandQueue.Map(
    resultBuffer,
    true,
    ComputeMemoryMappingFlags.Read,
    0,
    toPopulate.Length, null);
commandQueue.Unmap(resultBuffer, ref retPtr, null);
When profiling, the per-frame time is way too long, and roughly 90% of it is spent creating all the ComputeBuffer objects and doing the rest of the setup. The actual compute time on the GPU is as fast as can be.
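To give a sense of how I'm measuring it (this is a simplified sketch of the timing, not exact profiler output, and the Stopwatch names are just for illustration):

var setupTimer = System.Diagnostics.Stopwatch.StartNew();
// ... all of the ComputeBuffer creation and SetMemoryArgument calls above ...
setupTimer.Stop();

var kernelTimer = System.Diagnostics.Stopwatch.StartNew();
commandQueue.Execute(_kernel,
    new long[] { 0 }, new long[] { toPopulate.Length }, null, null);
commandQueue.Finish(); // make sure the kernel has actually completed
kernelTimer.Stop();

// setupTimer accounts for ~90% of the frame time; kernelTimer is tiny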
My question is: how do I fix this? The array of pixels that comes in is different for every frame, so right now I create a new ComputeBuffer for it every time. Our matrices can change periodically too, as we update the scene (again, I can't go into all the details). Is there a way to update those buffers in place, on the GPU? I'm using an Intel integrated GPU, so host and device share physical memory and that should theoretically be possible.
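For example, for the matrices, what I'd like is to create Mbuffer and Bbuffer once, up front, and then whenever the scene changes just refresh their contents with something like this (assuming Cloo's WriteToBuffer lets me overwrite an existing buffer's contents the way I'm hoping):

// Mbuffer and Bbuffer were created once at startup; only their contents change
commandQueue.WriteToBuffer(M, Mbuffer, false, null);
commandQueue.WriteToBuffer(b, Bbuffer, false, null);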
It's becoming frustrating because, time and time again, the speed gains I'm finding on the GPU are swamped by the overhead of setting everything up for every frame.
Edit 1:
I don't think my original code examples really illustrated what I'm doing well enough, so I created a real-world, working example and posted it on GitHub here.
Due to legacy and time constraints, I'm not able to change much of the overarching architecture of our current product. I'm trying to "drop in" GPU code in certain parts that are slow in order to speed them up. It's possible this simply isn't feasible given the constraints I'm working under. However, let me better explain what I'm doing.
I'll give the code below, but I'll be referring to the "ComputePoints" function in the "GPUComputePoints" class.
As you can see from the signature of my ComputePoints function, a CameraFrame is passed in each time, along with the transformation matrices M and b:
public static Point3D[] ComputePoints(CameraFrame frame, float[] M, float[] b)
These are new arrays generated by our pipeline each frame, not arrays I can keep around and reuse, so I create a new ComputeBuffer for each one:
// input: the raw camera pixels for this frame, copied to the device
ComputeBuffer<ushort> inputBuffer = new ComputeBuffer<ushort>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
    frame.RawData);
_kernel.SetMemoryArgument(0, inputBuffer);

// output: one Point3D per pixel, backed by the host array
Point3D[] ret = new Point3D[frame.Width * frame.Height];
ComputeBuffer<Point3D> outputBuffer = new ComputeBuffer<Point3D>(_context,
    ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.UseHostPointer,
    ret);
_kernel.SetMemoryArgument(1, outputBuffer);

// the transformation matrix M and translation vector b
ComputeBuffer<float> mBuffer = new ComputeBuffer<float>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
    M);
_kernel.SetMemoryArgument(2, mBuffer);
ComputeBuffer<float> bBuffer = new ComputeBuffer<float>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
    b);
_kernel.SetMemoryArgument(3, bBuffer);
...and therein, I believe, lies the drain on performance. It was suggested that I use the map/unmap functionality to get around this, but I fail to see how that helps: I'd still need to create the buffers every time to wrap the new arrays being passed in, right?
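The only alternative I can picture is something like the sketch below, where the buffers become long-lived fields sized once for a 512x424 frame and only their contents are touched each frame. To be clear, this is just my guess at the intended pattern; the _queue, _inputBuffer, and _outputBuffer fields and the WriteToBuffer/ReadFromBuffer usage are mine, not tested code:

// done once, e.g. in the GPUComputePoints constructor:
_inputBuffer = new ComputeBuffer<ushort>(_context,
    ComputeMemoryFlags.ReadOnly, 512 * 424);
_outputBuffer = new ComputeBuffer<Point3D>(_context,
    ComputeMemoryFlags.WriteOnly, 512 * 424);
_kernel.SetMemoryArgument(0, _inputBuffer);
_kernel.SetMemoryArgument(1, _outputBuffer);

// done every frame: copy the new pixels in, run the kernel, copy the points out
_queue.WriteToBuffer(frame.RawData, _inputBuffer, false, null);
_queue.Execute(_kernel,
    new long[] { 0 }, new long[] { frame.Width * frame.Height }, null, null);
Point3D[] ret = new Point3D[frame.Width * frame.Height];
_queue.ReadFromBuffer(_outputBuffer, ref ret, true, null);

If that's the idea, I can see how it avoids re-creating the buffers, but it still seems to cost two copies per frame. Is the map/unmap suggestion meant to replace those copies on shared-memory hardware? That's the part I'm not following.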