I'm using OpenCL through the C# Cloo interface and I'm running into some very frustrating issues when trying to get it running well in our product.
Without giving too much away, our product is a computer vision product which, thirty times a second, gets a 512x424 grid of pixel values from our camera. We want to do computations on those pixels to generate point clouds relative to certain objects in the scene.
What I'm currently doing, every time a new frame comes in, is the following:

1) Create a CommandQueue.
2) Create a read-only buffer for the input pixel values.
3) Create a zero-copy, write-only buffer for the output point values.
4) Pass in the matrices for doing the computation on the GPU.
5) Execute the kernel and wait for the result.
An example of the per-frame work is this:
// the command queue is the, well, queue of commands sent to the "device" (GPU)
ComputeCommandQueue commandQueue = new ComputeCommandQueue(
    _context,                       // the compute context
    _context.Devices[0],            // first device matching the context specifications
    ComputeCommandQueueFlags.None); // no special flags

// input: this frame's real-world points, read-only on the device and backed
// by the host array via UseHostPointer
Point3D[] realWorldPoints = points.Get(Perspective.RealWorld).Points;
ComputeBuffer<Point3D> realPointsBuffer = new ComputeBuffer<Point3D>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
    realWorldPoints);
_kernel.SetMemoryArgument(0, realPointsBuffer);

// output: the transformed points, zero-copy and backed by the host array
Point3D[] toPopulate = new Point3D[realWorldPoints.Length];
PointSet pointSet = points.Get(perspective);
ComputeBuffer<Point3D> resultBuffer = new ComputeBuffer<Point3D>(_context,
    ComputeMemoryFlags.UseHostPointer,
    toPopulate);
_kernel.SetMemoryArgument(1, resultBuffer);
// flatten the 3x3 part of the affine transform into a row-major float array
float[] M = new float[3 * 3];
ReferenceFrame referenceFrame =
    perspectives.ReferenceFrames[(int)Perspective.Floor];
AffineTransformation transform = referenceFrame.ToReferenceFrame;
M[0] = transform.M00;
M[1] = transform.M01;
M[2] = transform.M02;
M[3] = transform.M10;
M[4] = transform.M11;
M[5] = transform.M12;
M[6] = transform.M20;
M[7] = transform.M21;
M[8] = transform.M22;
ComputeBuffer<float> Mbuffer = new ComputeBuffer<float>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
    M);
_kernel.SetMemoryArgument(2, Mbuffer);
// the translation vector b of the affine transform
float[] b = new float[3];
b[0] = transform.b0;
b[1] = transform.b1;
b[2] = transform.b2;
ComputeBuffer<float> Bbuffer = new ComputeBuffer<float>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
    b);
_kernel.SetMemoryArgument(3, Bbuffer);
_kernel.SetValueArgument<int>(4, (int)Perspective.Floor);
//sw.Start();
// run the kernel: one work item per output point
commandQueue.Execute(_kernel,
    new long[] { 0 }, new long[] { toPopulate.Length }, null, null);

// blocking map/unmap of the zero-copy result buffer so the host array is
// guaranteed to hold the finished results before we use them
IntPtr retPtr = commandQueue.Map(
    resultBuffer,
    true,
    ComputeMemoryMappingFlags.Read,
    0,
    toPopulate.Length, null);
commandQueue.Unmap(resultBuffer, ref retPtr, null);
When profiling, the per-frame time is way too long, and roughly 90% of it is spent creating all the ComputeBuffer objects and doing the rest of the setup. The actual compute time on the GPU is as fast as can be.
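To give a sense of how I'm measuring it (this is a simplified sketch of the timing, not exact profiler output, and the Stopwatch names are just for illustration):

var setupTimer = System.Diagnostics.Stopwatch.StartNew();
// ... all of the ComputeBuffer creation and SetMemoryArgument calls above ...
setupTimer.Stop();

var kernelTimer = System.Diagnostics.Stopwatch.StartNew();
commandQueue.Execute(_kernel,
    new long[] { 0 }, new long[] { toPopulate.Length }, null, null);
commandQueue.Finish(); // make sure the kernel has actually completed
kernelTimer.Stop();

// setupTimer accounts for ~90% of the frame time; kernelTimer is tiny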
My question is: how do I fix this? The array of pixels that comes in is different for every frame, so right now I create a new ComputeBuffer for it every time. Our matrices can change periodically too, as we update the scene (again, I can't go into all the details). Is there a way to update those buffers in place, on the GPU? I'm using an Intel integrated GPU, so host and device share physical memory and that should theoretically be possible.
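For example, for the matrices, what I'd like is to create Mbuffer and Bbuffer once, up front, and then whenever the scene changes just refresh their contents with something like this (assuming Cloo's WriteToBuffer lets me overwrite an existing buffer's contents the way I'm hoping):

// Mbuffer and Bbuffer were created once at startup; only their contents change
commandQueue.WriteToBuffer(M, Mbuffer, false, null);
commandQueue.WriteToBuffer(b, Bbuffer, false, null);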
It's becoming frustrating because, time and time again, the speed gains I'm finding on the GPU are swamped by the overhead of setting everything up for every frame.
Edit 1:
I don't think my original code examples really illustrated what I'm doing well enough, so I created a real-world, working example and posted it on GitHub here.
Due to legacy and time constraints, I'm not able to change much of the overarching architecture of our current product. I'm trying to "drop in" GPU code in certain parts that are slow in order to speed them up. It's possible this simply isn't feasible given the constraints I'm working under. However, let me better explain what I'm doing.
I'll give the code below, but I'll be referring to the "ComputePoints" function in the "GPUComputePoints" class.
As you can see from the signature of my ComputePoints function, a CameraFrame is passed in each time, along with the transformation matrices M and b:
public static Point3D[] ComputePoints(CameraFrame frame, float[] M, float[] b)
These are new arrays generated by our pipeline each frame, not arrays I can keep around and reuse, so I create a new ComputeBuffer for each one:
// input: the raw camera pixels for this frame, copied to the device
ComputeBuffer<ushort> inputBuffer = new ComputeBuffer<ushort>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
    frame.RawData);
_kernel.SetMemoryArgument(0, inputBuffer);

// output: one Point3D per pixel, backed by the host array
Point3D[] ret = new Point3D[frame.Width * frame.Height];
ComputeBuffer<Point3D> outputBuffer = new ComputeBuffer<Point3D>(_context,
    ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.UseHostPointer,
    ret);
_kernel.SetMemoryArgument(1, outputBuffer);

// the transformation matrix M and translation vector b
ComputeBuffer<float> mBuffer = new ComputeBuffer<float>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
    M);
_kernel.SetMemoryArgument(2, mBuffer);
ComputeBuffer<float> bBuffer = new ComputeBuffer<float>(_context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
    b);
_kernel.SetMemoryArgument(3, bBuffer);
...and therein, I believe, lies the drain on performance. It was suggested that I use the map/unmap functionality to get around this, but I fail to see how that helps: I'd still need to create the buffers every time to wrap the new arrays being passed in, right?
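The only alternative I can picture is something like the sketch below, where the buffers become long-lived fields sized once for a 512x424 frame and only their contents are touched each frame. To be clear, this is just my guess at the intended pattern; the _queue, _inputBuffer, and _outputBuffer fields and the WriteToBuffer/ReadFromBuffer usage are mine, not tested code:

// done once, e.g. in the GPUComputePoints constructor:
_inputBuffer = new ComputeBuffer<ushort>(_context,
    ComputeMemoryFlags.ReadOnly, 512 * 424);
_outputBuffer = new ComputeBuffer<Point3D>(_context,
    ComputeMemoryFlags.WriteOnly, 512 * 424);
_kernel.SetMemoryArgument(0, _inputBuffer);
_kernel.SetMemoryArgument(1, _outputBuffer);

// done every frame: copy the new pixels in, run the kernel, copy the points out
_queue.WriteToBuffer(frame.RawData, _inputBuffer, false, null);
_queue.Execute(_kernel,
    new long[] { 0 }, new long[] { frame.Width * frame.Height }, null, null);
Point3D[] ret = new Point3D[frame.Width * frame.Height];
_queue.ReadFromBuffer(_outputBuffer, ref ret, true, null);

If that's the idea, I can see how it avoids re-creating the buffers, but it still seems to cost two copies per frame. Is the map/unmap suggestion meant to replace those copies on shared-memory hardware? That's the part I'm not following.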