
I started playing with the Alea GPU library for C# and I'm having great fun working with CUDA in a familiar environment. However, I've run into an issue that I can't solve easily.

So I have this small portion of code written using Alea GPU:

    // Shift every point so that the minimum coordinates become the origin.
    Alea.Parallel.GpuExtension.For(gpu, 0, Points.Count, i =>
    {
        xComponent[i] = xComponent[i] - minX;
        yComponent[i] = yComponent[i] - minY;
        zComponent[i] = zComponent[i] - minZ;
    });

Its trivial CPU counterpart in C# uses Parallel.For with the same code block operating on the components (see the sketch below). For reference, Points.Count is around 2.7 million and I'm running this code on a GeForce GT 635M.
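For comparison, a minimal sketch of that CPU counterpart, assuming the same component arrays and minimum values as in the GPU version above:

    // CPU version using System.Threading.Tasks.Parallel.For over the same arrays.
    System.Threading.Tasks.Parallel.For(0, Points.Count, i =>
    {
        xComponent[i] = xComponent[i] - minX;
        yComponent[i] = yComponent[i] - minY;
        zComponent[i] = zComponent[i] - minZ;
    });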

I started comparing the performance of these two approaches and noticed unexpected behavior. On the first run, the code posted above was nearly 10 times slower than its CPU Parallel.For counterpart. Every subsequent run behaved as expected and was faster than the C# code.

I suspect some kind of lazy compilation (similar to lazy loading) is performed on the CUDA code, and that the time measured for the first run also includes the actual compilation time. Is there a simple way to force precompilation of this code? I noticed that kernels can be compiled ahead of time, but I would prefer to keep my code simple and stick to the Alea.Parallel.GpuExtension.For loop.

Konrad

1 Answer


As far as I know, it is probably a mixture of the GPU waking up and JIT compilation. If you are going to execute that kernel many times, a single slow launch might not matter to you. I'm not familiar with that GPU library, but you might want to compile for several GPU architectures ahead of time so that your binary does not have to be recompiled at run time. You could also run a small kernel before this one to initialize and warm up the GPU.
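A minimal warm-up sketch, assuming the same gpu instance and the automatic memory management shown in the question, could look like the following. Note that this mainly covers device and context initialization; the real delegate may still be JIT-compiled the first time it runs.

    // Warm-up: launch a trivial kernel once so that device initialization
    // is not charged to the first timed run of the real kernel.
    var warmup = new double[1];
    Alea.Parallel.GpuExtension.For(gpu, 0, warmup.Length, i =>
    {
        warmup[i] = 0.0;
    });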

Edit: I found this example on the Alea GPU webpage.

aram