I started playing with the Alea GPU library for C# and I'm having great fun working with CUDA in a familiar environment. However, I've run into an issue that I can't solve easily.
I have this small piece of code written using Alea GPU:
Alea.Parallel.GpuExtension.For(gpu, 0, Points.Count, i =>
{
    xComponent[i] = xComponent[i] - minX;
    yComponent[i] = yComponent[i] - minY;
    zComponent[i] = zComponent[i] - minZ;
});
Its trivial C# counterpart uses Parallel.For with the same body operating on the component arrays. Just for reference, Points.Count is around 2.7 million and I'm running this code on a GeForce GT 635M.
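Concretely, the CPU version is something along these lines (a sketch; I'm assuming here that the components are plain float[] arrays, as in my actual code):

System.Threading.Tasks.Parallel.For(0, Points.Count, i =>
{
    xComponent[i] = xComponent[i] - minX;
    yComponent[i] = yComponent[i] - minY;
    zComponent[i] = zComponent[i] - minZ;
});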
I compared the performance of the two approaches and noticed unexpected behavior: on the first run, the code posted above is nearly 10 times slower than the CPU Parallel.For counterpart. Every subsequent run behaves as expected and is faster than the C# version.
My guess is that some kind of lazy (just-in-time) compilation is performed on the CUDA code, so the time measured for the first run also includes the compilation time. Is there a simple way to force precompilation of this code? I know that kernels can be compiled ahead of time, but I would prefer to keep my code simple and stick with the Alea.Parallel.GpuExtension.For loop.
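The only workaround I've come up with so far is a throwaway warm-up call at startup, on the assumption that the first invocation is what triggers the compilation. This is just my own guess, and I'm not even sure the compiled kernel would be reused for the real call, since it's a different lambda capturing different arrays:

// Dummy single-element pass over throwaway arrays, purely an attempt to
// trigger the GPU-side compilation before the real work starts.
var dummyX = new float[1];
var dummyY = new float[1];
var dummyZ = new float[1];
Alea.Parallel.GpuExtension.For(gpu, 0, 1, i =>
{
    dummyX[i] = dummyX[i] - minX;
    dummyY[i] = dummyY[i] - minY;
    dummyZ[i] = dummyZ[i] - minZ;
});

That feels like a hack, though, which is why I'm asking whether there is a proper way to precompile.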