
I need to run my GPU kernel (Alea library) 100 times on the same data, using one integer (0-99) as a parameter. I tried to implement this loop inside the kernel but got strange results, so I had to take the loop out of the kernel and put it around the GPULaunch call like this:

var lp = new LaunchParam(GridDim, BlockDim);
for (int i = 0; i < 100; i++)
{
    GPULaunch(TestKernel, lp, Data, i);
}

The CPU version of the code is highly optimized and uses all 4 cores efficiently (100%). After reorganizing the data in memory according to coalesced memory access principles, I got 92% occupancy and 96% global load efficiency. But the GPU version is still only 50% faster than the CPU version, so I have doubts about whether looping over GPULaunch this way is effective.

As you can see in the graph below, I don't see repetitive memory transfers in the NVIDIA Visual Profiler. Once I have loaded the data onto the GPU (not seen in the graph, but not important here), I get one short memory transfer of the output of the 100 loops, visible at the right end. So my questions are:

  1. Does this method of calling GPULaunch in a loop cause an unseen repeated transfer of the same data? (See the sketch after these questions for how the data is supposed to stay on the device.)
  2. If there is such an overhead, I need to move this loop into the kernel. How can I do that? I tried, but got unstable results, which made me think this approach doesn't fit the GPU parallel programming model.
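
To make the data flow explicit, here is a simplified sketch of what the host side does. The hostData name is a placeholder, and Malloc/Ptr/Gather are the Alea 2.x module API as I understand it; my real code is more involved, this is only to show that the input is copied to the device once:

using (var dData = GPUWorker.Malloc(hostData))   // single host-to-device copy
{
    var lp = new LaunchParam(GridDim, BlockDim);
    for (int i = 0; i < 100; i++)
    {
        // Only the device pointer is passed, so the same device buffer
        // is reused on every launch; no new transfer is expected here.
        GPULaunch(TestKernel, lp, dData.Ptr, i);
    }
    var result = dData.Gather();                 // single device-to-host copy
}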

Thanks in advance

NVIDIA Visual Profiler result

  • Looks like you are still using AleaGPU 2.x, the newest one is AleaGPU 3.x, but with API changes (www.aleagpu.com). Also, what type is `HOIndex` and what type is `Data`? – Xiang Zhang Jan 30 '17 at 09:27
  • I reinstalled Alea.IL.dll, so now it is version 3.0.0.0, but the others are the latest version according to the NuGet info. HOIndex is in fact the integer "i"; I forgot to change it and have replaced it with i. Data represents two input arrays and two output arrays; this is a simplified template. Anyhow, it didn't affect the loop. I tried the kernel loop again and it worked this time, I don't know why. – Mehmet Bingöl Jan 30 '17 at 18:34

1 Answer


I tried to implement the loop inside the kernel one more time and this time it worked; I am not sure what is different. Here is the code (just a template, not working code):

public class GPUModule : ILGPUModule
{
  public GPUModule(GPUModuleTarget target) : base(target)
  {
  }

  [Kernel]
  public void MyKernel(deviceptr<int> Data)
  {
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    int ind = threadIdx.x;

    for (int i = 0; i < 100; i++)
    {
      // Kernel code here
    }
  }

  public void Dilimle_VerilerB(deviceptr<int> Data)
  {
    ...
    var lp = new LaunchParam(GridDim, BlockDim);
    GPULaunch(MyKernel, lp, Data);
    ...
  }
}

The only addition to the kernel was the loop over the integer "i". Unfortunately, it caused the register/thread count to jump from 26 to 42, dropping occupancy from 100% to 50%, which slightly increased the execution time from 2.1 sec to 2.3 sec. So if 100% occupancy could be maintained, moving the loop into the kernel should improve performance considerably by eliminating the GPULaunch overhead.

The 100% occupancy with the loop around GPULaunch was achieved with 1024 threads/block. After the change to the inner kernel loop, I changed it to 128 threads/block. This raised the occupancy to 62% and resulted in an execution time of 1.1 sec. So, as a conclusion, moving such a loop into the kernel gave a 2x increase in GPU performance.
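
For reference, here is a minimal sketch of how the launch is sized for 128 threads/block; N and Data are placeholders for my actual work-item count and device buffer:

const int BlockDim = 128;                       // threads per block used above
int N = 1000000;                                // placeholder: total work items
int GridDim = (N + BlockDim - 1) / BlockDim;    // enough blocks to cover N items
var lp = new LaunchParam(GridDim, BlockDim);
GPULaunch(MyKernel, lp, Data);                  // Data already resides on the GPU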

So the question is: why does the kernel's register/thread count increase from 26 to 42 just by adding one loop over an integer? I suppose the occupancy could still be close to 100% if the register count were around 30.
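
For context, here is a rough back-of-the-envelope on why 42 registers/thread hurts so much. The per-SM figures are assumptions for illustration (they vary by GPU and are not read from my device):

int regsPerSM       = 65536;   // assumed register file per SM
int maxThreadsPerSM = 2048;    // assumed resident-thread limit per SM

int regsPerThread   = 42;      // profiler value after adding the loop
int blockSize       = 1024;    // original threads per block

int threadsByRegs   = regsPerSM / regsPerThread;                  // ~1560 threads fit
int blocksPerSM     = threadsByRegs / blockSize;                  // only 1 block of 1024
int residentThreads = blocksPerSM * blockSize;                    // 1024 threads
double occupancy    = (double)residentThreads / maxThreadsPerSM;  // 0.50 -> 50%

With 128-thread blocks the same register budget allows about 12 blocks (~1536 resident threads), which is why switching the block size recovered part of the occupancy even though the 42 registers/thread did not change.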