
I need to run my GPU kernel (Alea library) 100 times on the same data, using one integer (0-99) as a parameter. I tried to implement this loop inside the kernel but got strange results, so I had to take the loop out of the kernel and put it around the GPULaunch call like this:

var lp = new LaunchParam(GridDim, BlockDim);
for (int i = 0; i < 100; i++)
{
    GPULaunch(TestKernel, lp, Data, i);
}

The CPU version of the code is highly optimized and uses all 4 cores efficiently (100%). After reorganizing the data in memory according to coalesced memory access principles, I got 92% occupancy and 96% global load efficiency. But the GPU version is still only 50% faster than the CPU version, so I have doubts about whether looping over GPULaunch this way is effective.

As you can see in the graph below, I don't see repetitive memory transfers in the NVIDIA Visual Profiler. Once I have loaded the data onto the GPU (not seen in the graph, but not important here), I get one short memory transfer of the output of the 100 loops, visible at the right end. So my questions are:

  1. Does this method of calling GPULaunch in a loop cause an unseen repeated transfer of the same data? (See the sketch after these questions for how the data is supposed to stay on the device.)
  2. If there is such an overhead, I need to move this loop into the kernel. How can I do that? I tried, but got unstable results, which made me think this approach doesn't fit the GPU parallel programming model.
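
To make the data flow explicit, here is a simplified sketch of what the host side does. The hostData name is a placeholder, and Malloc/Ptr/Gather are the Alea 2.x module API as I understand it; my real code is more involved, this is only to show that the input is copied to the device once:

using (var dData = GPUWorker.Malloc(hostData))   // single host-to-device copy
{
    var lp = new LaunchParam(GridDim, BlockDim);
    for (int i = 0; i < 100; i++)
    {
        // Only the device pointer is passed, so the same device buffer
        // is reused on every launch; no new transfer is expected here.
        GPULaunch(TestKernel, lp, dData.Ptr, i);
    }
    var result = dData.Gather();                 // single device-to-host copy
}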

Thanks in advance

NVIDIA Visual Profiler result

  • Looks like you are still using AleaGPU 2.x, the newest one is AleaGPU 3.x, but with API changes (www.aleagpu.com). Also, what type is `HOIndex` and what type is `Data`? – Xiang Zhang Jan 30 '17 at 09:27
  • I reinstalled Alea.IL.dll, so now it is version 3.0.0.0, but the others are the latest version according to the NuGet info. HOIndex is in fact the integer "i"; I forgot to change it and have replaced it with i. Data represents two input arrays and two output arrays; this is a simplified template. Anyhow, it didn't affect the loop. I tried the kernel loop again and it worked this time, I don't know why. – Mehmet Bingöl Jan 30 '17 at 18:34

1 Answer


I tried to implement the loop inside the kernel one more time and this time it worked; I am not sure what is different. Here is the code (just a template, not working code):

public class GPUModule : ILGPUModule
{
  public GPUModule(GPUModuleTarget target) : base(target)
  {
  }

  [Kernel]
  public void MyKernel(deviceptr<int> Data)
  {
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    int ind = threadIdx.x;

    for (int i = 0; i < 100; i++)
    {
      // Kernel code here
    }
  }

  public void Dilimle_VerilerB(deviceptr<int> Data)
  {
    ...
    var lp = new LaunchParam(GridDim, BlockDim);
    GPULaunch(MyKernel, lp, Data);
    ...
  }
}

The only addition to the kernel was the loop over the integer "i". Unfortunately, it caused the register/thread count to jump from 26 to 42, dropping occupancy from 100% to 50%, which slightly increased the execution time from 2.1 sec to 2.3 sec. So if 100% occupancy could be maintained, moving the loop into the kernel should improve performance considerably by eliminating the GPULaunch overhead.

The 100% occupancy with the loop around GPULaunch was achieved with 1024 threads/block. After the change to the inner kernel loop, I changed it to 128 threads/block. This raised the occupancy to 62% and resulted in an execution time of 1.1 sec. So, as a conclusion, moving such a loop into the kernel gave a 2x increase in GPU performance.
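
For reference, here is a minimal sketch of how the launch is sized for 128 threads/block; N and Data are placeholders for my actual work-item count and device buffer:

const int BlockDim = 128;                       // threads per block used above
int N = 1000000;                                // placeholder: total work items
int GridDim = (N + BlockDim - 1) / BlockDim;    // enough blocks to cover N items
var lp = new LaunchParam(GridDim, BlockDim);
GPULaunch(MyKernel, lp, Data);                  // Data already resides on the GPU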

So the question is: why does the kernel's register/thread count increase from 26 to 42 just by adding one loop over an integer? I suppose the occupancy could still be close to 100% if the register count were around 30.
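
For context, here is a rough back-of-the-envelope on why 42 registers/thread hurts so much. The per-SM figures are assumptions for illustration (they vary by GPU and are not read from my device):

int regsPerSM       = 65536;   // assumed register file per SM
int maxThreadsPerSM = 2048;    // assumed resident-thread limit per SM

int regsPerThread   = 42;      // profiler value after adding the loop
int blockSize       = 1024;    // original threads per block

int threadsByRegs   = regsPerSM / regsPerThread;                  // ~1560 threads fit
int blocksPerSM     = threadsByRegs / blockSize;                  // only 1 block of 1024
int residentThreads = blocksPerSM * blockSize;                    // 1024 threads
double occupancy    = (double)residentThreads / maxThreadsPerSM;  // 0.50 -> 50%

With 128-thread blocks the same register budget allows about 12 blocks (~1536 resident threads), which is why switching the block size recovered part of the occupancy even though the 42 registers/thread did not change.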