
I have two GPUs, one kernel, a single context, and two command queues (one per GPU). I run them in a loop, and after each command queue I have tried both queue.finish() and queue.flush(), hoping the work would run on both GPUs simultaneously.

But what actually happens is that the data is sent to one device first, that GPU does its work, and only then does the other GPU start. It takes twice as long as a single GPU, which is not what I intend to achieve!

I am also reading the buffers back into the host code, and one might think that could be the problem, forcing the second GPU to wait for the first one's result. However, I also commented out the read-back of the results, without any luck. It is still the same.

for (unsigned int iter = 0; iter < numberOfDevices; iter++) {
    // Load the kernel source, creating a program object for the context.
    cl::Program programGA(context, stringifiedSourceCL, true);

    // Create the kernel functor.
    auto kernelGA = cl::make_kernel<cl::Buffer,
                                    cl::Buffer,
                                    cl::Buffer>
                                    (programGA, "kernelGA");

    // CREATE THE BUFFERS.

    d_pop = cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                       (Length * POP_SIZE * sizeof(double)),
                       pop);
    // And other buffers...

    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[iter],
                             cl::NDRange(POP_SIZE / numberOfDevices)),
             d_integerParameters,
             d_doubleParameters, ... and so on...);

    // Flush the corresponding device's queue.
    queue[iter].flush();

    // Get the results from the queue.
    queue[iter].enqueueReadBuffer(buf_half_population,
                                  true,
                                  0,
                                  populationSizeMD * sizeof(double),
                                  populationMD[iter]);

    // Add up the results after every iteration.
    for (int in_iter = 0; in_iter < populationSizeMD; in_iter++, it_j++) {
        population[it_j] = populationMD[iter][in_iter];
    }
}

My question is: what should I do to achieve true concurrency and make the GPUs run simultaneously, without one waiting for the other's result? Should I create two contexts? Should I do something else?

Keep in mind that there is only one kernel.

Mohammad Sohaib
1 Answer


clFinish() — the queue.finish() in your code — is a blocking call.

You need either host-side concurrency plus multiple contexts (one per device), or a delayed flush/finish on all queues after enqueueing all commands on all queues.

For host-side concurrency,

Convert

for (unsigned int iter = 0; iter < numberOfDevices; iter++) {...}

to

Concurrent.for(){} // if the language you are working in has an equivalent

Parallel.For(0,n,i=>{...}); // C#

versions, so that each iteration runs concurrently. For example, Parallel.For works in C#. Then make sure each device works on a different range of the arrays, so the buffer copies do not overlap. If there is PCI-e bandwidth starvation, you can pipeline the work: copy to gpu-1 in the first iteration, compute on gpu-1 and copy to gpu-2 in the second, read results from gpu-1 and compute on gpu-2 in the third, and read results from gpu-2 in the last. If there is no starvation, you can do all copies, all computes, and all result reads in separate loops, like this:

Parallel.For( ... copy to gpus)
sync_point() ---> because another GPU's result can change some input arrays,
             you need to be sure all GPUs have their own copies/buffers updated;
             not needed if it is an embarrassingly parallel workload
Parallel.For( ... compute on gpus + get results)
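
Since the question already uses the OpenCL C++ bindings, a rough sketch of the host-side-concurrency version with std::thread and one context per device could look like the following. runGA, deviceList, kernelSource and the single-buffer kernel signature are placeholders/simplifications, not the exact code from the question:

    // One std::thread per GPU, each with its own context, queue and buffers.
    #include <CL/cl.hpp>
    #include <string>
    #include <thread>
    #include <vector>

    // Work done for a single device: build, run, read back its own slice.
    static void runGA(cl::Device device, std::string kernelSource,
                      double *slice, std::size_t sliceItems)
    {
        cl::Context context(device);              // one context per device
        cl::CommandQueue queue(context, device);  // one queue per device

        cl::Program program(context, kernelSource, true);
        auto kernelGA = cl::make_kernel<cl::Buffer>(program, "kernelGA");

        // Each device owns a buffer over a disjoint slice of the host array,
        // so the two devices' transfers never touch the same bytes.
        cl::Buffer d_slice(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                           sliceItems * sizeof(double), slice);

        kernelGA(cl::EnqueueArgs(queue, cl::NDRange(sliceItems)), d_slice);

        // A blocking read only blocks this thread; the other GPU keeps going.
        queue.enqueueReadBuffer(d_slice, CL_TRUE, 0,
                                sliceItems * sizeof(double), slice);
    }

    void runOnAllDevices(const std::vector<cl::Device> &deviceList,
                         const std::string &kernelSource,
                         double *population, std::size_t totalItems)
    {
        const std::size_t sliceItems = totalItems / deviceList.size();
        std::vector<std::thread> workers;
        for (std::size_t i = 0; i < deviceList.size(); ++i)
            workers.emplace_back(runGA, deviceList[i], kernelSource,
                                 population + i * sliceItems, sliceItems);
        for (auto &t : workers)
            t.join();                             // wait for both GPUs
    }

Because each thread has its own context, queue and slice of the data, a blocking read in one thread cannot stall the other GPU.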

For delayed finish/flush:

 for(){...}  // divide the work into 4-8 parts per GPU,
             // so every GPU gets its turn without waiting long,
             // computing concurrently across the GPUs
 flush1
 flush2
 finish1
 finish2

so that both queues start issuing work to the GPUs at the same time. This version's performance depends on the GPU drivers, while the host-side concurrency version's performance depends on your own optimizations.
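
Applied to the question's loop, and assuming the kernel functor and the buffers are created once before the loop instead of inside it, the delayed flush/finish version could look roughly like this (buf_half_population, populationMD and the other names are taken from the question):

    // Enqueue everything on every queue first; nothing blocks inside this loop.
    for (unsigned int iter = 0; iter < numberOfDevices; iter++) {
        kernelGA(cl::EnqueueArgs(queue[iter],
                                 cl::NDRange(POP_SIZE / numberOfDevices)),
                 d_integerParameters, d_doubleParameters /*, ... */);

        // Non-blocking read (CL_FALSE), so the loop does not wait on this GPU.
        queue[iter].enqueueReadBuffer(buf_half_population, CL_FALSE, 0,
                                      populationSizeMD * sizeof(double),
                                      populationMD[iter]);
    }

    // Push the queued work to every device...
    for (unsigned int iter = 0; iter < numberOfDevices; iter++)
        queue[iter].flush();

    // ...and only block after all devices are already busy.
    for (unsigned int iter = 0; iter < numberOfDevices; iter++)
        queue[iter].finish();

    // Only now is populationMD[] safe to merge into population[].

The key point is that nothing inside the first loop blocks: the read-back is non-blocking, and finish() is only called after every queue has been flushed.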

The first type is easier for me, because I can get better timing data for each device to load-balance the work across all GPUs (not just splitting it in half, but adjusting according to the time spent on each GPU, the buffer copies, and the work ranges). But the second type should be faster if the drivers manage the copies well, especially if you do map/unmap instead of write/read, because map/unmap can use the DMA engines instead of the CPU when copying to a GPU or getting results back.
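
For the map/unmap variant, a minimal sketch with the C++ bindings could look like this, using the question's buf_half_population; whether the transfer really goes through a DMA engine depends on the driver:

    // Map the result buffer for reading instead of calling enqueueReadBuffer.
    double *mapped = static_cast<double *>(
        queue[iter].enqueueMapBuffer(buf_half_population,
                                     CL_TRUE,        // blocking map
                                     CL_MAP_READ,
                                     0,
                                     populationSizeMD * sizeof(double)));

    // Read the results straight out of the mapped pointer...
    for (int i = 0; i < populationSizeMD; i++)
        populationMD[iter][i] = mapped[i];

    // ...then release the mapping so the device owns the buffer again.
    queue[iter].enqueueUnmapMemObject(buf_half_population, mapped);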

huseyin tugrul buyukisik