I have two GPUs, one kernel, a single context, and two command queues (one per GPU). I run each command queue in a loop, and after each enqueue I have tried both queue.finish() and queue.flush(), in the hope of running the work on the two GPUs simultaneously.
But what actually happens is that the data is sent to the first device, that GPU performs its work, and only then does the second GPU start working. The whole run takes twice as long as with a single GPU, which is not what I intend to achieve.
I am also reading the buffers back into host code, and one might suspect that the blocking read forces the second GPU to wait for the first one's result. However, I also commented out the read-back of the results, without any luck: it is still the same.
for (unsigned int iter = 0; iter < numberOfDevices; iter++) {
    // Load the kernel source, creating a program object for the context.
    cl::Program programGA(context, stringifiedSourceCL, true);

    // Create the kernel functor.
    auto kernelGA = cl::make_kernel<cl::Buffer,
                                    cl::Buffer,
                                    cl::Buffer>(programGA, "kernelGA");

    // Create the buffers.
    d_pop = cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                       (Length * POP_SIZE * sizeof(double)),
                       pop);
    // And other buffers...

    // Enqueue the kernel on the corresponding device's queue.
    kernelGA(cl::EnqueueArgs(queue[iter],
                             cl::NDRange(POP_SIZE / numberOfDevices)),
             d_integerParameters,
             d_doubleParameters, ... and so on...);

    // Submit the enqueued commands to the device.
    queue[iter].flush();

    // Read the results back from the queue (blocking read).
    queue[iter].enqueueReadBuffer(buf_half_population,
                                  true,
                                  0,
                                  populationSizeMD * sizeof(double),
                                  populationMD[iter]);

    // Accumulate the results after every iteration.
    for (int in_iter = 0; in_iter < populationSizeMD; in_iter++, it_j++) {
        population[it_j] = populationMD[iter][in_iter];
    }
}
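For clarity, this is the kind of overlapped scheduling I am trying to achieve: submit the kernel to every queue before the host blocks anywhere, and only then collect the results. This is just a sketch of my intent, reusing the same (partly elided) names as above, and I have not verified that it behaves differently:

    // First pass: enqueue the kernel on every device's queue and flush,
    // so both GPUs can start working before the host blocks anywhere.
    for (unsigned int iter = 0; iter < numberOfDevices; iter++) {
        kernelGA(cl::EnqueueArgs(queue[iter],
                                 cl::NDRange(POP_SIZE / numberOfDevices)),
                 d_integerParameters,
                 d_doubleParameters, ... and so on...);
        queue[iter].flush();  // submit the work, but do not wait for it
    }

    // Second pass: collect the results. The blocking reads now only
    // serialize the host; both devices should already be running.
    for (unsigned int iter = 0; iter < numberOfDevices; iter++) {
        queue[iter].enqueueReadBuffer(buf_half_population,
                                      true,
                                      0,
                                      populationSizeMD * sizeof(double),
                                      populationMD[iter]);
    }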
My question is: what should I do to achieve true concurrency and make the GPUs run simultaneously, without one waiting for the other's result? Should I create two contexts, or do something else? Keep in mind that there is only one kernel.