CUDA multi gpu: running same kernel (dual chip device)

Question

I'm writing code that must run the same kernel on a GTX 690 (dual-chip) board. Since the computation is easily separable, I don't need to exchange data between devices; I just merge the results on the CPU. I understand how to run code on each device, and how to allocate input/output buffers in each device's memory.

I'm having trouble setting the constants used by the kernel, which runs on both devices. Do I need to make an array with one entry per device? If so, how can I check, within the kernel, which device it is running on so that I can access the proper constant from the array?

Here is some of the code I am using. It works properly when I use only one chip (by setting numDev = 1), but not on both.

__constant__ float d_cellSizeZ;  
std::vector<int*> d_cell;  
.................  
bool Init(int cellsN_, float size_){  
  bool res = true;
  if(cudaSuccess != cudaGetDeviceCount(&numDev))  
    return false;  
  //numDev = 1;  
  d_cell.resize(numDev);
  for(int i = 0; i < numDev; ++i){  
    res &= (cudaSuccess == cudaSetDevice(i));  
    if(!res)  
      break;  
    res &= (cudaSuccess == cudaMalloc((void**)&d_cell[i], cellsN_*sizeof(int)));
  };  
  res &= (cudaSuccess == cudaMemcpyToSymbol(d_cellSizeZ, &size_, sizeof(float)));  
  if(!res)  
    Cleanup();  
  return res;  
}

In the kernel I just use the d_cellSizeZ constant. So, how should I set up a constant for each device so that the same kernel can use it?

Another question: if I try exchanging data between devices, will it go through the PCI bus, or is there some internal path on dual-chip boards?

Each device has a different CUcontext. Calling cudaSetDevice switches the host thread's current active CUcontext. CUmodules are loaded per CUcontext. Currently you are only setting the symbol on device[numDev - 1]. Try moving cudaMemcpyToSymbol into the for loop. This will copy the value into each CUcontext's memory space. - Greg Smith, Sep 10 '12 at 14:40
Yeah, that is what I was thinking, but what bothers me is whether the constant is something like a pointer, so that a second call overwrites the value set for the previous device, or whether it is just a kind of name, so that on each device it corresponds to that device's own location in constant memory. - neworld, Sep 10 '12 at 16:03
Unfortunately there is too little information about multi-GPU, and most of it relates to professional multi-Tesla systems. If you could recommend some docs/examples on working with dual-chip devices, I would be grateful :) - neworld, Sep 10 '12 at 16:06

score 2 · Accepted Answer · answered Sep 10 '12 at 16:25

You're only initialising d_cellSizeZ on the last device, so on the other devices it is never set. You need to initialise d_cellSizeZ on each of the devices; the easiest way is to do that inside the loop, as Greg suggested in the comments:

for(int i = 0; i < numDev; ++i)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaMalloc((void**)&d_cell[i], cellsN_*sizeof(int)));
    checkCudaErrors(cudaMemcpyToSymbol(d_cellSizeZ, &size_, sizeof(float)));
}

Reusing the d_cellSizeZ symbol for every device may look a little odd. There is some cleverness going on behind the scenes, but essentially cudaMemcpyToSymbol() looks up the symbol on the currently active device, and therefore copies to the correct device each time. A second call made after switching devices does not overwrite the value set on the previous device.
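
To make the per-device behaviour concrete, here is a minimal sketch, not the asker's actual program. It assumes the CUDA runtime API and at least one visible device. The kernel readConstant, the macro CUDA_CHECK (standing in for the checkCudaErrors helper used above, whose definition is not shown in this thread) and the per-device test values are hypothetical names introduced only for illustration. Each device is deliberately given a different value so that the independence of the copies is visible.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro; stands in for checkCudaErrors above.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Same symbol as in the question; each device holds its own instance.
__constant__ float d_cellSizeZ;

// Illustrative kernel: every thread writes the constant to an output array.
__global__ void readConstant(float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = d_cellSizeZ;
}

int main()
{
    int numDev = 0;
    CUDA_CHECK(cudaGetDeviceCount(&numDev));

    const int n = 4;
    std::vector<float*> d_out(numDev, nullptr);

    // Set the constant once per device, with a different value on each one.
    for (int i = 0; i < numDev; ++i) {
        float value = 1.0f + static_cast<float>(i);  // arbitrary test value
        CUDA_CHECK(cudaSetDevice(i));
        CUDA_CHECK(cudaMalloc(&d_out[i], n * sizeof(float)));
        CUDA_CHECK(cudaMemcpyToSymbol(d_cellSizeZ, &value, sizeof(float)));
    }

    // Launch the same kernel on every device and read back what it saw.
    for (int i = 0; i < numDev; ++i) {
        float h_out[n];
        CUDA_CHECK(cudaSetDevice(i));
        readConstant<<<1, n>>>(d_out[i], n);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpy(h_out, d_out[i], n * sizeof(float),
                              cudaMemcpyDeviceToHost));
        std::printf("device %d sees d_cellSizeZ = %f\n", i, h_out[0]);
        CUDA_CHECK(cudaFree(d_out[i]));
    }
    return 0;
}
```

If the symbol were shared between devices, every device would report the value written last. With per-device instances, each device is expected to report the value that was copied while it was the active device, so the kernel never needs to know which device it is running on.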
