----------------a.c---------------------
char *XX;                          // global, shared by the four host threads
func1() {
  for (...) {
    for (i = 0; i < 4; i++)
      cutStartThread(func2, args); // one host thread per GPU
  }
}
---------------b.cu--------------------
func2(args) {
  cudaSetDevice(i);                // i = this thread's device index
  cudaMalloc((void **)&XX, ...);   // allocate XX on the selected device
  mykernel<<<...>>>(XX);
}
--------------------------------------

I want to use multiple GPU devices in my program. There are four Tesla C2075 cards on my node, and I use four threads to manage the four GPUs. In addition, the kernel in each thread is launched several times. Simple pseudocode is shown above. I have two questions:

  1. Variable XX is a very long string and is read-only in the kernel. I want to preserve it across the multiple launches of mykernel. Is it OK to call cudaMalloc only when mykernel is first launched and then pass the pointer to mykernel on every launch? Or should I use the __device__ qualifier?

  2. XX is used by four threads, so I declare it as a global variable in file a.c. Are multiple cudaMalloc calls on the same XX correct, or should I use an array such as char *XX[4]?

Jacky Lau

1 Answer

  1. For usage by kernels running on a single device, you can call cudaMalloc once to create your variable XX holding the string, then pass the pointer created by cudaMalloc (i.e. XX) to whichever kernels need it.

    #define xx_length 20
    char *XX;
    cudaMalloc((void **)&XX, xx_length * sizeof(char));  // allocate once on the current device
    ...
    kernel1<<<...>>>(XX, ...);  // subsequent kernels reuse the same pointer
    ...
    kernel2<<<...>>>(XX, ...);
    etc.
    
  2. Create a separate XX variable for each thread, assuming that each thread is being used to access a different device. How exactly you do this will depend on the scope of XX. But an array of:

    char *XX[num_devices]; 
    

at global scope should be OK (see the sketch below).
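A minimal sketch of how the per-thread allocation might look, assuming a pthreads-style worker along the lines of func2 in the question (the device-index argument, launch configuration, and kernel body here are placeholders, not code from the question):

    #include <cuda_runtime.h>

    #define num_devices 4
    #define xx_length   20

    char *XX[num_devices];                 // one device pointer per GPU, at global scope

    __global__ void mykernel(const char *xx) { /* reads xx only */ }

    // Worker run by thread i, which manages device i.
    void *worker(void *arg)
    {
        int i = *(int *)arg;                                    // this thread's device index

        cudaSetDevice(i);                                       // bind this host thread to device i
        cudaMalloc((void **)&XX[i], xx_length * sizeof(char));  // allocate once, on device i
        // (copying the string contents to the device, e.g. via cudaMemcpy, is omitted here)

        // Reuse the same allocation across repeated launches:
        for (int launch = 0; launch < 10; launch++)
            mykernel<<<1, 32>>>(XX[i]);

        cudaDeviceSynchronize();
        cudaFree(XX[i]);
        return NULL;
    }

Each thread touches only its own slot of XX, so the four cudaMalloc calls (one per device) do not interfere with one another.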

The CUDA OpenMP sample may be of interest as an example of how to use multiple threads to manage multiple GPUs.
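For reference, the thread-per-device pattern used by that sample looks roughly like the sketch below (a simplified illustration, not the sample's actual code; the per-device work inside the parallel region is left as a placeholder):

    #include <omp.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);

        // One OpenMP host thread per GPU; each thread binds itself to one device.
        #pragma omp parallel num_threads(num_gpus)
        {
            int dev = omp_get_thread_num();
            cudaSetDevice(dev);

            // ... per-device allocations and kernel launches go here ...

            cudaDeviceSynchronize();
        }
        return 0;
    }

Compile with nvcc and an OpenMP-enabled host compiler flag (for example, nvcc -Xcompiler -fopenmp with gcc as the host compiler).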

Robert Crovella