Each MPI task issues its CUDA calls to whichever GPU is currently selected, and you choose that GPU with the function cudaSetDevice(). In your case, since each node contains 2 GPUs, you can switch between them with cudaSetDevice(0) and cudaSetDevice(1).
If you don't specify the GPU with cudaSetDevice(), typically by deriving the device index from the MPI task rank (see the sketch below), I believe both MPI tasks will submit their CUDA work to the same default GPU (device 0), where the kernels simply get serialized. Likewise, if you run 3 or more MPI tasks per node, at least two of them are guaranteed to share a GPU, so their kernels will again be serialized (or time-sliced) on that device rather than running in parallel; strictly speaking that is GPU contention rather than a race condition, but either way you lose the parallelism you are after.