I am building and testing OpenPAI v0.14.0. Previously, I built OpenPAI on a 1-node, 4-GPU machine and used it for 4-GPU distributed-parallel processing.

This time, a new 1-node, 2-GPU machine arrived, and I connected the two nodes. The OpenPAI dashboard showed 6 GPUs available. However, when I tried to assign 6 GPUs to one job, I got Exit Code: -7200, Exit Reason: maxGPUs = 4.

I suspect that maxGPUs = 4 is the largest number of GPUs on any single node. Does OpenPAI only support GPU distribution within one node?

I found the Distributed Job Examples on the openpai.readthedocs.io site: https://openpai.readthedocs.io/en/latest/manual/cluster-user/advanced-jobs.html#distributed-job-examples

Of the two examples there, the TensorFlow CIFAR-10 one seems to distribute different tasks to different nodes, i.e. the parameter server and the workers. The other example, Horovod PyTorch, not only contains distributed-GPU code but already uses OpenMPI inside the code to distribute work across nodes.
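For context on what the parameter-server example's job description looks like, here is a rough sketch of an OpenPAI protocol-v2 job config with two task roles. The field names follow my recollection of the v2 YAML spec, and the image name, script name, and resource numbers are hypothetical placeholders; verify the exact schema against your cluster's documentation:

```yaml
# Hedged sketch, not a verbatim copy of the linked example.
protocolVersion: 2
name: cifar10-ps-sketch
taskRoles:
  ps:
    instances: 1
    dockerImage: your-training-image   # placeholder
    resourcePerInstance:
      cpu: 4
      memoryMB: 8192
      gpu: 0
    commands:
      - python train.py --role ps      # placeholder script
  worker:
    instances: 2                       # scheduled as separate containers,
    dockerImage: your-training-image   # possibly on different nodes
    resourcePerInstance:
      cpu: 8
      memoryMB: 16384
      gpu: 2                           # per-instance GPU count must fit on one node
    commands:
      - python train.py --role worker
```

The key point for the question above: each task-role instance is one container, and its `gpu` count must fit on a single node, which is consistent with the maxGPUs = 4 error.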

Can I use multi-node distributed-GPU parallelization in OpenPAI only when the code itself is written for multi-node distribution, e.g. directly using OpenMPI at the code level?

Or does OpenPAI automatically handle multi-node, multi-GPU parallel execution for a program that uses only the CUDA library?

Thank you.

jsh-fw
  • Please share your job config. Did you use multiple task roles, or only one task in one task role? The distributed logic is NOT handled by OpenPAI; it is handled by the deep learning framework you use, e.g. parameter server, all-reduce, etc. OpenPAI only schedules and runs your jobs in containers. – abuccts May 25 '20 at 02:53
  • Thanks for the reply! That's exactly the answer I wanted to know :) – jsh-fw May 25 '20 at 04:54
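To make the comment above concrete: the rank/world-size wiring that frameworks like torch.distributed or Horovod need lives in your training code, not in OpenPAI. A minimal, framework-agnostic sketch of that wiring, where the function name and the placement-info format are hypothetical (OpenPAI's actual injected environment variables may differ):

```python
def build_dist_env(task_index, worker_hosts):
    """Turn scheduler-provided placement info into the env vars a
    distributed training framework typically expects.

    task_index:   this container's index among the workers (assumed input)
    worker_hosts: list of "host:port" strings, one per worker (assumed input)
    """
    master_host, master_port = worker_hosts[0].split(":")
    return {
        "MASTER_ADDR": master_host,        # rank-0 host for rendezvous
        "MASTER_PORT": master_port,
        "RANK": str(task_index),           # this process's global rank
        "WORLD_SIZE": str(len(worker_hosts)),
    }

# Example: worker 1 of 2 in a two-node job
env = build_dist_env(1, ["10.0.0.1:23456", "10.0.0.2:23456"])
print(env["RANK"], env["WORLD_SIZE"])  # → 1 2
```

The framework then uses these values to set up its own communication (all-reduce, parameter server, etc.); OpenPAI's role ends at starting the containers and exposing the placement information.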

0 Answers