I am building and testing OpenPAI v0.14.0. Previously, I built OpenPAI on a single-node, 4-GPU machine and used it for 4-GPU distributed parallel processing.
This time, a new single-node, 2-GPU machine arrived and I connected the two nodes. The OpenPAI dashboard shows 6 GPUs available. However, when I tried to assign a single job to all 6 GPUs, I got Exit Code: -7200, Exit Reason: maxGPUs = 4.
I suspect maxGPUs = 4 refers to the largest number of GPUs on a single node. Does OpenPAI only support GPU distribution within one node?
I found the Distributed Job Examples on the openpai.readthedocs.io site: https://openpai.readthedocs.io/en/latest/manual/cluster-user/advanced-jobs.html#distributed-job-examples
One of the two examples, TensorFlow CIFAR10, seems to distribute different task roles to different nodes, i.e. the parameter server and the worker. The other example, Horovod PyTorch, not only contains GPU-distributed code but already uses OpenMPI inside the code to distribute work across nodes.
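For reference, my understanding of the kind of code-level setup involved is roughly the sketch below (using plain torch.distributed instead of Horovod; I am assuming MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK are supplied by the launcher or platform, they are just placeholders here):

```python
# Minimal sketch of code-level multi-node data parallelism with PyTorch.
# Assumption: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, and LOCAL_RANK
# are provided by the job launcher; CUDA does not discover them on its own.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU; NCCL backend for GPU-to-GPU communication.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
    )
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop: each process works on its own data shard,
    # and DDP all-reduces gradients across all nodes/GPUs.


if __name__ == "__main__":
    main()
```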
Can I use multi-node, distributed-GPU parallelization in OpenPAI only if the program itself is written for multi-node distribution at the code level, e.g. with OpenMPI?
Or does OpenPAI automatically handle multi-node, distributed multi-GPU parallelism for a program that uses only the CUDA library?
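My assumption so far is that CUDA by itself only enumerates the GPUs attached to the local machine, which is why I suspect some code-level distribution is needed. A quick check like the following (a trivial sketch) would report 4 on the first node and 2 on the second, never 6:

```python
# CUDA device enumeration only sees the GPUs on the local node.
import torch

print(torch.cuda.device_count())  # 4 on the 4-GPU node, 2 on the 2-GPU node
```

Please correct me if this assumption is wrong.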
Thank you.