For example, I have built a PAI cluster with 2 workers, and each worker has 2 GPUs. If I want to use four GPUs to run a task, can this cluster meet the demand and use both workers to run the task?

1 Answer

Yes. Please refer to the distributed TensorFlow example for details: https://github.com/Microsoft/pai/tree/master/examples/tensorflow#distributed-tensorflow-cifar-10-image-classification

fanyangCS
  • By the way, the term we use for running a distributed training is "job"; "task" has a different meaning in OpenPAI. – fanyangCS Aug 21 '18 at 07:51
  • OK, so is running a job in a cluster with one master and one worker, both of which have 2 GPUs, the same as running a job in a cluster with two workers? – 邓泽帅 Aug 21 '18 at 08:21
  • The short answer is yes, it can be the same. The long answer depends on how you compose the job submission file. In OpenPAI, a job may consist of multiple tasks, each of which is a container running on a physical node. OpenPAI groups tasks into task roles, and each task role can have different behavior (e.g., parameter server or worker). For details, please refer to https://github.com/Microsoft/pai/blob/master/docs/job_tutorial.md – fanyangCS Aug 21 '18 at 08:34
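
As a sketch of what that comment describes, a four-GPU job spanning both workers might use a job submission file like the one below. The field names follow the job config format documented in job_tutorial.md; the job name, image, and commands are placeholders, not values from the linked example:

```json
{
  "jobName": "tf-distributed-sketch",
  "image": "your.registry/tensorflow:latest",
  "taskRoles": [
    {
      "name": "ps_server",
      "taskNumber": 1,
      "cpuNumber": 2,
      "memoryMB": 8192,
      "gpuNumber": 0,
      "command": "python train.py --role ps"
    },
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 4,
      "memoryMB": 16384,
      "gpuNumber": 2,
      "command": "python train.py --role worker"
    }
  ]
}
```

Here the "worker" task role requests 2 tasks with 2 GPUs each, so the scheduler can place one worker task on each of the two physical nodes, giving the job four GPUs in total.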