For example, I have built a PAI cluster with 2 workers, and each worker has 2 GPUs. If I want to use four GPUs to run a task, can this cluster meet the demand and use both workers to run the task?
Yes. Please refer to the distributed TensorFlow example for details: https://github.com/Microsoft/pai/tree/master/examples/tensorflow#distributed-tensorflow-cifar-10-image-classification
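To illustrate how four GPUs can span two workers, a distributed TensorFlow job of this shape is typically described by a cluster spec that lists the parameter server and worker addresses. This is a minimal sketch; the hostnames and ports below are placeholders, not values from the linked example (in a real OpenPAI job the framework injects the actual addresses into each container's environment).

```python
import json

# Hypothetical cluster spec: one parameter-server task and two worker
# tasks. Hostnames/ports are placeholders for illustration only.
cluster_spec = {
    "ps": ["node-a:2222"],
    "worker": ["node-a:2223", "node-b:2222"],
}

# Each worker container is allocated 2 GPUs, so two workers together
# provide the four GPUs the question asks about.
gpus_per_worker = 2
total_gpus = gpus_per_worker * len(cluster_spec["worker"])

print(json.dumps(cluster_spec))
print(total_gpus)
```

Each task would pass this spec (plus its own role name and index) to the TensorFlow server it starts, so the training graph can be split across both machines.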

fanyangCS
By the way, the term we use for running a distributed training is "job"; "task" has a different meaning in OpenPAI. – fanyangCS Aug 21 '18 at 07:51
OK, is running a job in a cluster with one master and one worker, both of which have 2 GPUs, the same as running a job in a cluster with two workers? – 邓泽帅 Aug 21 '18 at 08:21
The short answer is yes, it can be the same. The long answer depends on how you compose the job submission file. In OpenPAI, a job may consist of multiple tasks, each of which is a container running on a physical node. OpenPAI groups tasks into task roles, and each task role can have different behavior (e.g., parameter server or worker). For details, please refer to https://github.com/Microsoft/pai/blob/master/docs/job_tutorial.md – fanyangCS Aug 21 '18 at 08:34
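Putting the comments above together, a job submission file for this cluster could declare two task roles, one `ps` task and two `worker` tasks with 2 GPUs each (4 GPUs total). This is only a sketch: the job name, image, resource numbers, and command are illustrative, and the exact field and environment-variable names should be checked against the job tutorial linked above.

```json
{
  "jobName": "distributed-cifar10",
  "image": "tensorflow/tensorflow:1.4.0",
  "taskRoles": [
    {
      "name": "ps",
      "taskNumber": 1,
      "cpuNumber": 2,
      "memoryMB": 8192,
      "gpuNumber": 0,
      "command": "python train.py --job_name=ps"
    },
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 4,
      "memoryMB": 16384,
      "gpuNumber": 2,
      "command": "python train.py --job_name=worker"
    }
  ]
}
```

With `taskNumber: 2` and `gpuNumber: 2` in the `worker` role, the scheduler would place one 2-GPU worker container on each physical node, which is how a single job uses all four GPUs across both workers.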