GPU 0 utilization higher than other GPUs on Amazon SageMaker SMDP (Distributed Training)

Asked Sep 08 '22 at 18:21

Active Oct 07 '22 at 08:24

Viewed 36 times

When using SageMaker Data Parallelism (SMDP), my team sees higher utilization on GPU 0 than on the other GPUs. What is the likely cause? Could it be related to the data loader workers that run on the CPU? I would expect SMDP to shard the dataset equally across GPUs.

edited Oct 07 '22 at 08:24

asked Sep 08 '22 at 18:21

Philipp Schmid

Is this behaviour seen throughout training or only at the start? How much higher is GPU 0 utilization than that of the other GPUs? Does throughput scale well as you increase the cluster size? Also, please make sure to use the distributed DataLoader. – Arun Lokanatha Sep 15 '22 at 00:23
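To make the "distributed DataLoader" point concrete, here is a minimal pure-Python sketch of the strided sharding scheme that a distributed sampler typically applies (modelled on how PyTorch's `DistributedSampler` behaves without shuffling, as far as I understand it). The helper name `shard_indices` and the sizes used are hypothetical and only for illustration; it is not SMDP's own implementation and assumes the dataset has at least as many samples as there are ranks.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Return the sample indices one rank would read (hypothetical helper).

    Assumes dataset_len >= world_size and no shuffling.
    """
    # Every rank gets the same number of samples, rounded up.
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size

    indices = list(range(dataset_len))
    # Pad by repeating the first few indices so the total divides evenly.
    indices += indices[: total - dataset_len]

    # Strided split: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


if __name__ == "__main__":
    world_size = 8  # e.g. one process per GPU on an 8-GPU instance (assumed)
    for rank in range(world_size):
        shard = shard_indices(1000, world_size, rank)
        print(f"rank {rank}: {len(shard)} samples, first index {shard[0]}")
```

If each process builds its DataLoader with a sampler like this, every GPU receives the same number of samples, so uneven data sharding alone would not explain a busier GPU 0. If the sampler is omitted, each rank iterates over the full dataset instead, which is worth ruling out first.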

0 Answers