I am working with two machines, each of which has two GPUs, so I have 4 GPUs in total. I am following the official PyTorch example to train on the ImageNet dataset. When I start the training on both machines, both processes sit in a waiting state and training never begins.
TCP Port
netstat -ntpl
Output
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 10.246.246.48:51651 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22443 0.0.0.0:* LISTEN -
tcp 0 0 10.246.246.48:37707 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:42803 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:6100 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:35129 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:6010 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:6011 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:58843 0.0.0.0:* LISTEN -
tcp6 0 0 :::39047 :::* LISTEN -
tcp6 0 0 :::22443 :::* LISTEN -
tcp6 0 0 :::35693 :::* LISTEN 14644/python
tcp6 0 0 :::111 :::* LISTEN -
tcp6 0 0 :::6100 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 ::1:631 :::* LISTEN -
tcp6 0 0 127.0.0.1:46617 :::* LISTEN -
tcp6 0 0 ::1:6010 :::* LISTEN -
tcp6 0 0 ::1:6011 :::* LISTEN -
tcp6 0 0 :::50907 :::* LISTEN -
Code: I am using the official example from https://github.com/pytorch/examples/tree/main/imagenet
Node 1: python imagenet_multi_node.py -a resnet50 --dist-url tcp://127.0.0.1:35693 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/
Node 2: python imagenet_multi_node.py -a resnet50 --dist-url tcp://127.0.0.1:35693 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/
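For reference, my understanding of how the script expands these arguments: with --multiprocessing-distributed, the official example multiplies --world-size by the number of GPUs per node and derives each worker's global rank from the node rank and local GPU index (this is a sketch of the logic in pytorch/examples' main.py, with ngpus_per_node normally coming from torch.cuda.device_count()):

```python
# Sketch of the world-size/rank expansion done by the official ImageNet
# example when --multiprocessing-distributed is set. With 2 nodes and
# 2 GPUs per node (my setup), the effective world size becomes 4.

def expand_distributed_args(node_rank, node_world_size, ngpus_per_node):
    """Return (global world size, global ranks of this node's workers)."""
    # main.py: args.world_size = ngpus_per_node * args.world_size
    global_world_size = node_world_size * ngpus_per_node
    # main.py (per spawned process): args.rank = args.rank * ngpus_per_node + gpu
    global_ranks = [node_rank * ngpus_per_node + gpu
                    for gpu in range(ngpus_per_node)]
    return global_world_size, global_ranks

# Node 1 (--rank 0) and Node 2 (--rank 1), each with 2 GPUs:
print(expand_distributed_args(0, 2, 2))  # (4, [0, 1])
print(expand_distributed_args(1, 2, 2))  # (4, [2, 3])
```

So --world-size 2 --rank 0/1 should be correct for two nodes; each node then spawns one process per GPU, all of which must be able to reach the rendezvous address given in --dist-url.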
Console output (the process hangs after this line, with no traceback):
(multi_node) coremax@VM0403165230:~/Documents/GridMask/imagenet_grid$ python imagenet_multi_node.py -a resnet50 --dist-url tcp://127.0.0.1:57095 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/
Use GPU: 0 for training