I am working with two machines, each with two GPUs, so four GPUs in total. I am following the official PyTorch ImageNet example to train on the ImageNet dataset. When I start training on both machines, both processes sit in a waiting state and training never begins.

TCP Port

netstat -ntpl

Output

(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 10.246.246.48:51651     0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:22443           0.0.0.0:*               LISTEN      -                   
tcp        0      0 10.246.246.48:37707     0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:42803           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:6100            0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:5432          0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:35129         0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:6010          0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:6011          0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:58843           0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::39047                :::*                    LISTEN      -                   
tcp6       0      0 :::22443                :::*                    LISTEN      -                   
tcp6       0      0 :::35693                :::*                    LISTEN      14644/python        
tcp6       0      0 :::111                  :::*                    LISTEN      -                   
tcp6       0      0 :::6100                 :::*                    LISTEN      -                   
tcp6       0      0 :::22                   :::*                    LISTEN      -                   
tcp6       0      0 ::1:631                 :::*                    LISTEN      -                   
tcp6       0      0 127.0.0.1:46617         :::*                    LISTEN      -                   
tcp6       0      0 ::1:6010                :::*                    LISTEN      -                   
tcp6       0      0 ::1:6011                :::*                    LISTEN      -                   
tcp6       0      0 :::50907                :::*                    LISTEN      -    

Code (the official PyTorch ImageNet example): https://github.com/pytorch/examples/tree/main/imagenet

Node 1: python imagenet_multi_node.py -a resnet50 --dist-url tcp://127.0.0.1:35693 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/

Node 2: python imagenet_multi_node.py -a resnet50 --dist-url tcp://127.0.0.1:35693 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/
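One thing I am suspicious of (my own assumption, not stated in the example): both commands pass `--dist-url tcp://127.0.0.1:35693`, but with a TCP rendezvous every node must be able to reach the rank-0 node at that address, and loopback on node 2 refers to node 2 itself. A small sketch that flags a loopback rendezvous host (the `check_dist_url` helper is hypothetical, written just to illustrate the check):

```python
import ipaddress
from urllib.parse import urlparse

def check_dist_url(dist_url: str) -> bool:
    """Return True if the rendezvous host could work for multi-node runs."""
    host = urlparse(dist_url).hostname
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname; assume DNS resolves it on every node
    return not addr.is_loopback

print(check_dist_url("tcp://127.0.0.1:35693"))      # False: loopback
print(check_dist_url("tcp://10.246.246.48:35693"))  # True: LAN address
```

If this is the cause, both nodes would need `--dist-url tcp://10.246.246.48:35693` (node 1's LAN address from the netstat output above), not `127.0.0.1`.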

Traceback

(multi_node) coremax@VM0403165230:~/Documents/GridMask/imagenet_grid$ python imagenet_multi_node.py -a resnet50 --dist-url tcp://127.0.0.1:57095 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/
Use GPU: 0 for training