
I am training a detectron2 model on Google Cloud Platform and want to run the training on 4 GPUs.

To launch the training I am using:

from detectron2.engine import launch

if __name__ == "__main__":
    launch(main, num_gpus_per_machine=4)

But when I run the training I get an error: "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set"

When I ran this training with num_gpus_per_machine=1, it worked fine. What does this mean, and how can I solve it?

Brock Brown
  • Have you tried `torchrun your_file.py`? https://stackoverflow.com/questions/56805951/valueerror-error-initializing-torch-distributed-using-env-rendezvous-enviro – Brock Brown May 11 '23 at 16:27

1 Answer


It means that torch.distributed cannot find the MASTER_ADDR environment variable. This variable specifies the address of the machine that will act as the master (rank 0) of the distributed training process.

To solve this error, set the MASTER_ADDR environment variable to the address of the master machine before launching the training. You can do this by running the following command:

export MASTER_ADDR=<master_address>

where <master_address> is the IP address or hostname of the machine that will be the master.
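Since you are on a single machine with 4 GPUs, the master is the machine itself, so localhost is a valid address. As a sketch (the port number 29500 here is an arbitrary choice, not a requirement; env:// rendezvous typically also expects MASTER_PORT to be set), you could set the variables from inside the training script before calling launch:

```python
import os

# All 4 worker processes run on this one machine, so the master
# address can simply be the loopback interface.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
# Any free port works; 29500 is a common default choice.
os.environ.setdefault("MASTER_PORT", "29500")

# With the rendezvous variables in place, the launch call from the
# question should be able to initialize torch.distributed:
# launch(main, num_gpus_per_machine=4)
```

Using os.environ.setdefault (rather than a plain assignment) means values already exported in the shell, e.g. by your cluster setup, are not overwritten.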

Joevanie