
I am working on a project that involves captioning, and I wanted to use a model I found on GitHub to run inference. The problem is that in the main file they use distributed training to train on multiple GPUs, and I have only one.

torch.distributed.init_process_group(backend="nccl")

They use this to initialize the process group, and

world_size = torch.distributed.get_world_size()
torch.cuda.set_device(args.local_rank)
args.world_size = world_size
rank = torch.distributed.get_rank()
args.rank = rank

this to set up the world size and rank.
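Since I have only one GPU, I assume this whole block should reduce to a world size of 1 and rank 0 on device 0. This is the hard-coded equivalent I had in mind (a sketch based on my assumption, using the variable names from the snippet above, not code from the repo):

torch.cuda.set_device(0)   # only device 0 exists
args.world_size = 1        # one process in total
args.rank = 0              # this process has rank 0
args.local_rank = 0        # local rank on this machine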

At first I tried running python -m torch.distributed.launch caption.py <other arguments>, and it shows me this error:

Distributed package doesn't have NCCL built in

 warnings.warn(
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [LAPTOP-NOUPN41C]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [LAPTOP-NOUPN41C]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [LAPTOP-NOUPN41C]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [LAPTOP-NOUPN41C]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "d:\iitkgp\CLIP4IDC\CLIP4IDC-master\main_task_caption.py", line 28, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "D:\Anaconda\envs\CLIP4IDC\lib\site-packages\torch\distributed\distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\Anaconda\envs\CLIP4IDC\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3020) of binary: D:\Anaconda\envs\CLIP4IDC\python.exe
Traceback (most recent call last):
  File "D:\Anaconda\envs\CLIP4IDC\lib\runpy.py", line 197, in _run_module_as_main

Then I tried commenting out the lines that use the distributed launch, but that resulted in the same error.
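If it helps, this is the direction I was going to try next (a sketch of my own guess, not code from the repo): replace the NCCL init with a single-process group on the gloo backend, which as far as I know is the only backend built into PyTorch on Windows.

import os
import torch.distributed as dist

# Sketch (my guess): a single-process group on the gloo backend,
# since NCCL isn't built into the Windows wheels of PyTorch.
# With no explicit init_method, init_process_group falls back to
# the env:// rendezvous, so MASTER_ADDR/MASTER_PORT must be set.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", world_size=1, rank=0)

Would that be the right way to get a single-GPU run, or is there a cleaner way to bypass the distributed setup entirely?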
