I am working on a project that involves captioning. I wanted to use a model I found on GitHub to run inference, but the problem is that in the main file they use distributed training to train on multiple GPUs, and I have only one.
torch.distributed.init_process_group(backend="nccl")
They use this to initialize the process group, and
world_size = torch.distributed.get_world_size()
torch.cuda.set_device(args.local_rank)
args.world_size = world_size
rank = torch.distributed.get_rank()
args.rank = rank
this to set up the world size and rank.
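For context, this is roughly what that setup boils down to once the pieces are put together. The argparse wiring is my reconstruction (torch.distributed.launch passes --local_rank to each process it spawns); only the torch.distributed calls are verbatim from the repo:

import argparse
import torch
import torch.distributed

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each process it spawns
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# With the default "env://" init method, this reads MASTER_ADDR,
# MASTER_PORT, RANK, and WORLD_SIZE from the environment variables
# that the launcher sets. This is the call that fails for me.
torch.distributed.init_process_group(backend="nccl")
args.world_size = torch.distributed.get_world_size()
torch.cuda.set_device(args.local_rank)
args.rank = torch.distributed.get_rank()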
At first I tried running
python -m torch.distributed.launch caption.py <other arguments>
It shows me this error:
Distributed package doesn't have NCCL built in
warnings.warn(
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [LAPTOP-NOUPN41C]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "d:\iitkgp\CLIP4IDC\CLIP4IDC-master\main_task_caption.py", line 28, in <module>
torch.distributed.init_process_group(backend="nccl")
File "D:\Anaconda\envs\CLIP4IDC\lib\site-packages\torch\distributed\distributed_c10d.py", line 761, in init_process_group
default_pg = _new_process_group_helper(
File "D:\Anaconda\envs\CLIP4IDC\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3020) of binary: D:\Anaconda\envs\CLIP4IDC\python.exe
Traceback (most recent call last):
File "D:\Anaconda\envs\CLIP4IDC\lib\runpy.py", line 197, in _run_module_as_main
Then I tried commenting out the lines that use the distributed setup, but that resulted in the same error.
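What I think I need is a way to skip the distributed machinery entirely when there is only one GPU. Something like the guard below is what I have in mind. This is my own untested sketch, not the repo's code: the args.distributed flag is something I would add myself, and I switched the backend to gloo because, as far as I know, the Windows builds of PyTorch don't ship with NCCL.

import torch
import torch.distributed

def setup_device(args):
    # Hypothetical helper; args.distributed is a flag I would add myself.
    if args.distributed:
        # gloo instead of nccl, since the Windows wheels don't include NCCL
        torch.distributed.init_process_group(backend="gloo")
        args.world_size = torch.distributed.get_world_size()
        args.rank = torch.distributed.get_rank()
        torch.cuda.set_device(args.local_rank)
    else:
        # Single process, single GPU: behave as rank 0 of a world of size 1.
        args.world_size = 1
        args.rank = 0
        torch.cuda.set_device(0)

Is something along these lines the right way to run this model's inference on a single GPU, or do I still need torch.distributed.launch even with one device?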