43

I was trying to use my current code with an A100 gpu but I get this error:

---> backend='nccl'
/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning: 
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

This is rather confusing, because it points to the usual PyTorch installation page but doesn't tell me which combination of PyTorch version and CUDA version to use for my specific hardware (A100). What is the right way to install PyTorch for an A100?
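One way to see the mismatch from the warning programmatically is to compare the GPU's compute capability with the architectures the installed build was compiled for. The sketch below is only an illustration, not PyTorch's own check: `is_arch_supported` and `check_local_install` are hypothetical helper names, and it assumes the usual CUDA rule that a binary built for sm_XY runs on devices with the same major version and an equal or higher minor version (PTX entries such as compute_37 are ignored).

```python
import re
from typing import Iterable, Optional, Tuple


def is_arch_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if a binary built for one of `arch_list` can run natively on the device.

    Assumption: a binary for sm_XY runs on devices with major version X and
    minor version >= Y. Entries that are not of the form sm_XY are skipped.
    """
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d{2,})", arch)
        if match is None:
            continue  # skip compute_XY (PTX) and anything unrecognised
        digits = match.group(1)
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def check_local_install() -> Optional[bool]:
    """Apply the same check to the local PyTorch install; None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return is_arch_supported(torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())


# Architectures listed in the warning above, checked against the A100 (sm_80).
warned_archs = ["sm_37", "sm_50", "sm_60", "sm_61", "sm_70", "sm_75", "compute_37"]
print(is_arch_supported((8, 0), warned_archs))  # False
```

Calling `check_local_install()` inside the affected environment should return `True` once a build that includes sm_80 is installed.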


These are some versions I've tried:

# conda install -y pytorch==1.8.0 torchvision cudatoolkit=10.2 -c pytorch
# conda install -y pytorch torchvision cudatoolkit=10.2 -c pytorch
# conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=10.2 -c pytorch -c conda-forge
# conda install -y pytorch==1.6.0 torchvision cudatoolkit=10.2 -c pytorch
# conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge

# conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
# conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
# conda install -y pytorch torchvision cudatoolkit=9.2 -c pytorch # For Nano, CC
# conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge

Note that this can be subtle, because I've had this error with this machine and PyTorch version in the past:

How to solve the famous `unhandled cuda error, NCCL version 2.7.8` error?


Bonus 1:

I still have errors:

ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1423, in <module>
    main()
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1365, in main
    train(args=args)
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1385, in train
    args.opt = move_opt_to_cherry_opt_and_sync_params(args) if is_running_parallel(args.rank) else args.opt
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/distributed.py", line 456, in move_opt_to_cherry_opt_and_sync_params
    args.opt = cherry.optim.Distributed(args.model.parameters(), opt=args.opt, sync=syn)
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 62, in __init__
    self.sync_parameters()
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 78, in sync_parameters
    dist.broadcast(p.data, src=root)
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8

One of the answers suggested that the nvcc version and torch.version.cuda should match, but they do not:

(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)"

11.1
(meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

How do I match them? Is this the error? Can someone show their pip, conda and nvcc versions so I can see what setup works?
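To put the two numbers side by side, here is a small sketch that pulls the release out of the `nvcc -V` text shown above and compares it with the value printed by `torch.version.cuda`. `parse_nvcc_release` is a hypothetical helper name. Keep in mind that the prebuilt pip and conda packages ship their own CUDA runtime libraries, so a mismatch with the system nvcc is not necessarily the cause of an error; it generally matters when compiling CUDA extensions.

```python
import re
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from the text printed by `nvcc -V`."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Output copied from the terminal session above.
nvcc_text = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0"""

toolkit_version = parse_nvcc_release(nvcc_text)
torch_cuda_version = "11.1"  # value printed by torch.version.cuda above

print(f"nvcc toolkit: {toolkit_version}, torch build: {torch_cuda_version}")
print("match" if toolkit_version == torch_cuda_version else "mismatch")
```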

More error messages:

hal-dgx:21797:21797 [0] NCCL INFO Bootstrap : Using [0]enp226s0:141.142.153.83<0> [1]virbr0:192.168.122.1<0>
hal-dgx:21797:21797 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hal-dgx:21797:21797 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB ; OOB enp226s0:141.142.153.83<0>
hal-dgx:21797:21797 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda11.1
hal-dgx:21805:21805 [2] NCCL INFO Bootstrap : Using [0]enp226s0:141.142.153.83<0> [1]virbr0:192.168.122.1<0>
hal-dgx:21799:21799 [1] NCCL INFO Bootstrap : Using [0]enp226s0:141.142.153.83<0> [1]virbr0:192.168.122.1<0>
hal-dgx:21805:21805 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hal-dgx:21799:21799 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hal-dgx:21811:21811 [3] NCCL INFO Bootstrap : Using [0]enp226s0:141.142.153.83<0> [1]virbr0:192.168.122.1<0>
hal-dgx:21811:21811 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hal-dgx:21811:21811 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB ; OOB enp226s0:141.142.153.83<0>
hal-dgx:21811:21811 [3] NCCL INFO Using network IB
hal-dgx:21799:21799 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB ; OOB enp226s0:141.142.153.83<0>
hal-dgx:21805:21805 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB ; OOB enp226s0:141.142.153.83<0>
hal-dgx:21799:21799 [1] NCCL INFO Using network IB
hal-dgx:21805:21805 [2] NCCL INFO Using network IB

hal-dgx:21797:27906 [0] misc/ibvwrap.cc:280 NCCL WARN Call to ibv_create_qp failed
hal-dgx:21797:27906 [0] NCCL INFO transport/net_ib.cc:360 -> 2
hal-dgx:21797:27906 [0] NCCL INFO transport/net_ib.cc:437 -> 2
hal-dgx:21797:27906 [0] NCCL INFO include/net.h:21 -> 2
hal-dgx:21797:27906 [0] NCCL INFO include/net.h:51 -> 2
hal-dgx:21797:27906 [0] NCCL INFO init.cc:300 -> 2
hal-dgx:21797:27906 [0] NCCL INFO init.cc:566 -> 2
hal-dgx:21797:27906 [0] NCCL INFO init.cc:840 -> 2
hal-dgx:21797:27906 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

hal-dgx:21811:27929 [3] misc/ibvwrap.cc:280 NCCL WARN Call to ibv_create_qp failed
hal-dgx:21811:27929 [3] NCCL INFO transport/net_ib.cc:360 -> 2
hal-dgx:21811:27929 [3] NCCL INFO transport/net_ib.cc:437 -> 2
hal-dgx:21811:27929 [3] NCCL INFO include/net.h:21 -> 2
hal-dgx:21811:27929 [3] NCCL INFO include/net.h:51 -> 2
hal-dgx:21811:27929 [3] NCCL INFO init.cc:300 -> 2
hal-dgx:21811:27929 [3] NCCL INFO init.cc:566 -> 2
hal-dgx:21811:27929 [3] NCCL INFO init.cc:840 -> 2
hal-dgx:21811:27929 [3] NCCL INFO group.cc:73 -> 2 [Async thread]

I got these after adding:

import os
os.environ["NCCL_DEBUG"] = "INFO"
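The environment variables need to be set before the process group is initialised. The sketch below wraps that in a hypothetical helper, `configure_nccl_env`. Since the log shows `ibv_create_qp failed` on the InfiniBand transport, it also offers an optional switch for `NCCL_IB_DISABLE`, a documented NCCL variable that makes NCCL avoid InfiniBand and use sockets instead. Whether that resolves the failure on this particular machine is an assumption, not something confirmed in this thread.

```python
import os
from typing import Dict


def configure_nccl_env(disable_infiniband: bool = False) -> Dict[str, str]:
    """Set NCCL environment variables; call before torch.distributed.init_process_group."""
    os.environ["NCCL_DEBUG"] = "INFO"  # verbose NCCL logging, as in the question
    if disable_infiniband:
        # Assumption: falling back from InfiniBand to sockets may avoid the ibv_create_qp failure.
        os.environ["NCCL_IB_DISABLE"] = "1"
    # Return the NCCL-related settings so they can be logged alongside the run.
    return {key: value for key, value in os.environ.items() if key.startswith("NCCL_")}


print(configure_nccl_env(disable_infiniband=True))
```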
Charlie Parker
  • 2
    Pytorch 1.7.0 or later with CUDA 11.0 or later [should work](https://www.marktechpost.com/2020/11/01/pytorch-releases-version-1-7-with-new-features-like-cuda-11-new-apis-for-ffts-and-nvidia-a100-generation-gpus-support/). Or you could use [NGC](https://ngc.nvidia.com) – Robert Crovella Apr 07 '21 at 19:16
  • 1
    @RobertCrovella if what you say is true, then the command needed is `conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch -c conda-forge`. I will try it soon and see whether it works. – Charlie Parker Apr 24 '21 at 01:38
  • @CharlieParker only 14 minutes until the bounty expires. Were none of these answers helpful? – Sadra May 19 '22 at 20:40

6 Answers

46

Following the PyTorch site linked in @SimonB's answer, I ran:

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

This solved the problem for me.
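To confirm after installation that the wheel with the intended CUDA tag is the one actually imported, one could inspect `torch.__version__`, which carries a local suffix such as `+cu111` for wheels installed this way. A minimal sketch, where `split_torch_version` is a hypothetical helper name:

```python
from typing import Optional, Tuple


def split_torch_version(version: str) -> Tuple[str, Optional[str]]:
    """Split '1.9.0+cu111' into ('1.9.0', 'cu111'); the tag is None when absent."""
    release, _, local_tag = version.partition("+")
    return release, (local_tag or None)


print(split_torch_version("1.9.0+cu111"))  # ('1.9.0', 'cu111')
print(split_torch_version("1.11.0"))       # ('1.11.0', None)
```

In the environment itself this would be called as `split_torch_version(torch.__version__)`.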

James Hirschorn
  • 1
    For me, the conda installation did not work but the pip installation did; no idea why. – Woma Aug 25 '21 at 11:32
  • Somehow this does not work for me right now; it seems the download link fails for some reason: "returned a non-zero code: 137" – Tian Jan 28 '22 at 01:53
  • Are you sure this is right? My PyTorch wants 11.1 but nvcc is 11.0, see: `(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)" 11.1 (meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2020 NVIDIA Corporation Built on Wed_Jul_22_19:09:09_PDT_2020 Cuda compilation tools, release 11.0, V11.0.221 Build cuda_11.0_bu.TC445_37.28845127_0` – Charlie Parker May 12 '22 at 20:43
  • @CharlieParker It's been too long for me to recall the context for this question. But `nvidia-smi` reveals that I have CUDA 11.4 and nvcc 10.1 – James Hirschorn May 16 '22 at 16:26
  • 1
    ERROR: No matching distribution found for torch==1.9.0+cu111 – PascalIv May 15 '23 at 09:12
8

I've got an A100 and have had success with

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

This is now also recommended on the PyTorch site.

Simon B
  • Are you sure this is right? My PyTorch wants 11.1 but nvcc is 11.0, see: `(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)" 11.1 (meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2020 NVIDIA Corporation Built on Wed_Jul_22_19:09:09_PDT_2020 Cuda compilation tools, release 11.0, V11.0.221 Build cuda_11.0_bu.TC445_37.28845127_0` – Charlie Parker May 12 '22 at 20:43
4

This is what worked for me:

conda update conda
pip install --upgrade pip
pip3 install --upgrade pip

conda create -n meta_learning_a100 python=3.9
conda activate meta_learning_a100

pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Then I tested it: I asked for the device name and did a matrix multiply. No errors means it worked:

(meta_learning_a100) [miranda9@hal-dgx diversity-for-predictive-success-of-meta-learning]$ python -c "import uutils; uutils.torch_uu.gpu_test()"
device name: A100-SXM4-40GB
Success, no Cuda errors means it worked see:
out=tensor([[ 0.5877],
        [-3.0269]], device='cuda:0')

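The PyTorch GPU test code:

def gpu_test():
    """
    python -c "import uutils; uutils.torch_uu.gpu_test()"
    """
    import torch  # added: the original snippet relied on a module-level import
    from torch import Tensor

    # device_name() is a uutils helper not shown here; torch.cuda.get_device_name() stands in for it.
    print(f'device name: {torch.cuda.get_device_name()}')
    # Multiply two small random matrices on the GPU.
    x: Tensor = torch.randn(2, 4).cuda()
    y: Tensor = torch.randn(4, 1).cuda()
    out: Tensor = (x @ y)
    assert out.size() == torch.Size([2, 1])
    print(f'Success, no Cuda errors means it worked see:\n{out=}')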
Charlie Parker
  • Thanks! This solution worked for me. I was trying to run pytorch inside docker. I uninstalled all the cuda libraries and pre-installed torch before running this command. – harsh kumar Chourasia Aug 19 '22 at 05:48
3

I had the same problem. You need to install CUDA 11.0 instead of 10.2 and reinstall PyTorch for this CUDA version.
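The reason 10.2 cannot work is that each GPU generation needs a minimum CUDA toolkit release, and the A100 (compute capability 8.0) was first supported in CUDA 11.0. A rough sketch of that rule follows; the table is deliberately partial and `toolkit_can_target` is a hypothetical helper name, so treat it as an illustration rather than a complete reference.

```python
from typing import Dict, Tuple

# Minimum CUDA toolkit release that can target each compute capability.
# Only a few entries are listed; extend from NVIDIA's documentation as needed.
MIN_CUDA_FOR_CAPABILITY: Dict[Tuple[int, int], Tuple[int, int]] = {
    (7, 0): (9, 0),   # Volta, e.g. V100
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 1),  # Ampere, sm_86
}


def toolkit_can_target(cuda_version: str, capability: Tuple[int, int]) -> bool:
    """Return True if the given CUDA release (e.g. '11.0') can target the capability."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= MIN_CUDA_FOR_CAPABILITY[capability]


print(toolkit_can_target("10.2", (8, 0)))  # False
print(toolkit_can_target("11.0", (8, 0)))  # True
```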

guillaumefrd
  • did you install pytorch 1.8.0 using cuda 11.0 or pytorch 1.7.x? – Charlie Parker Apr 24 '21 at 00:40
  • 1
    I tried 1.8.0 and 1.7.1, both were working. – guillaumefrd Apr 25 '21 at 06:40
  • Are you sure this is right? My PyTorch wants 11.1 but nvcc is 11.0, see: `(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)" 11.1 (meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2020 NVIDIA Corporation Built on Wed_Jul_22_19:09:09_PDT_2020 Cuda compilation tools, release 11.0, V11.0.221 Build cuda_11.0_bu.TC445_37.28845127_0` – Charlie Parker May 12 '22 at 20:44
2

This solution was tested in a multi-GPU A100 environment.

Create a clean conda environment: conda create -n pya100 python=3.9

Then check your nvcc version with: nvcc --version  # mine returns 11.3

Then install PyTorch this way (as of now it installs PyTorch 1.11.0 and torchvision 0.12.0):

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -c nvidia

Now python -c "import torch;print(torch.version.cuda)" returns 11.3 (though I don't think it matters that much).
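On a multi-GPU machine it can be worth checking every visible device rather than only the default one. The following is a minimal sketch under that assumption; `smoke_test_all_gpus` is a hypothetical helper name, and it simply returns an empty list when PyTorch is missing or no GPU is visible.

```python
from typing import List


def smoke_test_all_gpus() -> List[str]:
    """Run a tiny matmul on every visible GPU and return the device names.

    Returns an empty list when PyTorch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    names = []
    for index in range(torch.cuda.device_count()):
        device = torch.device(f"cuda:{index}")
        x = torch.randn(2, 4, device=device)
        y = torch.randn(4, 1, device=device)
        out = x @ y
        torch.cuda.synchronize(device)  # surface asynchronous CUDA errors here
        assert out.shape == (2, 1)
        names.append(torch.cuda.get_device_name(index))
    return names


print(smoke_test_all_gpus())
```

An incompatible build would be expected to fail at the matmul or the synchronize call instead of returning the device names.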

I shared my environment file here. You can build an environment from it like this (just replace NAMEOFENVIRONMENT with your environment name):

conda env update --name NAMEOFENVIRONMENT --file environment.yml
Sadra
1

Check your installed version of torch, torchvision, torchaudio etc. using

<your virtualenv path>/bin/python -m torch.utils.collect_env

In my case I had this:

[pip3] numpy==1.21.5
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0
[pip3] torchtuples==0.2.2
[pip3] torchvision==0.12.0

Since I was not using torchvision or torchaudio, I just updated my torch version using the suggestion by @JamesHirschorn, selecting the build that matches my torch version from this PyTorch link. In my case the torch version was 1.11.0, so I installed torch==1.11.0+cu113:

pip install torch==1.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html

After the update, the output of <your virtualenv path>/bin/python -m torch.utils.collect_env was:

[pip3] numpy==1.21.5
[pip3] torch==1.11.0+cu113  <---
[pip3] torchaudio==0.11.0
[pip3] torchtuples==0.2
[pip3] torchvision==0.12.0
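The before/after comparison can also be done programmatically. The sketch below parses the `[pip3]` lines from the two outputs above and reports whether the torch entry carries an explicit CUDA tag; `parse_pip_packages` is a hypothetical helper name, and the regular expression only assumes the `[pip3] name==version` layout shown here.

```python
import re
from typing import Dict


def parse_pip_packages(collect_env_output: str) -> Dict[str, str]:
    """Map package name -> version for lines like '[pip3] torch==1.11.0'."""
    packages = {}
    for line in collect_env_output.splitlines():
        match = re.match(r"\[pip3?\]\s+(\S+?)==(\S+)", line.strip())
        if match:
            packages[match.group(1)] = match.group(2)
    return packages


before = """[pip3] numpy==1.21.5
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0"""

after = """[pip3] numpy==1.21.5
[pip3] torch==1.11.0+cu113  <---
[pip3] torchaudio==0.11.0"""

for label, text in (("before", before), ("after", after)):
    torch_version = parse_pip_packages(text)["torch"]
    print(label, torch_version, "explicit CUDA tag:", "+cu" in torch_version)
```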
rhn89
  • 1
    For me, with the message "Tesla K20Xm with CUDA capability sm_35 is not compatible with the current PyTorch installation.", this works: pip install torch==1.2.0+cu92 -f https://download.pytorch.org/whl/torch_stable.html – user956584 May 04 '22 at 18:03