
I tried running my pytorch code but got this error:

Using backend: pytorch
/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning: 
A40 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the A40 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "/home/miranda9/ML4Coq/ml4coq-proj-src/embeddings_zoo/tree_nns/main_brando.py", line 305, in <module>
    main_distributed()
  File "/home/miranda9/ML4Coq/ml4coq-proj-src/embeddings_zoo/tree_nns/main_brando.py", line 201, in main_distributed
    mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj-src/embeddings_zoo/tree_nns/main_brando.py", line 210, in train
    setup_process(opts, rank, master_port=opts.master_port, world_size=opts.world_size)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch/distributed.py", line 165, in setup_process
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607369981906/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8

But then that page sends me to the download instructions for my Mac...? which is weird. What versions of PyTorch, CUDA, cuDNN, NCCL and other things do I need for an A40 GPU?

To see the code I ran and my conda env info, see this: https://github.com/pytorch/pytorch/issues/58794
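For reference, this is roughly what I run to see what the current environment actually has installed (just a sketch; these are standard torch calls, but the exact output format differs between PyTorch versions):

import torch

# Report what this PyTorch build was compiled against and what the card reports.
print("torch:", torch.__version__)
print("CUDA used to build torch:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # the A40 reports (8, 6)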


Charlie Parker
    You either need to find a prebuilt PyTorch with support for your Ampere GPU (built against CUDA 11.1 or newer) or you need to build it yourself with the correct support included. Whether any of that is possible right now is purely a function of the state of development of PyTorch and nothing to do with CUDA or your GPU per se. – talonmies May 22 '21 at 02:42
  • NVIDIA's latest [NGC pytorch container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) will run on A40. Furthermore the [container documentation](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) gives a recipe to describe all the versions you asked about. Looks like cc8.6 support went at least as far back as [20.10 containers](https://lambdalabs.com/blog/nvidia-rtx-a6000-benchmarks/) – Robert Crovella May 22 '21 at 14:42
  • note that trying a pip install might do the trick: `pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html`; for some reason conda doesn't always install cleanly on some HPCs. – Charlie Parker Sep 27 '21 at 15:23

2 Answers

My guess is the following:

A40 GPUs have CUDA compute capability sm_86, which is only supported by CUDA >= 11.0. And prebuilt PyTorch binaries with CUDA >= 11.0 only exist for PyTorch >= 1.7.0, I believe.

So do:

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch

or

conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

or

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch
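Whichever of these you pick, you can confirm afterwards that the new binary really includes sm_86 support with something like this (a minimal sketch of my own; torch.cuda.get_arch_list is available in recent PyTorch releases):

import torch

# Compatible if the card's compute capability appears in the list of
# architectures this PyTorch build was compiled for and a CUDA op runs cleanly.
print("compute capability:", torch.cuda.get_device_capability(0))  # expect (8, 6) on an A40
print("compiled arch list:", torch.cuda.get_arch_list())           # should contain 'sm_86'
x = torch.randn(2, 3, device="cuda")
w = torch.randn(3, 1, device="cuda")
print("out =", x @ w)  # a "no kernel image is available" error here means the build still lacks sm_86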

if you are in an HPC you might want to do:

module load gcc/9.2.0

#module load cuda-toolkit/10.2
module load cuda-toolkit/11.1

this seemed to work:

(metalearning) miranda9~/automl-meta-learning $ python -c "import uutils; uutils.torch_uu.gpu_test()"


device name: A40

Success, no Cuda errors means it worked see:
out=tensor([[2.3272],
        [5.6796]], device='cuda:0')
(metalearning) miranda9~/automl-meta-learning $ conda list | grep torch
_pytorch_select           0.1                       cpu_0  
pytorch                   1.7.1           py3.9_cuda11.0.221_cudnn8.0.5_0    pytorch
torchaudio                0.7.2                      py39    pytorch
torchmeta                 1.7.0                    pypi_0    pypi
torchvision               0.8.2           cpu_py39ha229d99_0 
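Since the original crash happened inside dist.init_process_group with the NCCL backend, it may also be worth running a tiny distributed smoke test once the new build is in place. This is just my own minimal sketch (single node, at least one visible GPU; the MASTER_ADDR/MASTER_PORT values are arbitrary), not code from the question:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int):
    # Arbitrary single-node rendezvous settings (my assumptions, not from the question).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # default op is SUM, so every rank should print world_size
    print(f"rank {rank}: all_reduce result = {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # assumes at least one visible GPU
    mp.spawn(run, args=(world_size,), nprocs=world_size)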
Charlie Parker
  • perhaps worth trying with torchtext `conda install -y pytorch==1.9 torchvision torchaudio torchtext cudatoolkit=11.0 -c pytorch -c nvidia` – Charlie Parker Sep 17 '21 at 22:03
  • note: I couldn't get the torchtext command to work unfortunately. Details: https://stackoverflow.com/questions/69229975/how-does-one-install-torchtext-with-cuda-11-0-and-pytorch-1-9 – Charlie Parker Sep 17 '21 at 22:32
  • Have you tried `conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge`, as suggested as a possible solution in https://github.com/pytorch/pytorch/issues/45028#issuecomment-880321190 ? – Sys.Overdrive Sep 24 '21 at 11:19

I encountered the same problem and resolved it by making sure the CUDA and PyTorch versions were compatible: I found my CUDA version and then used https://pytorch.org/get-started/locally/ to find the matching PyTorch build, which I installed with conda. With this GPU you have to use PyTorch 1.7.0 or newer.

conda install pytorch==1.7.0 torchvision torchaudio cudatoolkit=11.0 -c pytorch
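To find the CUDA version your driver supports before picking a build from that page, something along these lines works (a sketch; it only echoes the banner that nvidia-smi prints, so nvidia-smi must be on the PATH):

import subprocess

# The nvidia-smi banner reports the driver version and the highest CUDA version
# that driver supports; match that against the selector on pytorch.org.
banner = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print("\n".join(banner.splitlines()[:3]))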
Anonymous