
According to Nvidia's official documentation, if a CUDA application is built to include PTX, it is forward-compatible: the PTX can run on any GPU with a compute capability higher than the compute capability the PTX was generated for. So I tried to find out whether torch-1.7.0+cu101 is compiled into a binary that includes PTX, and it appears that pytorch is in fact compiled with the nvcc flag "-gencode=arch=compute_xx,code=sm_xx" (see pytorch's CMakeLists.txt). I thought this flag meant that the compiled product contains PTX. However, when I try to use pytorch1.7 with cuda10.1 on an A100, I always get an error.

>>> import torch
>>> torch.zeros(1).cuda()
/data/miniconda3/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/miniconda3/lib/python3.7/site-packages/torch/tensor.py", line 179, in __repr__
  return torch._tensor_str._str(self)
File "/data/miniconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 372, in _str
return _str_intern(self)
File "/data/miniconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 352, in _str_intern
  tensor_str = _tensor_str(self, indent)
File "/data/miniconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 241, in _tensor_str
  formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/data/miniconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 89, in __init__
  nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: CUDA error: no kernel image is available for execution on the device
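My understanding of the image-selection rule behind this warning can be sketched as follows (a simplified model, not the actual CUDA runtime code; the function name and the integer encoding of compute capabilities, e.g. 75 for sm_75, are my own):

```python
# Sketch of how a CUDA fat binary is matched to a GPU at load time.
# A fat binary carries SASS (native machine code) for specific architectures,
# and optionally PTX that the driver can JIT-compile for newer architectures.

def can_run(device_cc, sass_archs, ptx_archs):
    """Return True if a GPU with compute capability `device_cc` can execute
    a binary embedding SASS for `sass_archs` and PTX for `ptx_archs`."""
    # SASS is binary-compatible only within a major architecture:
    # sm_70 code runs on sm_75, but not on sm_80.
    if any(a // 10 == device_cc // 10 and a <= device_cc for a in sass_archs):
        return True
    # PTX is forward-compatible: the driver JIT-compiles it for any GPU whose
    # capability is >= the PTX virtual architecture.
    return any(a <= device_cc for a in ptx_archs)

# The cu101 pytorch wheel ships SASS for these architectures, and no PTX:
wheel_sass = [37, 50, 60, 70, 75]

print(can_run(75, wheel_sass, []))    # True: native sm_75 code on a Turing GPU
print(can_run(80, wheel_sass, []))    # False: A100, no sm_80 SASS and no PTX
print(can_run(80, wheel_sass, [75]))  # True: compute_75 PTX would JIT for sm_80
```

The last line shows why embedded PTX would have saved the A100 case, and why its absence produces "no kernel image is available".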

So I really want to know why the "PTX compatibility principle" does not apply to pytorch. Other answers only say to use cuda11 or higher, and I know that works, but they don't give the real reason why pytorch built for cuda10.1 does not work on the A100. I tried the cuda10.1 samples from the toolkit, and these small demo applications actually do work:

[Matrix Multiply Using CUDA] - Starting...
MapSMtoCores for SM 8.0 is undefined.  Default to use 64 Cores/SM
GPU Device 0: "A100-SXM4-40GB" with compute capability 8.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4286.91 GFlop/s, Time= 0.031 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

If anyone could help me with an answer I would be very grateful
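One way to check directly whether PTX is embedded in a given binary is cuobjdump from the CUDA toolkit. A sketch, assuming the wheel is installed under the miniconda path from the traceback above (adjust the path to your environment):

```shell
# Hypothetical paths; cuobjdump ships with the CUDA toolkit.
LIB=/data/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so
if command -v cuobjdump >/dev/null 2>&1; then
    # Embedded SASS images: for the cu101 wheel, expect sm_37 ... sm_75
    cuobjdump --list-elf "$LIB"
    # Embedded PTX images: empty output here means no PTX was shipped
    cuobjdump --list-ptx "$LIB"
else
    echo "cuobjdump not found: install the CUDA toolkit first"
fi
```

The same two commands run against a compiled CUDA sample binary should show PTX entries, which would explain why the samples work on the A100 while the wheel does not.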

  • The PTX compatibility mechanism doesn't apply to PyTorch because the PyTorch developers have chosen for it to be that way. They build, package and distribute PyTorch without PTX embedded in the library. Why they do that is a question to ask them – talonmies Mar 03 '22 at 08:45
  • Thanks, but when I check the pytorch setup.py file, there is a flag TORCH_CUDA_ARCH_LIST, which can specify which classes of NVIDIA hardware to generate PTX for: https://github.com/pytorch/pytorch/blob/1.7/setup.py. I tried to compile pytorch from source with TORCH_CUDA_ARCH_LIST set, but it doesn't seem to work @talonmies – Seven link bob Mar 03 '22 at 09:03
  • That is not what that does. Their binary builds are stripped of PTX at the linking stage. You can choose to believe it or not. – talonmies Mar 03 '22 at 09:03
  • Thank you ~~ that seems to make sense, I believe you ~~ @talonmies – Seven link bob Mar 03 '22 at 09:32
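The distinction behind this exchange can be sketched in a simplified model of how TORCH_CUDA_ARCH_LIST entries map to nvcc flags (this is an illustration, not pytorch's actual build code; the separator and exact flag spelling are assumptions): an entry like "7.5" produces only SASS (code=sm_75), and PTX is embedded only when the "+PTX" suffix is present (code=compute_75). Setting the list without "+PTX" therefore still yields a binary with no PTX.

```python
def gencode_flags(arch_list):
    """Simplified model of translating a TORCH_CUDA_ARCH_LIST string into
    nvcc -gencode flags. '7.5' -> SASS only; '7.5+PTX' -> SASS plus PTX."""
    flags = []
    for entry in arch_list.split(";"):
        wants_ptx = entry.endswith("+PTX")
        num = entry.replace("+PTX", "").replace(".", "")
        # code=sm_xx embeds SASS for that exact architecture only
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if wants_ptx:
            # code=compute_xx embeds the PTX itself, enabling forward JIT
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags

print(gencode_flags("7.0;7.5"))
print(gencode_flags("7.5+PTX"))
```

In this model, a build configured as "3.7;5.0;6.0;7.0;7.5" (no "+PTX" anywhere) emits only code=sm_xx images, matching the architecture list in the warning above.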

1 Answer


After @talonmies' reminder, I also posted the same question on discuss.pytorch.org.

The answer is that pytorch1.7 uses cuDNN 7, which is not compatible with the A100: cuDNN 7.6.5 is not supported on the Nvidia Ampere architecture. The only cuDNN versions supported on Ampere are cuDNN 8 and higher.

  • No, this is completely incorrect for two reasons: (a) CUDNN has PTX embedded code and will run on newer hardware (albeit after a lengthy JIT pass) and (b) it is plain in the stack trace that the error is happening within the basic tensor engine in libtorch, which has *nothing* to do with CUDNN. As I have said several times, Pytorch is distributed as a binary with limited architecture support and no embedded PTX from the CUDA code within their codebase, by choice. – talonmies Mar 05 '22 at 02:43
  • @talonmies It really confuses me: if cudnn has PTX embedded code, why does the nvidia official documentation (https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-801-preview/cudnn-support-matrix/index.html) say cudnn7 is not supported on Ampere? – Seven link bob Mar 05 '22 at 15:44
  • The answer provided by a pytorch developer on [pytorch discuss](https://discuss.pytorch.org/t/why-pytorch-1-7-with-cuda10-1-cannot-compatible-with-nvidia-a100-ampere-architecture-according-to-ptx-compatibilty-pricinple/145486/4.) is the complete opposite of the answer provided here on stackoverflow. – Seven link bob Mar 07 '22 at 02:07