I have built PyTorch 2.0.1 from source, using CUDA 11.7 and cuDNN v8; the driver for the NVIDIA GPU is 515.43.04 (CUDA version 11.7). Although PyTorch seems to build successfully, when I try to run examples downloaded from GitHub I see the following error, which is related to cuDNN:

CUDA available! Training on GPU.
terminate called after throwing an instance of 'c10::Error'
  what():  GET was unable to find an engine to execute this computation
Exception raised from run_single_conv at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:671 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7edfcb24d7 in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f7edfc7c434 in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0xe4314c (0x7f7e9cc3d14c in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xe433eb (0x7f7e9cc3d3eb in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xe27dba (0x7f7e9cc21dba in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x96 (0x7f7e9cc22406 in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x2b16b97 (0x7f7e9e910b97 in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x2b16c50 (0x7f7e9e910c50 in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x23d (0x7f7ec4780ecd in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x1515 (0x7f7ec3adec45 in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2c434c6 (0x7f7ec4b004c6 in /tmp/manospavl/anaconda/envs/pytorch-dev/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2c43547 (0x7f7ec4b00547 in /tmp/manospavl/anaconda/envs/pytorch-dev                                                             

I have also tried the most recent version of PyTorch, 2.1.0, and other examples, but all of them seem to produce the same error. Additionally, I have written two simple examples that do work. I have also checked that cuDNN is present in my setup.

MANOS
  • Does this problem happen on a specific example? Do your "simple examples" reach the Conv_v8 run_single_conv function? It sounds like you should narrow this down to a specific operation (or type of operation) that isn't working. – matt May 16 '23 at 07:50
  • I have now found that installing the requirements (torch and torchvision) with the script included in the Python version of the mnist example changes the PyTorch path. Before installing them the PyTorch path was /tmp/pytorch; after installing them it is /tmp/anaconda/envs/pytorch-dev/lib/python3.9/site-packages. With the first path the C++ version of mnist works; with the second it does not. – MANOS May 16 '23 at 10:37

1 Answer

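The issue was that a second, locally installed PyTorch was present. Installing the requirements for the Python mnist example placed another copy of torch in the conda environment's site-packages (/tmp/anaconda/envs/pytorch-dev/lib/python3.9/site-packages), and that copy was being picked up instead of the one built from source in /tmp/pytorch. With only the source build on the path, the C++ mnist example works.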
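
A quick way to spot this kind of clash is to list every copy of a package that is visible on the Python search path. The following is only a minimal sketch, not part of PyTorch: the helper name find_package_copies is hypothetical, and it assumes each copy is a regular package directory containing an __init__.py (which is true for torch).

```python
import sys
from pathlib import Path


def find_package_copies(name, search_paths=None):
    """Return every directory on the search path holding a copy of package `name`.

    Hypothetical helper for illustration; results keep search-path order.
    """
    if search_paths is None:
        search_paths = sys.path
    copies = []
    for entry in search_paths:
        # An empty entry in sys.path stands for the current working directory.
        candidate = Path(entry or ".") / name
        if (candidate / "__init__.py").is_file():
            resolved = candidate.resolve()
            # Skip duplicates caused by the same directory appearing twice.
            if resolved not in copies:
                copies.append(resolved)
    return copies


if __name__ == "__main__":
    for location in find_package_copies("torch"):
        print(location)
```

If this prints more than one location, several installations are competing. After `import torch`, the value of `torch.__file__` shows which copy Python actually loaded.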

MANOS