I have a problem where

import torch
print(torch.cuda.is_available())

prints False, so I can't use the available GPU. I've tried this in a conda environment, where I installed the PyTorch version corresponding to my NVIDIA driver, and in a Docker container set up the same way. Both attempts were on a remote server, and both failed. I believe the driver versions are correct because I checked the version with nvcc --version before installing PyTorch, and I checked the GPU connection with nvidia-smi, which lists the GPUs on the machine correctly.
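The two command-line checks mentioned above report different things: nvidia-smi shows the installed driver version, while nvcc --version shows the version of the local CUDA toolkit. A minimal sketch for collecting both in one place follows. The helper names `run_tool` and `system_versions` are hypothetical, and the script assumes both tools are on PATH (it reports None otherwise).

```python
import shutil
import subprocess


def run_tool(command):
    """Return the stdout of a command, or None if the tool is missing or fails."""
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def system_versions():
    """Collect the driver version (nvidia-smi) and the toolkit version (nvcc)."""
    return {
        # Driver version, one line per GPU.
        "driver": run_tool(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
        # Version banner of the locally installed CUDA toolkit, not of the driver.
        "nvcc": run_tool(["nvcc", "--version"]),
    }


if __name__ == "__main__":
    for name, value in system_versions().items():
        print(f"{name}: {value if value is not None else 'not found'}")
```

Running this inside both the conda environment and the container makes it easier to see whether the two setups actually share the same driver.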

Also, I've checked this post and tried exporting CUDA_VISIBLE_DEVICES, but had no luck.

The server has NVIDIA V100 GPUs, with CUDA 10.0 in the conda environment and CUDA 10.2 in a Docker container I built. Any help or push in the right direction would be greatly appreciated. Thanks!

WannabeArchitect
  • Which version of PyTorch did you try to use? What is your `nvidia-smi` output? – Berriel Jul 04 '20 at 13:53
  • @Berriel `nvidia-smi` output is too long to write here. It's basically 8 NVIDIA V100 GPUs, from #0 to #7. The normal stuff I think you would see on other `nvidia-smi` outputs. For the conda environment with CUDA 10.0, it says `torch.__version__` is `1.4.0` and for the docker container with CUDA 10.2, it says `torch.__version__` is `1.5.0a0+8f84ded`... I'm assuming that's `1.5.0` – WannabeArchitect Jul 04 '20 at 14:00
  • The relevant part of the `nvidia-smi` would be the header :) nvidia driver version, basically. If the driver is compatible, it should work. BTW, the cuda version of the docker or your system are kind of irrelevant, because PyTorch is delivered with its own cuda. – Berriel Jul 04 '20 at 14:05
  • @Berriel They both say Driver Version 410.129 and CUDA Version 10.0. Just out of curiosity, if my CUDA version doesn't matter, why do I have to choose which CUDA version I'm using when I get the download links from places like https://pytorch.org/? – WannabeArchitect Jul 04 '20 at 14:07
  • Because it is easier to point out CUDA versions than NVIDIA driver compatibility :) Can you post the `torch.version.cuda` of both your conda and docker envs? With this driver, only PyTorch with cuda < 10.1 will work. – Berriel Jul 04 '20 at 14:19
  • @Berriel for docker it is 10.2 and for conda it is 10.0... I'm guessing I should try changing the version for docker and see if it works? – WannabeArchitect Jul 04 '20 at 14:33
  • Yes 10.2 won't work with that driver version, but your conda env should be working. Try to install PyTorch for 9.2 just in case. Consider asking a sudo user (if you're not) to update the driver as well. Those V100s can benefit from newer cuda versions. – Berriel Jul 04 '20 at 14:53
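
The driver compatibility point from the comments can be sketched as a small lookup. The minimum Linux driver versions below are assumptions taken from NVIDIA's CUDA toolkit release notes and should be verified against the current table. `MIN_LINUX_DRIVER`, `parse_version`, and `driver_supports` are hypothetical names used only for illustration.

```python
# Minimum Linux driver version per CUDA release.
# ASSUMPTION: values as listed in NVIDIA's CUDA toolkit release notes;
# check the official table before relying on them.
MIN_LINUX_DRIVER = {
    "9.2": (396, 26),
    "10.0": (410, 48),
    "10.1": (418, 39),
    "10.2": (440, 33),
}


def parse_version(text):
    """Turn a dotted version string such as '410.129' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(driver_version, cuda_version):
    """Return True if the driver meets the assumed minimum for this CUDA version."""
    return parse_version(driver_version) >= MIN_LINUX_DRIVER[cuda_version]


if __name__ == "__main__":
    # Driver version reported in the comments above.
    for cuda in MIN_LINUX_DRIVER:
        print(cuda, driver_supports("410.129", cuda))
```

Under these assumed minimums, driver 410.129 passes for CUDA 9.2 and 10.0 but not for 10.1 or 10.2, which matches the advice given in the comments.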

1 Answer

For anyone else having this problem, it turned out that my server manager had not updated the drivers on the server.

I switched to a different server, installed Anaconda, and things started working as they should, i.e., torch.cuda.is_available() returns True after setting up a fresh environment.
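
For checking a fresh environment like this, a short report of what the installed PyTorch build sees can help. This is only a sketch: `torch_cuda_report` is a hypothetical helper, and it returns a reduced result if PyTorch is not installed in the environment.

```python
import importlib.util


def torch_cuda_report():
    """Summarize what the installed PyTorch build sees; also works without torch."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False while `built_for_cuda` is newer than what the driver supports, an outdated driver is a likely cause, as it was here.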

WannabeArchitect