3

After an upgrade to Ubuntu 20.04 from 18.04 Tensorflow is no longer able to use my gpu because it is attempting to mix and load different versions (some 10 and some 11). It is a System76 machine, and I have cuda 10.1 installed from System76 (so it works with the System76 nvidia driver). When running tensorflow the following errors occur:

2021-01-07 18:12:22.584886: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:22.584906: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-07 18:12:23.640665: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 18:12:23.641412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-07 18:12:23.669966: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-07 18:12:23.670257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 computeCapability: 6.1
coreClock: 1.733GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s
2021-01-07 18:12:23.670328: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670379: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.671387: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-07 18:12:23.671667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-07 18:12:23.673022: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-07 18:12:23.673100: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.673245: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-07 18:12:23.673259: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.

Notice all the warnings are for attempting to load version 11 of Cuda but it's only for some of the libraries. The version 10 ones load fine.

This is the output of nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

This is the output of nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P0    26W /  N/A |    585MiB /  6069MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2999      G   /usr/lib/xorg/Xorg                101MiB |
|    0   N/A  N/A      3479      G   /usr/lib/xorg/Xorg                255MiB |
|    0   N/A  N/A      3720      G   /usr/bin/gnome-shell               88MiB |
|    0   N/A  N/A      6487      G   ...AAAAAAAA== --shared-files       45MiB |
|    0   N/A  N/A      6959      G   ...AAAAAAAA== --shared-files       40MiB |
|    0   N/A  N/A     11642      G   ...AAAAAAAA== --shared-files       21MiB |
|    0   N/A  N/A     25206      G   WickrMe                            17MiB |
+-----------------------------------------------------------------------------+

I see that the driver version in the output of nvidia-smi is version 11, but as I understand it, that has nothing to do with cuda runtime. That is simply the version up to which the driver supports. Correct me if I'm wrong.

I have to use version 10 because that is what is supported by System76 and it worked fine prior to the upgrade. I have also tried uninstalling and re-installing Tensorflow via pip3 and no luck.

Does anyone know how get all the libraries in sync to version 10.1? I also tried to manually place the version 11 libraries in place and let Tensorflow use the mixed version (which of course is a bad idea) but it won't recognize them (or I didn't place them properly).

greco.roamin
  • 799
  • 1
  • 6
  • 20
  • 2
    You must use CUDA 11.0 for the current tensorflow release. Your confusion is down to misunderstanding that the contents of CUDA 11.0 are not all versioned as 11.x https://docs.nvidia.com/cuda/archive/11.0/cuda-toolkit-release-notes/index.html#cuda-major-component-versions – talonmies Jan 08 '21 at 00:40
  • @talonmies Thank you. You are correct. Based on your comment, I was able to solve it. – greco.roamin Jan 08 '21 at 15:42

2 Answers2

3

As @talonmies pointed out, I was misunderstanding the versioning system. However, because it's a System76 machine, it was also confounding because System76 uses their own Nvidia driver, and it's not straightforward to install Cuda 11 and Cudnn. I'm posting the answer in case anyone else runs into problems with System76.

First, DO NOT use the System76 install for Cuda and Cudnn. They have their own versions (on their website) so as to be compatible with their Nvidia driver, but they will not work (they are version 10, and TF 2.2+ requires 11). Also, most general Cuda guides will tell you to uninstall/install the Nvida driver first so as to have a clean install, but DO NOT do this if you have a System76 system. Just leave the System76 driver alone. Also, if you have any previous Cuda/Cudnn remove/uninstall all of it.

Go to Nvidia and get their latest Cuda and Cudnn. I used

wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run

Run that with

sudo sh cuda_11.0.2_450.51.05_linux.run

When it runs it will tell you that you have a conflict with the driver package. Ignore that and proceed. When you get to the install menu, UNCHECK "install driver" and continue with the install. When it's done, add to your path

/usr/local/cuda-11.0:/usr/local/cuda-11.0/bin:

You need to add both the cuda root and bin, not just bin (which is different than most general instructions). Source your .bashrc or .profile or wherever you put the path addition (or open a new terminal).

Now install Cudnn.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb

Install it with dpkg. For example (in my case)...

sudo dpkg -i libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb

That's it. Once I completed all that, everything worked fine. Hope that helps some System76 people get through Ununtu 20.04 and Cuda 11 a little easier.

ucsky
  • 442
  • 6
  • 13
greco.roamin
  • 799
  • 1
  • 6
  • 20
  • Thank you so much! However, after installing everything, I do not see nvcc. Oh, wait, this is that typical issue where the permissions are locked down for /usr/local/cuda-. Yes, it all works for me. – labyrinth Oct 22 '21 at 19:20
  • Can confirm with System76 machine running pop os 22.04. Installed Cuda 12.2 (cuda_12.2.0_535.54.03_linux.run). Works! ✅ However, did not have to install libcudnn. – sneilan Jul 28 '23 at 03:48
2

Thank you very much. One of the reasons I have used POP OS is that the Nvidia drivers+cuda/cudnn just worked with tensorflow, until this issue with version 11.0 missing.

One thing I needed to be able in install cuda 11.0 using the recipe above was to install gcc versions 8 :

sudo apt -y install gcc-8 g++-8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 8
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 8

I really wish POP!_os would provide CUDA 11.0 packages directly.....