
I have installed TensorFlow 2.12, CUDA Toolkit 11.8.0, and cuDNN 8.6.0.163 on Windows 10 with WSL2 (Ubuntu 22.04 kernel), in a Miniconda environment (Python 3.9.16), following the official tensorflow.org instructions. I should emphasize that I want TensorFlow 2.12 because, together with the corresponding CUDA Toolkit 11.8.0, it supports Ada Lovelace GPUs (an RTX 4080 in my case).

When I go to train my model, it gives me the following error:

"Loaded cuDNN version 8600 Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so : cannot open shared object file: No such file or directory".

Any idea what is going wrong*?

The paths were configured as follows:

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
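The `$(dirname $(python -c …))` pattern above resolves a pip package's install directory at activation time. A self-contained illustration of the same trick, with the stdlib module `json` standing in for `nvidia.cudnn` (and `python3` for the env's `python`):

```shell
# Resolve a package's directory from its __file__ attribute; this is the
# same mechanism the activation script uses to locate the cuDNN libraries.
PKG_PATH=$(dirname "$(python3 -c 'import json; print(json.__file__)')")
echo "$PKG_PATH"   # for nvidia.cudnn, this directory (plus /lib) goes on LD_LIBRARY_PATH
```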

The files referring to my error were searched for using the following commands:

  • ldconfig -p | grep libcudnn_cnn — returned nothing, so the file is not in the linker cache, and
  • ldconfig -p | grep libcuda — returned libcuda.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libcuda.so.1

Also, I have tried setting a new environment variable and adding it to $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh, but without any luck:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH

*Note that when importing TensorFlow, I get the following warnings:

  • TF-TRT Warning: Could not find TensorRT
  • could not open file to read NUMA node:
        /sys/bus/pci/devices/0000:1c:00.0/numa_node Your kernel may have been built without NUMA support.
    

In addition, I attempted to follow the NVIDIA Documentation for WSL, specifically Section 3 -> Option 1, but this did not solve the problem.

2 Answers


Ran into this problem and found a working solution after a lot of digging around.

First, the missing libcuda.so can be solved by the method proposed here: https://github.com/microsoft/WSL/issues/5663#issuecomment-1068499676

Essentially rebuilding the symbolic links in the CUDA lib directory:

> cd \Windows\System32\lxss\lib
> del libcuda.so
> del libcuda.so.1
> mklink libcuda.so libcuda.so.1.1
> mklink libcuda.so.1 libcuda.so.1.1

(this is done in an admin elevated Command Prompt shell)
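What those mklink commands create, demonstrated with POSIX ln -s in a scratch directory (a sketch only: the real links must be made from an elevated Windows Command Prompt, since /usr/lib/wsl/lib is not writable from inside WSL):

```shell
# Rebuild the standard shared-library link chain: both the linker name
# (libcuda.so) and the SONAME link (libcuda.so.1) point at the real file.
tmp=$(mktemp -d)
touch "$tmp/libcuda.so.1.1"                # stands in for the real driver library
ln -s libcuda.so.1.1 "$tmp/libcuda.so.1"   # SONAME link, used at load time
ln -s libcuda.so.1.1 "$tmp/libcuda.so"     # linker name, used by dlopen/ld
ls -l "$tmp"
```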

Then when you run into the missing device problem (which you undoubtedly will), solve it by: https://github.com/tensorflow/tensorflow/issues/58681#issuecomment-1406967453

Which boils down to:

$ mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice/
$ cp -p $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/
$ export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib

And

$ conda install -c nvidia cuda-nvcc --yes

(verify by ptxas --version)
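The libdevice step boils down to reproducing the directory layout XLA expects under --xla_gpu_cuda_data_dir, namely <data_dir>/nvvm/libdevice/libdevice.10.bc. A self-contained sketch of that layout in a scratch directory (the file here is an empty stand-in for the real bitcode):

```shell
# Recreate the nvvm/libdevice layout that XLA searches for at JIT time.
prefix=$(mktemp -d)
touch "$prefix/libdevice.10.bc"            # stand-in for the real libdevice bitcode
mkdir -p "$prefix/nvvm/libdevice"
cp -p "$prefix/libdevice.10.bc" "$prefix/nvvm/libdevice/"
export XLA_FLAGS="--xla_gpu_cuda_data_dir=$prefix"
```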

If you're running notebooks in VSCode remote WSL, then you'd need to add export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib to $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh (this is good practice anyway).
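That append can reuse the same echo pattern from the question (a sketch; run it inside the activated env so $CONDA_PREFIX is set, and note the single quotes keep the variable unexpanded until activation):

```shell
echo 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib' \
  >> "$CONDA_PREFIX/etc/conda/activate.d/env_vars.sh"
```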

Roy Shilkrot
0

Just did the "rebuilding the symbolic links in the CUDA lib directory" part and it works in my case. Should I also go for the part below?

  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/34605963) – juanpethes Jul 02 '23 at 15:47