
I've downloaded the newest Nsight Compute profiling tool and I want to use it to benchmark TensorFlow applications. The code I'm using is here. It runs perfectly fine when I execute it, and benchmarking it with nvprof ./mnist.py works without any problems. However, when I try to run it with the command sudo ./nv-nsight-cu-cli [path to the file], I get the following error:

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

I suspect that nv-nsight-cu-cli somehow didn't recognize the environment variables at all. Is there any fix for this?

edhu
  • Did you fix that problem? If not, why are you running Nsight with `sudo`? Remember that you need to use `sudo -E` if you want to preserve your environment variables (like LD_LIBRARY_PATH) – Robin Thoni Mar 15 '19 at 11:56
  • @RobinThoni Well, I think the tool somehow does not work on the Tesla P100. I tried the same program on my GTX 1080 and it works perfectly fine. Any idea why it doesn't work on the Tesla P100? Btw, I used the one provided with CUDA 10.1, still no luck – edhu May 01 '19 at 03:08
  • Did you run on the same machine and environment? – Robin Thoni May 01 '19 at 07:16
  • @RobinThoni yes it’s a network file system so everything should stay the same – edhu May 01 '19 at 07:24
  • But it's a different machine? – Robin Thoni May 01 '19 at 07:24
  • @RobinThoni not the same machine unfortunately – edhu May 01 '19 at 08:03

2 Answers


You need to search for differences between the two environments (a comparison sketch follows the list):

  • environment variables
    • LD_LIBRARY_PATH
  • /etc/ld.so.conf
  • /etc/ld.so.conf.d/*
  • cuBLAS
    • Is installation complete/not broken?
    • Is it installed at the same location on both machines?
    • Versions
  • ...
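
As a hedged starting point, here is a minimal sketch for comparing the two machines; file names like env.other.txt are illustrative, not part of any tool:

```bash
# Dump the loader-relevant state on each machine
env | sort > env.txt
echo "$LD_LIBRARY_PATH" | tr ':' '\n' > ld_path.txt
cat /etc/ld.so.conf /etc/ld.so.conf.d/* > ld_conf.txt

# Copy the files over from the other machine (e.g. with scp), then:
diff env.txt env.other.txt
diff ld_path.txt ld_path.other.txt
diff ld_conf.txt ld_conf.other.txt
```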

You can start with `locate libcublas.so` on both machines to see if there's a difference. Alternatively, you can `strace -f -e open` the program to check where it tries to load libcublas.so from.
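
Spelled out, that could look like the sketch below. The profiler invocation and mnist.py target are the placeholders from the question, and tracing openat as well as open is an assumption for newer glibc, which opens libraries via openat():

```bash
# Find every installed copy of libcublas (updatedb refreshes locate's index)
sudo updatedb
locate libcublas.so

# Watch which paths the dynamic loader actually tries at startup
strace -f -e trace=open,openat ./nv-nsight-cu-cli ./mnist.py 2>&1 | grep libcublas
```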

Your error has (for now) nothing to do with GPUs: libcublas.so.9.0 simply cannot be found. Find it, find out why TensorFlow cannot find it, and your problem will be solved.
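
If the library turns out to be installed but simply not on the search path, here is a minimal sketch of two common fixes, assuming libcublas.so.9.0 lives in /usr/local/cuda-9.0/lib64 (a typical but not guaranteed install location; adjust to wherever locate actually found it):

```bash
# Per-session: prepend the CUDA 9.0 library directory to the search path
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# ...and keep it across sudo (plain sudo resets the environment):
sudo -E ./nv-nsight-cu-cli ./mnist.py

# System-wide: register the directory with the dynamic linker instead
echo /usr/local/cuda-9.0/lib64 | sudo tee /etc/ld.so.conf.d/cuda-9-0.conf
sudo ldconfig
```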

Robin Thoni
  • I followed the steps you mentioned. There are indeed some differences in /etc/ld.so.conf.d/*. I would also like to mention that the command failed regardless of whether I ran TensorFlow or not. It failed on every CUDA program, even the simple ones I created on my own without other lib dependencies. To be more specific, I tried the matrixMul program from cuda-samples. I can see the program start executing, but as soon as it reaches a CUDA call it returns error code 11. Any idea what this error represents? – edhu May 01 '19 at 17:33
  • After some careful searching I found the answer here: https://devtalk.nvidia.com/default/topic/1045430/nsight-compute-/partial-profiling/ It appears GP100 is not supported for some reason. But thanks for the reply. – edhu May 01 '19 at 18:17
  • Oh yeah, sorry, I did not even pay attention to what GPU it was, as the original problem was about loading a shared library... – Robin Thoni May 01 '19 at 18:24

It appears that the GP100 is not supported by the tool at the moment. The answer can be found here:

Nsight Compute only supports Pascal (other than GP100) and later GPUs.
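
As a quick sanity check before profiling (a sketch; nvidia-smi ships with the NVIDIA driver), you can list the installed GPUs to see whether one of them is a GP100-based Tesla P100:

```bash
# Lists each GPU by name; a Tesla P100 is the GP100 chip, which
# Nsight Compute does not support per the quote above.
nvidia-smi -L
```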

edhu