
I've downloaded the newest Nsight Compute profiling tool and I want to use it to benchmark TensorFlow applications. The code I'm using is here. It runs perfectly fine when I execute it, and benchmarking it with nvprof ./mnist.py works without any problems. However, when I try to run it with the command sudo ./nv-nsight-cu-cli [path to the file], I get the following error:

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

I suspect that nv-nsight-cu-cli somehow didn't recognize the environment variables at all. Is there any fix for this?

edhu
  • Did you fix that problem? If not, why are you running Nsight with `sudo`? Remember that you need to use `sudo -E` if you want to preserve your environment variables (like LD_LIBRARY_PATH) – Robin Thoni Mar 15 '19 at 11:56
  • @RobinThoni Well, I think the tool somehow does not work on the Tesla P100. I tried the same program on my GTX 1080 and it works perfectly fine. Any idea why it doesn't work on the Tesla P100? Btw, I used the one provided with CUDA 10.1, still no luck – edhu May 01 '19 at 03:08
  • Did you run on the same machine and environment? – Robin Thoni May 01 '19 at 07:16
  • @RobinThoni yes it’s a network file system so everything should stay the same – edhu May 01 '19 at 07:24
  • But it's a different machine? – Robin Thoni May 01 '19 at 07:24
  • @RobinThoni not the same machine unfortunately – edhu May 01 '19 at 08:03

2 Answers


You need to search for differences between the two environments (a comparison sketch follows the list):

  • environment variables
    • LD_LIBRARY_PATH
  • /etc/ld.so.conf
  • /etc/ld.so.conf.d/*
  • cuBLAS
    • Is installation complete/not broken?
    • Is it installed at the same location on both machines?
    • Versions
  • ...
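
As a hedged starting point, here is a minimal sketch for comparing the two machines; file names like env.other.txt are illustrative, not part of any tool:

```bash
# Dump the loader-relevant state on each machine
env | sort > env.txt
echo "$LD_LIBRARY_PATH" | tr ':' '\n' > ld_path.txt
cat /etc/ld.so.conf /etc/ld.so.conf.d/* > ld_conf.txt

# Copy the files over from the other machine (e.g. with scp), then:
diff env.txt env.other.txt
diff ld_path.txt ld_path.other.txt
diff ld_conf.txt ld_conf.other.txt
```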

You can start with `locate libcublas.so` on both machines to see if there's a difference. Alternatively, you can `strace -f -e open` the program to check where it tries to load libcublas.so from.
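
Spelled out, that could look like the sketch below. The profiler invocation and mnist.py target are the placeholders from the question, and tracing openat as well as open is an assumption for newer glibc, which opens libraries via openat():

```bash
# Find every installed copy of libcublas (updatedb refreshes locate's index)
sudo updatedb
locate libcublas.so

# Watch which paths the dynamic loader actually tries at startup
strace -f -e trace=open,openat ./nv-nsight-cu-cli ./mnist.py 2>&1 | grep libcublas
```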

Your error has (for now) nothing to do with GPUs: libcublas.so.9.0 simply cannot be found. Find it, find out why TensorFlow cannot find it, and your problem will be solved.
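
If the library turns out to be installed but simply not on the search path, here is a minimal sketch of two common fixes, assuming libcublas.so.9.0 lives in /usr/local/cuda-9.0/lib64 (a typical but not guaranteed install location; adjust to wherever locate actually found it):

```bash
# Per-session: prepend the CUDA 9.0 library directory to the search path
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# ...and keep it across sudo (plain sudo resets the environment):
sudo -E ./nv-nsight-cu-cli ./mnist.py

# System-wide: register the directory with the dynamic linker instead
echo /usr/local/cuda-9.0/lib64 | sudo tee /etc/ld.so.conf.d/cuda-9-0.conf
sudo ldconfig
```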

Robin Thoni
  • I followed the steps you mentioned. There are indeed some differences in /etc/ld.so.conf.d/*. I would also like to mention that the command failed regardless of whether I ran TensorFlow or not. It failed on every CUDA program, even the simple ones I created on my own without other lib dependencies. To be more specific, I tried the matrixMul program from cuda-samples. I can see the program start executing, but as soon as it reaches a CUDA call it returns error code 11. Any idea what this error represents? – edhu May 01 '19 at 17:33
  • After some careful searching I found the answer here: https://devtalk.nvidia.com/default/topic/1045430/nsight-compute-/partial-profiling/ It appears GP100 is not supported for some reason. But thanks for the reply. – edhu May 01 '19 at 18:17
  • Oh yeah, sorry, I did not even pay attention to what GPU it was, as the original problem was about loading a shared library... – Robin Thoni May 01 '19 at 18:24

It appears that the GP100 is not supported by the tool at the moment. The answer can be found here:

Nsight Compute only supports Pascal (other than GP100) and later GPUs.
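
As a quick sanity check before profiling (a sketch; nvidia-smi ships with the NVIDIA driver), you can list the installed GPUs to see whether one of them is a GP100-based Tesla P100:

```bash
# Lists each GPU by name; a Tesla P100 is the GP100 chip, which
# Nsight Compute does not support per the quote above.
nvidia-smi -L
```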

edhu