I've bought a new laptop, a SKIKK LYNX II 15" to be exact. On my previous notebook I could install Anaconda personal, then run `conda install tensorflow-gpu`, and it would detect the GT650M inside that machine (Lenovo IdeaPad Y500) and use it to train my Keras sequential models that contain CuDNNLSTM layers.

I followed the same steps on this new machine, which comes with not only the RTX 2070 but also Intel UHD graphics. When I first tested the performance of the new machine against my old one, it struck me as odd that it was only marginally faster. After all, it should be running circles around it, right? (My old laptop did not show performance statistics for the integrated graphics unit in Task Manager, but the new one does.)

That's when I discovered that Anaconda had installed TensorFlow version 1.14 rather than version 2.1. I uninstalled it and reinstalled using `pip install tensorflow-gpu`, which found version 2.2 and installed that. This is where I realized that it wasn't detecting my RTX 2070 GPU at all and (likely, based on another post I read but lost due to the frequent restarts) was running out of memory on the Intel UHD graphics.

In between, I thought I should install CUDA version 10.1. I did so and tried reinstalling using both conda and pip, but no luck. Right now I'm at a loss as to how to fix this issue.

To be clear, I'm not even entirely sure what the issue is, but it must have something to do with the dual graphics card configuration in this laptop and the fact that conda cannot find a TensorFlow version higher than 1.14. Any help would be greatly appreciated; if I can't fix this I'm going to have to try to get my money back somehow :S.

talonmies
XiB
  • Follow this sequence: 1. Update your NVIDIA driver. 2. Install the CUDA toolkit and cuDNN versions that are compatible with your GPU driver version. See the compatibility table here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-application-compatibility – Bernad Peter Jun 26 '20 at 01:56
  • Thank you @BernadPeter for your rapid reply, I went to bed last night not expecting anyone to reply that fast! I've spent the morning deleting everything I could find related to NVIDIA drivers, CUDA, cuDNN and Anaconda. I then did as you suggested and installed the latest NVIDIA GPU driver (version 451.48). Next I went to NVIDIA's CUDA download page and downloaded the latest CUDA version (11.01.451.22), which came bundled with NVIDIA GPU driver version 451.22, and I told it to install everything. It did not override my more recent NVIDIA GPU driver, though. – XiB Jun 26 '20 at 10:44
  • Not to be deterred, I reinstalled Anaconda3 for Python 3.7 and opened up the Anaconda command prompt. Full of hope, I typed `conda install tensorflow-gpu`, but it again offered to install version 1.14. In this post [link](https://stackoverflow.com/questions/54271094/conda-install-c-conda-forge-tensorflow-just-stuck-in-solving-environment) I learned that you can also install packages through the graphical user interface. So I went there and, lo and behold, it lists tensorflow-gpu version 2.1 as available. But when I click to install, it offers to install 1.14 on this machine instead. – XiB Jun 26 '20 at 10:47
  • I ran `tf.test.gpu_device_name()` in the conda command prompt after installing tensorflow-gpu 2.2 using `pip install tensorflow-gpu`, and it gave me a lot more output than it had inside Jupyter Notebook. I can't post everything here in a single comment, so I put the output in an answer to my own question. – XiB Jun 26 '20 at 10:54
  • Which OS are you using, Windows or Linux? – Bernad Peter Jun 26 '20 at 12:24
  • Oh, sorry, I totally forgot to mention: Windows 10 – XiB Jun 26 '20 at 12:44
  • Hello @BernadPeter, I'm pretty sure I fixed it. I updated my answer below to include what worked for me, and I also described an interesting artefact of the Task Manager while training a model on the GPU. I cannot thank you enough for your reply; even though it didn't directly lead to a solution, it really helped me work through the issue! – XiB Jun 26 '20 at 14:04
  • Great! Happy machine learning! Give a thumbs up! – Bernad Peter Jun 26 '20 at 14:28
  • Nooby question incoming, but how? – XiB Jun 27 '20 at 21:54

1 Answer

>>> tf.test.gpu_device_name()
2020-06-26 12:51:16.233616: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-06-26 12:51:16.240293: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20b2b3973a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-26 12:51:16.240366: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-26 12:51:16.243205: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-06-26 12:51:17.442337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.455GHz coreCount: 36 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 327.88GiB/s
2020-06-26 12:51:17.443123: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-06-26 12:51:17.443845: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
2020-06-26 12:51:17.530901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-06-26 12:51:17.562869: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-06-26 12:51:17.800865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-06-26 12:51:17.801663: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cusparse64_10.dll'; dlerror: cusparse64_10.dll not found
2020-06-26 12:51:17.802909: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
2020-06-26 12:51:17.802957: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-06-26 12:51:17.893040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-26 12:51:17.893174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-06-26 12:51:17.894354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-06-26 12:51:17.904105: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20b373ab8f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-26 12:51:17.904246: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5

From this I conclude that I've installed the wrong version of CUDA (no idea how I managed to mess that up after you posted the compatibility table).
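The missing files named in the log above can be pulled out mechanically, which makes the mismatch easier to see. The sketch below is only an illustration: `missing_libraries` and `sample_log` are hypothetical names, and the sample holds three lines copied from the log with their timestamps removed. The file names appear to encode the versions this TensorFlow build expects (`cudart64_101.dll` for the CUDA 10.1 runtime, `cudnn64_7.dll` for cuDNN 7), which would not be provided by a CUDA 11 install.

```python
import re

# Matches the dso_loader warning TensorFlow prints when a library fails to open.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names reported as not loadable, in log order."""
    return MISSING_LIB.findall(log_text)


# Three lines taken from the log above (timestamps removed for brevity).
sample_log = """\
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
"""

print(missing_libraries(sample_log))
# -> ['cudart64_101.dll', 'cudnn64_7.dll']
```

Running the same function over the full log would also list `cublas64_10.dll` and `cusparse64_10.dll`, the other two libraries the log reports as not found.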

To me it seems like conda cannot find the required drivers, CUDA, cuDNN, or whatever else it needs to install tensorflow-gpu version 2.1, and when I install with pip it ignores those requirements and produces the error above when I search for my GPU.

UPDATE: I read that Anaconda's tensorflow-gpu package automatically downloads all CUDA and cuDNN dependencies and installs them in your environment. I was working from the base environment that, as described above, wouldn't install a tensorflow-gpu version above 1.14, and that version wouldn't detect my RTX 2070.

I think I solved the problem by very simply creating a new environment (NOT A CLONE, A NEW ENVIRONMENT WITH NEXT TO NO PACKAGES) from the Anaconda GUI, opening up a terminal in it and installing tensorflow-gpu. This time it offered to install version 2.1 along with a much longer list of dependencies, including CUDA and cuDNN. I then installed numpy, pandas, matplotlib and jupyter in the new environment and confirmed that `tf.test.gpu_device_name()` did in fact successfully detect my RTX 2070 and load the relevant CUDA libraries (see output below). After that I was able to put the RTX 2070 to work training my CuDNNLSTM-based model for 600 epochs in the time it took my old laptop to complete 20 epochs.
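The check described above can also be run as a small script instead of at the interactive prompt. This is a minimal sketch and not part of the original fix: `visible_gpus` is a hypothetical helper name, and it assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available. It returns `None` when TensorFlow cannot be imported, so it is safe to run in any environment.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Available in TensorFlow 2.1+; older 2.0 builds only expose it
    # under tf.config.experimental.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment.")
elif gpus:
    print("TensorFlow sees:", gpus)
else:
    print("TensorFlow is installed but sees no GPU.")
```

An empty list from a freshly created environment would suggest the CUDA and cuDNN libraries were not pulled in alongside tensorflow-gpu.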

I would like to thank @Bernad Peter for his fast replies and hopeful comments. Last night I was ready to pack it all up, send the new machine back and crawl into a corner. As of right now I'm confident that I can utilize the resources in this machine to their full extent!

One last thing struck me as odd, though. This laptop came with software to manage the lights, battery and fan settings, and it has a page called computer management where I can see the utilization of resources inside the machine. While I was training my model, that page showed the GPU at 74 degrees C and 68% load. In Windows Task Manager the GPU temperature was the same, but the load was reported as 2%. If it persists I'll mention it again the next time I run into trouble.
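A third reading can help settle which of the two tools to trust. The sketch below queries `nvidia-smi`, the command-line tool that ships with the NVIDIA driver, for load and temperature. The helper names `parse_gpu_stats` and `read_gpu_stats` are hypothetical, and the script assumes `nvidia-smi` is on the PATH, which may not be the case on every Windows install. One possible explanation for the 2% figure, offered as an assumption and not confirmed anywhere in this thread, is that Task Manager's default GPU graphs show engines such as 3D and Copy and not the compute engine that CUDA work runs on.

```python
import shutil
import subprocess

# Ask nvidia-smi for load (%) and temperature (C) as bare CSV values.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(line):
    """Turn one CSV line such as '68, 74' into (load_percent, temp_celsius)."""
    try:
        load, temp = (int(field.strip()) for field in line.split(","))
    except ValueError:
        # Covers non-numeric fields and an unexpected number of columns.
        return None
    return load, temp


def read_gpu_stats():
    """Return (load, temp) for the first GPU, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True)
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return parse_gpu_stats(lines[0])


print(read_gpu_stats())
```

Run while a model is training, a load figure close to the vendor tool's 68% would suggest the Task Manager number is a display quirk and not a sign that the GPU is idle.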

Output of the `tf.test.gpu_device_name()` command after creating the new Anaconda environment and installing tensorflow-gpu:

>>> tf.test.gpu_device_name()
2020-06-26 14:49:39.345799: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-06-26 14:49:39.349151: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-06-26 14:49:39.585934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.455GHz coreCount: 36 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 327.88GiB/s
2020-06-26 14:49:39.586034: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-06-26 14:49:39.592732: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-06-26 14:49:39.597704: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-06-26 14:49:39.599091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-06-26 14:49:39.604811: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-06-26 14:49:39.607991: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-06-26 14:49:39.620572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-06-26 14:49:39.620837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-26 14:49:40.196301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-26 14:49:40.196439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-06-26 14:49:40.197656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-06-26 14:49:40.200924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/device:GPU:0 with 6719 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
'/device:GPU:0'
XiB