
I'm getting a CUDNN_STATUS_INTERNAL_ERROR like the one below.

python train_v2.py

Traceback (most recent call last):
  File "train_v2.py", line 113, in <module>
    main()
  File "train_v2.py", line 74, in main
    model.cuda()
  File "/home/ahkim/Desktop/squad_vteam/src/model.py", line 234, in cuda
    self.network.cuda()
  File "/home/ahkim/anaconda3/envs/san/lib/python3.6/site-packages/torch/nn/modules/module.py", line 249, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/ahkim/anaconda3/envs/san/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ahkim/anaconda3/envs/san/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ahkim/anaconda3/envs/san/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ahkim/anaconda3/envs/san/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 112, in _apply
    self.flatten_parameters()
  File "/home/ahkim/anaconda3/envs/san/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 105, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

What should I try to resolve this issue? I tried deleting ~/.nv, but that did not help.
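
For reference, here is a minimal sketch (module sizes and tensor shapes are made up for illustration, not taken from train_v2.py) that exercises the same code path as the traceback, i.e. moving an RNN module to the GPU, which calls flatten_parameters() and initializes cuDNN:

import torch
import torch.nn as nn

# Moving an RNN module to the GPU calls flatten_parameters(), the call that
# raises CUDNN_STATUS_INTERNAL_ERROR in the traceback above.
rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, batch_first=True)
rnn.cuda()  # fails here when cuDNN cannot initialize

x = torch.randn(4, 10, 16).cuda()
out, _ = rnn(x)
print(out.shape)  # torch.Size([4, 10, 32]) when cuDNN works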


nvidia-smi

Wed Aug  8 10:56:29 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67                 Driver Version: 390.67                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:04:00.0 Off |                  N/A |
| 22%   21C    P8    15W / 250W |    125MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:05:00.0 Off |                  N/A |
| 22%   24C    P8    14W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:08:00.0 Off |                  N/A |
| 22%   23C    P8    14W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 00000000:09:00.0 Off |                  N/A |
| 22%   23C    P8    15W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX TIT...  Off  | 00000000:85:00.0 Off |                  N/A |
| 22%   24C    P8    14W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX TIT...  Off  | 00000000:86:00.0 Off |                  N/A |
| 22%   23C    P8    15W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX TIT...  Off  | 00000000:89:00.0 Off |                  N/A |
| 22%   21C    P8    15W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX TIT...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 22%   23C    P8    15W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1603      C   /usr/bin/python                              114MiB |
+-----------------------------------------------------------------------------+

Update:

The same code runs without error using NVIDIA driver version 396.26 (CUDA V9.1.85, torch.backends.cudnn.version(): 7102). I get the error using driver version 390.67 (also CUDA V9.1.85, torch.backends.cudnn.version(): 7102).
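
For reference, these are the values PyTorch itself reports (the cudnn.version() number above comes from this); a quick way to print them on either machine:

import torch

print(torch.__version__)               # PyTorch build
print(torch.version.cuda)              # CUDA version PyTorch was compiled against
print(torch.backends.cudnn.version())  # e.g. 7102, as in the update above
print(torch.cuda.is_available())       # should be True before calling .cuda()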

aerin
  • how many gpus do you have? What is the output of `nvidia-smi` on your machine? – Robert Crovella Aug 08 '18 at 17:55
  • @RobertCrovella I have 8 gpus on the server. I posted the `nvidia-smi` result. – aerin Aug 08 '18 at 17:59
  • This might require a full test case to identify what the issue is. Is there any particular reason you want to use 5 GPUs? Is there any particular reason you don't want to use GPU 0? Do you get a similar error if you specify `CUDA_VISIBLE_DEVICES=0,1,2,3` ? Why did you edit out the mention of how you are setting that variable in your question? – Robert Crovella Aug 08 '18 at 18:21
  • @RobertCrovella Because even without saying `CUDA_VISIBLE_DEVICES=0,1,2,3`, I was getting the same error. – aerin Aug 08 '18 at 18:22
  • @RobertCrovella When I was writing the question, GPU 0 was used by someone else so I thought there was an issue with `CUDA_VISIBLE_DEVICES` but apparently it's not. – aerin Aug 08 '18 at 18:23
  • OK, your original question had this text in it `I can run it without error if I don't specify CUDA_VISIBLE_DEVICES` (go back and look at your original posting). So that is why I was asking about it. – Robert Crovella Aug 08 '18 at 18:31
  • @RobertCrovella I know. Sorry for the confusion. I was running this on two different servers and mixed this one up with the other. – aerin Aug 08 '18 at 18:33
  • @RobertCrovella I think it's a driver version issue. The same code runs without error using Nvidia Driver Version: 396.26. I'm getting an error using Driver Version: 390.67 – aerin Aug 08 '18 at 21:08
  • well, you haven't indicated what CUDA or cuDNN version you are using. 390.67 should work with CUDA 9.0 or CUDA 9.1. If you are using CUDA 9.2, yes, you would need a 396.xx or newer driver. – Robert Crovella Aug 08 '18 at 21:16

3 Answers


Solved by the steps below.

  1. export LD_LIBRARY_PATH="/usr/local/cuda-9.1/lib64"

  2. Due to an NFS issue, keep the PyTorch cache off NFS. For example:

    $ rm ~/.nv -rf

    $ mkdir -p /tmp/$USER/.nv

    $ ln -s /tmp/$USER/.nv ~/.nv
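
A quick sanity check for both steps (assuming the same paths as above):

import os

print(os.environ.get("LD_LIBRARY_PATH"))  # should contain /usr/local/cuda-9.1/lib64
nv_cache = os.path.expanduser("~/.nv")
print(os.path.islink(nv_cache))           # True once the symlink is in place
print(os.path.realpath(nv_cache))         # should resolve to /tmp/<your user>/.nv, off NFS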

aerin

Go to the PyTorch website and choose the build that matches your CUDA version: https://pytorch.org/

cu100 = cuda 10.0

pip3 uninstall torch
pip3 install https://download.pytorch.org/whl/cu100/torch-1.0.1.post2-cp36-cp36m-linux_x86_64.whl
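
After reinstalling, you can confirm that the wheel matches your CUDA toolkit (the exact version strings depend on the wheel you picked):

import torch

print(torch.__version__)          # e.g. 1.0.1.post2 for the wheel above
print(torch.version.cuda)         # e.g. 10.0 for a cu100 build
print(torch.cuda.is_available())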
張介騰

Go to https://pytorch.org/ and copy the command shown in the "Run this Command:" box. You don't need to change anything else; just copy that command and run it in the environment you are using. I hope it works for you; it works fine for me.

For RTX 2070

Tip 1

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch

Tip 2

conda install pytorch-nightly cudatoolkit=10.0 -c pytorch
Khawar Islam