I am trying to run some basic transfer-learning code using VGG16. I am on Ubuntu 16.04 with TensorFlow 1.3 and Keras, and I have four GTX 1080 Ti GPUs.
When I get to these lines of code:
from keras.preprocessing.image import ImageDataGenerator
from keras import applications
datagen = ImageDataGenerator(rescale=1. / 255)
model = applications.VGG16(include_top=False, weights='imagenet')
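If it is relevant, I believe the applications.VGG16(...) call is the first point in the notebook where Keras creates its TensorFlow session, so it is also the first point where any GPU memory gets allocated. A quick sanity check I could run beforehand (a hypothetical snippet using TensorFlow's device_lib helper, not part of my original code) to confirm the GPUs are visible:

from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; this machine should report one CPU
# plus the four GTX 1080 Ti GPUs. Note that in TF 1.x this call itself
# initialises the GPUs and can reserve memory on them.
print(device_lib.list_local_devices())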
At that point, the output of nvidia-smi shows this:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14241    G   /usr/lib/xorg/Xorg                             256MiB |
|    0     14884    G   compiz                                         155MiB |
|    0     16497    C   /home/simon/anaconda3/bin/python             10267MiB |
|    1     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
|    2     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
|    3     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
+-----------------------------------------------------------------------------+
Then the output in the terminal is:
2017-09-02 15:59:15.946927: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-09-02 15:59:15.946960: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-09-02 15:59:15.946973: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
And my Jupyter notebook kernel dies.
Clearly this is a memory issue, but I don't understand why my GPUs are suddenly being taken up by this bit of code. I should add that the problem only began in the last 24 hours; all of this code was running fine a day ago. There are many answers to similar problems here, but they all refer to other instances of TF running (and suggest shutting them down). In my case, this is the only TF application running (in fact, the only application running at all).
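For what it's worth, my understanding is that TensorFlow reserves nearly all of the memory on every GPU it can see as soon as its first session is created, which would explain the ~10 GiB per card that nvidia-smi reports above. Below is a sketch of how I could restrict that, assuming the TF 1.x ConfigProto / keras.backend.set_session API (the CUDA_VISIBLE_DEVICES value and the single-GPU choice are just for illustration); I have not confirmed that this avoids the cuDNN error:

import os

# Hypothetical workaround: expose only the first GPU to this process.
# This has to be set before TensorFlow initialises CUDA, so safest is
# to set it before importing tensorflow/keras at all.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import tensorflow as tf
from keras import backend as K

# Ask TensorFlow to grow its GPU memory allocation on demand instead of
# reserving almost all of it up front when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

But even if that works around the allocation, I'd still like to understand why the same code ran fine a day ago and now fails with CUDNN_STATUS_INTERNAL_ERROR.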