I am trying to run some basic transfer-learning code using VGG16. I am on Ubuntu 16.04 with TensorFlow 1.3 and Keras, and I have four GTX 1080 Ti GPUs.
When I get to these lines of code:
from keras.preprocessing.image import ImageDataGenerator
from keras import applications
datagen = ImageDataGenerator(rescale=1. / 255)
model = applications.VGG16(include_top=False, weights='imagenet')
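If it is relevant, I believe the applications.VGG16(...) call is the first point in the notebook where Keras creates its TensorFlow session, so it is also the first point where any GPU memory gets allocated. A quick sanity check I could run beforehand (a hypothetical snippet using TensorFlow's device_lib helper, not part of my original code) to confirm the GPUs are visible:

from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; this machine should report one CPU
# plus the four GTX 1080 Ti GPUs. Note that in TF 1.x this call itself
# initialises the GPUs and can reserve memory on them.
print(device_lib.list_local_devices())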
At that point, the output of nvidia-smi shows this:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14241    G   /usr/lib/xorg/Xorg                             256MiB |
|    0     14884    G   compiz                                         155MiB |
|    0     16497    C   /home/simon/anaconda3/bin/python             10267MiB |
|    1     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
|    2     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
|    3     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
+-----------------------------------------------------------------------------+
Then the output in the terminal is:
2017-09-02 15:59:15.946927: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-09-02 15:59:15.946960: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-09-02 15:59:15.946973: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
And my Jupyter notebook kernel dies.
Clearly this is a memory issue, but I don't understand why my GPUs are suddenly being taken up by this bit of code. I should add that the problem only began in the last 24 hours; all of this code was running fine a day ago. There are many answers to similar problems here, but they all refer to other instances of TF running (and suggest shutting them down). In my case, this is the only TF application running (in fact, the only application running at all).
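For what it's worth, my understanding is that TensorFlow reserves nearly all of the memory on every GPU it can see as soon as its first session is created, which would explain the ~10 GiB per card that nvidia-smi reports above. Below is a sketch of how I could restrict that, assuming the TF 1.x ConfigProto / keras.backend.set_session API (the CUDA_VISIBLE_DEVICES value and the single-GPU choice are just for illustration); I have not confirmed that this avoids the cuDNN error:

import os

# Hypothetical workaround: expose only the first GPU to this process.
# This has to be set before TensorFlow initialises CUDA, so safest is
# to set it before importing tensorflow/keras at all.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import tensorflow as tf
from keras import backend as K

# Ask TensorFlow to grow its GPU memory allocation on demand instead of
# reserving almost all of it up front when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

But even if that works around the allocation, I'd still like to understand why the same code ran fine a day ago and now fails with CUDNN_STATUS_INTERNAL_ERROR.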