TensorFlow crashes with error CUDNN_STATUS_BAD_PARAM

Question

I'm fine-tuning Inception ResNet v2 using Keras 2.1.4 with the TensorFlow 1.5 backend.

My training crashed before the end of the 2nd epoch with the following error message:

Epoch 1/50
8103/8103 [==============================] - 3197s 395ms/step - loss: 0.0519 - f1: 0.4272 - precision: 0.6371 - recall: 0.3239 - val_loss: 0.0363 - val_f1: 0.5000 - val_precision: 0.7314 - val_recall: 0.3807
Epoch 2/50
8102/8103 [============================>.] - ETA: 0s - loss: 0.0425 - f1: 0.4800 - precision: 0.6890 - recall: 0.3692
2018-02-18 00:21:16.677165: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 32 spatial: 149 149  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
2018-02-18 00:21:16.677219: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 32 spatial: 149 149  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
2018-02-18 00:21:16.677347: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 32 spatial: 147 147  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
Aborted (core dumped)

I believe this may be related to the question "tensorflow GPU crashes for 0 batch size CUDNN_STATUS_BAD_PARAM".

However, if it's the same problem I don't understand why the 1st epoch completed successfully and the crash happened only at the end of the 2nd epoch.
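To check whether an empty batch is actually reaching the model (the `count: 0` in the log suggests a batch with zero samples), the data generator could be wrapped in a small guard. This is only a minimal sketch: `guard_empty_batches` is a hypothetical helper name, and it assumes the generator yields `(inputs, targets)` pairs where `len(inputs)` is the number of samples in the batch.

```python
def guard_empty_batches(batches):
    """Yield (inputs, targets) pairs, skipping any pair with zero samples.

    Assumption: `batches` is an iterable of (inputs, targets) pairs and
    len(inputs) equals the number of samples in that batch.
    """
    for step, (inputs, targets) in enumerate(batches):
        if len(inputs) == 0:
            # Report the step so the offending position in the epoch is visible.
            print("skipping empty batch at step %d" % step)
            continue
        yield inputs, targets


# Usage with plain lists standing in for real image batches:
demo = [([1, 2], [0, 1]), ([], []), ([3], [1])]
kept = list(guard_empty_batches(demo))
```

Because the wrapper is lazy, it should also work around a generator that loops forever, which is what Keras expects for generator-based training. Skipping a batch shifts which batch lands on which step, so this is meant as a diagnostic rather than a fix.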
