
I'm fine-tuning Inception ResNet v2 using Keras 2.1.4 with the TensorFlow 1.5 backend.

My training crashed before the end of the 2nd epoch with the following error message:

Epoch 1/50
8103/8103 [==============================] - 3197s 395ms/step - loss: 0.0519 - f1: 0.4272 - precision: 0.6371 - recall: 0.3239 - val_loss: 0.0363 - val_f1: 0.5000 - val_precision: 0.7314 - val_recall: 0.3807
Epoch 2/50
8102/8103 [============================>.] - ETA: 0s - loss: 0.0425 - f1: 0.4800 - precision: 0.6890 - recall: 0.3692
2018-02-18 00:21:16.677165: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 32 spatial: 149 149  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
2018-02-18 00:21:16.677165: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 32 spatial: 149 149  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
2018-02-18 00:21:16.677219: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 32 spatial: 149 149  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
2018-02-18 00:21:16.677347: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 32 spatial: 147 147  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
Aborted (core dumped)

I believe this may be related to the question "tensorflow GPU crashes for 0 batch size CUDNN_STATUS_BAD_PARAM".

However, if it's the same problem I don't understand why the 1st epoch completed successfully and the crash happened only at the end of the 2nd epoch.
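To check whether a zero-sized batch is actually reaching the model, and at which step, the training generator could be wrapped in a small guard. This is only a diagnostic sketch, not a known fix. It assumes the data comes from a plain Python generator yielding `(inputs, targets)` tuples with a single input array, which the question does not show. `guard_empty_batches` and `toy_batches` are hypothetical names; `toy_batches` is a deliberately buggy stand-in used only to show how an empty batch can land in a later epoch when the generator's position carries over between epochs.

```python
def guard_empty_batches(batches, steps_per_epoch):
    """Pass batches through unchanged, but fail early on an empty one.

    Assumption: `batches` yields (inputs, targets) tuples and len(inputs)
    is the batch size (true for a single NumPy array or a list).
    """
    for step, batch in enumerate(batches):
        inputs = batch[0]
        if len(inputs) == 0:
            # Convert the running batch count into a 1-based epoch/step pair.
            epoch, step_in_epoch = divmod(step, steps_per_epoch)
            raise ValueError(
                "empty batch at epoch %d, step %d of %d"
                % (epoch + 1, step_in_epoch + 1, steps_per_epoch)
            )
        yield batch


def toy_batches(n_samples, batch_size):
    """Hypothetical generator with an off-by-one bug (illustration only).

    It cycles over the data forever and, when n_samples is an exact
    multiple of batch_size, emits one empty batch at the end of each pass.
    """
    data = list(range(n_samples))
    while True:
        for start in range(0, n_samples + 1, batch_size):
            chunk = data[start:start + batch_size]
            yield chunk, chunk
```

With `toy_batches(10, 2)` and `steps_per_epoch=3`, each pass over the data produces five full batches followed by an empty one, so the first epoch sees only full batches and the guard raises on the last step of the second epoch. In the real setup the guard would presumably wrap whatever generator is passed to `fit_generator`, for example `model.fit_generator(guard_empty_batches(train_gen, steps_per_epoch), steps_per_epoch=steps_per_epoch, ...)`, where `model` and `train_gen` are placeholders. Because Keras prefetches batches into a queue, the reported step is the batch pulled from the generator, which may be a few steps ahead of the one being trained on. The validation generator would need the same check separately.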

traveh
