I created my own dataset for training the CIFAR-10 network, following the instructions from this post: How to create dataset similar to cifar-10. My data is stored in a file named bag1-data.bin.
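For reference, each record in that file follows the standard CIFAR-10 binary layout: one label byte (0 or 1 in my case) followed by 3072 pixel bytes (the 32x32 red plane, then green, then blue, row-major). A simplified sketch of the kind of conversion I do (write_record and labeled_paths are just illustrative names, and the image loading/labeling details are omitted):

import numpy as np
from PIL import Image

def write_record(out_file, image_path, label):
    # Resize to 32x32 RGB and reorder to planar R, G, B as the CIFAR-10 reader expects.
    img = Image.open(image_path).convert('RGB').resize((32, 32))
    pixels = np.asarray(img, dtype=np.uint8)      # shape (32, 32, 3)
    planes = pixels.transpose(2, 0, 1).tobytes()  # red plane, green plane, blue plane
    out_file.write(bytes(bytearray([label])) + planes)

with open('bag1-data.bin', 'wb') as f:
    for path, label in labeled_paths:             # labeled_paths: illustrative list of (path, 0 or 1) pairs
        write_record(f, path, label)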
I edited the tutorial source code to train the network on my data. The dataset is fairly small (1149 images) and the network now has to predict just two classes (a sketch of those edits follows the traceback below). When I run the CIFAR-10 training on this data, I get the following error:
tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1
2017-05-25 04:27:17.614312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y Y
2017-05-25 04:27:17.614346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1: Y Y
2017-05-25 04:27:17.614386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:03:00.0)
2017-05-25 04:27:17.614425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:04:00.0)
Traceback (most recent call last):
File "cifar10_multi_gpu_train.py", line 272, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "cifar10_multi_gpu_train.py", line 268, in main
train()
File "cifar10_multi_gpu_train.py", line 240, in train
assert not np.isnan(loss_value), 'Model diverged with loss = NaN'
AssertionError: Model diverged with loss = NaN
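For context, my edits to the tutorial sources amount to roughly this (a sketch; the constant names are the ones in cifar10_input.py from the TensorFlow CIFAR-10 tutorial, and the real files contain much more code around these lines):

# cifar10_input.py -- constants adjusted to my dataset
NUM_CLASSES = 2
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 1149

# in distorted_inputs(), the file list now points at my single binary file
filenames = [os.path.join(data_dir, 'bag1-data.bin')]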
I have read that this can happen when the gradient explodes, but I have tweaked the learning rate as much as I could, and the error still appears even with INITIAL_LEARNING_RATE = 0.0000000001.
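That is the constant near the top of cifar10.py (if I remember correctly, the tutorial default is 0.1); it feeds the exponentially decayed rate used by the gradient descent optimizer:

# cifar10.py -- the constant I have been lowering; loss still goes to NaN
INITIAL_LEARNING_RATE = 0.0000000001

# further down, in train():
lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE, global_step,
                                decay_steps, LEARNING_RATE_DECAY_FACTOR,
                                staircase=True)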
I can train a CIFAR network on this data using MatConvNet. Even after copying the network parameters from that library into TensorFlow, the problem persists.
What am I doing wrong? Is this problem related to how I generated my data? Is there any parameter tweak that would make the training possible?