
Hi all,

I started the training process using DeepLab v3+ following this guide. However, after step 1480, I got this error:

Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_2

The detailed training log is here.

Could someone suggest how to solve this issue? Thanks!

Milton Wong

1 Answer


Based on the log, it seems that you are training with batch_size = 1 and fine_tune_batch_norm = True (the default). Since you are fine-tuning batch norm during training, it is better to set the batch size as large as possible (see the comments in train.py and Q5 in the FAQ). If only limited GPU memory is available, you could fine-tune from the provided pre-trained checkpoint, set a smaller learning rate, and set fine_tune_batch_norm = False (see model_zoo.md for details). Also make sure the flag tf_initial_checkpoint points to the desired pre-trained checkpoint.
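
For reference, a launch command along those lines might look roughly like the sketch below. All paths are placeholders, the flag values are illustrative, and any other flags from the guide you followed (crop size, atrous rates, etc.) still apply:

    # Minimal sketch: fine-tune from a pre-trained checkpoint with a small
    # batch size, keeping batch-norm statistics frozen and the learning rate low.
    python deeplab/train.py \
      --logtostderr \
      --training_number_of_steps=30000 \
      --train_batch_size=4 \
      --fine_tune_batch_norm=false \
      --base_learning_rate=0.0001 \
      --model_variant="xception_65" \
      --tf_initial_checkpoint=/path/to/pretrained/model.ckpt \
      --train_logdir=/path/to/train_logdir \
      --dataset_dir=/path/to/tfrecord

The key points are fine_tune_batch_norm=false whenever the batch size is small, and tf_initial_checkpoint pointing at the checkpoint you want to fine-tune from.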

    What would you consider "limited GPU memory"? I am using a GTX 1060 with 6GB of dedicated memory and get the same error. Do you think that using MobileNet instead of Xception would allow me to train locally? – srcolinas Mar 17 '18 at 22:50
  • Is it possible to achieve the same accuracy with a small batch_size, perhaps by training longer? With ~18GB of memory across two GPUs, it seems I can only fit a batch size of 2. – KRish Apr 16 '18 at 21:02