
Hi all,

I started the training process using DeepLab v3+ following this guide. However, after step 1480, I got this error:

Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_2

The detailed training log is here.

Could someone suggest how to solve this issue? Thanks!

Milton Wong

1 Answer


Based on the log, it seems that you are training with batch_size = 1 and fine_tune_batch_norm = True (the default). Since you are fine-tuning batch norm during training, it is better to set the batch size as large as possible (see the comments in train.py and Q5 in the FAQ). If only limited GPU memory is available, you could fine-tune from the provided pre-trained checkpoint, set a smaller learning rate, and set fine_tune_batch_norm = False (see model_zoo.md for details). Also make sure the flag tf_initial_checkpoint points to the desired pre-trained checkpoint.
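
For reference, a launch command along those lines might look roughly like the sketch below. All paths are placeholders, the flag values are illustrative, and any other flags from the guide you followed (crop size, atrous rates, etc.) still apply:

    # Minimal sketch: fine-tune from a pre-trained checkpoint with a small
    # batch size, keeping batch-norm statistics frozen and the learning rate low.
    python deeplab/train.py \
      --logtostderr \
      --training_number_of_steps=30000 \
      --train_batch_size=4 \
      --fine_tune_batch_norm=false \
      --base_learning_rate=0.0001 \
      --model_variant="xception_65" \
      --tf_initial_checkpoint=/path/to/pretrained/model.ckpt \
      --train_logdir=/path/to/train_logdir \
      --dataset_dir=/path/to/tfrecord

The key points are fine_tune_batch_norm=false whenever the batch size is small, and tf_initial_checkpoint pointing at the checkpoint you want to fine-tune from.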

    What would you consider "limited GPU memory"? I am using a GTX 1060 with 6GB of dedicated memory and get the same error. Do you think that using MobileNet instead of Xception would allow me to train locally? – srcolinas Mar 17 '18 at 22:50
  • Is it possible to achieve the same accuracy with a small batch_size, perhaps by training longer? With ~18GB of memory across two GPUs, it seems I can only fit a batch size of 2. – KRish Apr 16 '18 at 21:02