I am modifying the DeepLab network. I added a node to the MobileNet-v3 feature extractor's first layer that reuses the existing variables. Since no extra parameters are needed, I should in theory be able to load the old checkpoint.
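For context, the change is conceptually like the following minimal TF 1.x sketch (not the actual DeepLab code; the scope name, layer, and channel count are illustrative): the extra node runs under the same variable scope with reuse enabled, so it shares the first layer's variables instead of creating new ones.

import tensorflow as tf  # TF 1.x, as used by the DeepLab research code

def first_conv_with_extra_node(inputs):
    # Illustrative only: both calls name the layer 'Conv', so with
    # AUTO_REUSE the second call shares the first call's variables
    # and no new parameters are created.
    with tf.variable_scope('MobilenetV3', reuse=tf.AUTO_REUSE):
        original = tf.layers.conv2d(inputs, 16, 3, strides=2,
                                    padding='same', name='Conv')
        extra = tf.layers.conv2d(inputs, 16, 3, strides=2,
                                 padding='same', name='Conv')
    return original + extra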
Here is the situation I can't understand:
When I start training in a new, empty folder and load the checkpoint like this:
python "${WORK_DIR}"/train.py \
#--didn't change other parameters \
--train_logdir="${EXP_DIR}/train" \
--fine_tune_batch_norm=true \
--tf_initial_checkpoint="init/deeplab/model.ckpt"
I get an error:
ValueError: Total size of new array must be unchanged for MobilenetV3/Conv/BatchNorm/gamma lh_shape: [(16,)], rh_shape: [(480,)]
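The shapes stored in the checkpoint itself can be listed for comparison (a small inspection script; the path is the same one passed to --tf_initial_checkpoint):

import tensorflow as tf

# Print the stored shape of every first-layer batch-norm variable so it
# can be compared against the (16,)-vs-(480,) mismatch in the error.
for name, shape in tf.train.list_variables('init/deeplab/model.ckpt'):
    if 'MobilenetV3/Conv/BatchNorm' in name:
        print(name, shape)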
BUT if I start training in a new, empty folder without loading any checkpoint:
python "${WORK_DIR}"/train.py \
#--didn't change other parameters \
--train_logdir="${EXP_DIR}/train" \
--fine_tune_batch_norm=false \
#--tf_initial_checkpoint="init/deeplab/model.ckpt" #i.e. no checkpoint
training starts smoothly.
What confuses me even more: if, in that same folder (which has already served as the train_logdir for the run without a checkpoint), I then start training with the checkpoint, training also starts without error:
# same command as the first code block
python "${WORK_DIR}"/train.py \
  --train_logdir="${EXP_DIR}/train" \
  --fine_tune_batch_norm=true \
  --tf_initial_checkpoint="init/deeplab/model.ckpt"
How can this happen? Does --train_logdir somehow store the shapes of the batch-normalization parameters from the last training run?
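My only guess is that the second run resumes from whatever the first run saved into the log directory rather than from --tf_initial_checkpoint. That can at least be checked (the literal path stands in for "${EXP_DIR}/train"):

import tensorflow as tf

# If this prints a checkpoint path, the trainer is resuming from
# train_logdir and --tf_initial_checkpoint is effectively ignored.
print(tf.train.latest_checkpoint('/path/to/exp/train'))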