
I am loading a pre-trained model and then extracting only the trainable variables which I want to optimize (basically fine-tune) according to my custom loss. The problem is that the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used TensorBoard for visualization, but I don't know how to debug when there is no log info available. I put some basic print statements around it, but I didn't get any helpful information.

Just to give an idea, this is the relevant piece of code, in order:

# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
    model.build()
    theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts') if "SkipThoughts" in v.name]

# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=theta)  # theta is already a list; don't wrap it in another [ ]

# Training
sess.run([opt_F], feed_dict = {idx: idxTensor})

The model is from this repository. The problem is with training, i.e. the last step. I verified that the theta list is not empty; it has 26 elements in it, like:

    ...
    SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/beta:0
    SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/gamma:0
    SkipThoughts/logits/weights:0
    SkipThoughts/logits/biases:0
    SkipThoughts/decoder_post/gru_cell/gates/layer_norm/w_h/beta:0
    ...

Also, even after using tfdbg (tf.debug) the issue remains. Maybe it really takes a lot of time, or it is stuck waiting on some other process? So I also tried breaking down the

tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=theta)

step into

opt = tf.train.AdamOptimizer(learning_rate)
gvs = opt.compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
...
g = sess.run(gvs, feed_dict={idx: idxTensor})

so that I could check whether the gradients are computed in the first place; it got stuck at the same point. In addition, I tried computing the gradients with tf.gradients over just one of the variables, and even over a single dimension of it, but the issue still exists.

I am running this piece of code in an IPython notebook on an Azure cluster with one Tesla K80 GPU. The GPU usage stays the same throughout the execution and there is no out-of-memory error.

The kernel interrupt doesn't work, and the only way to stop it is to restart the notebook. Moreover, if I run this code as a plain Python file, I still have to kill the process explicitly. In either case, I don't get a stack trace to know the exact place where it is stuck! How should one debug such an issue?
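For the missing stack trace specifically, the Python side can be made to reveal where it is blocked without touching TensorFlow at all. This is a stdlib-only sketch using `faulthandler` (the external tool `py-spy dump --pid <pid>` gives similar information without modifying the code):

```python
import faulthandler
import signal
import sys

# Dump the Python stack of every thread to stderr whenever the process
# receives SIGUSR1, so a hung run can be inspected from another shell
# with:  kill -USR1 <pid>
# (faulthandler.register is not available on Windows.)
faulthandler.register(signal.SIGUSR1)

# Alternatively, dump all thread stacks right now, e.g. from a notebook
# cell before and after the suspect sess.run call:
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

If the hang is inside a C extension (such as TensorFlow's C++ runtime), the dump shows the last Python frame that entered it, which is still enough to identify which `sess.run` call is stuck.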

Any help and pointers in this regard would be much appreciated.

Vishal
