Keras model params are all "NaN"s after reloading

Question

I use transfer learning with ResNet50. I build a new model on top of the pretrained model provided by Keras (with the 'imagenet' weights).

After training my new model, I save it as follows:

# Save the Siamese Network architecture
siamese_model_json = siamese_network.to_json()
with open("saved_model/siamese_network_arch.json", "w") as json_file:
    json_file.write(siamese_model_json)
# save the Siamese Network model weights
siamese_network.save_weights('saved_model/siamese_model_weights.h5')

And later, I reload it as follows to make some predictions:

from keras.models import model_from_json

# Load the Siamese Network architecture
with open('saved_model/siamese_network_arch.json', 'r') as json_file:
    loaded_model_json = json_file.read()
siamese_network = model_from_json(loaded_model_json)
# load weights into new model
siamese_network.load_weights('saved_model/siamese_model_weights.h5')

Then I check whether the weights look reasonable as follows (for one of the layers):

print("bn3d_branch2c:\n",
      siamese_network.get_layer('model_1').get_layer('bn3d_branch2c').get_weights())

If I train my network for only 1 epoch, I see reasonable values there.

But if I train my model for 18 epochs (which takes 5-6 hours, as I have a very slow computer), I see only NaN values, as follows:

bn3d_branch2c:
 [array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       ...
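
Rather than inspecting one layer by eye, the same check can be run over every layer. The following is only a minimal sketch in plain Python: the helper names contains_nan and layers_with_nan and the example dictionary are hypothetical, and it assumes the weights have already been converted to nested lists (for example with .tolist() on each array returned by get_weights()).

```python
import math

def contains_nan(values):
    """Return True if any float inside a (possibly nested) list or tuple is NaN."""
    if isinstance(values, (list, tuple)):
        return any(contains_nan(v) for v in values)
    return isinstance(values, float) and math.isnan(values)

def layers_with_nan(weights_by_layer):
    """Given {layer_name: nested list of weights}, return the sorted names of layers holding a NaN."""
    return sorted(name for name, w in weights_by_layer.items() if contains_nan(w))

# Hypothetical stand-in data. With Keras, a dict like this could be built as
# {layer.name: [w.tolist() for w in layer.get_weights()] for layer in model.layers}.
example = {
    "conv1": [[0.1, -0.2], [0.3, 0.4]],
    "bn3d_branch2c": [[float("nan"), 1.0], [0.0, 0.0]],
}
print(layers_with_nan(example))
```

Running such a check right after training, before saving, would show whether the NaNs are already present in memory or only appear after reloading.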

What is the trick here?

ADDENDUM 1:

Here is how I create my model.

Here, I have a triplet_loss function that I will need later on.

def triplet_loss(inputs, dist='euclidean', margin='maxplus'):
    anchor, positive, negative = inputs
    positive_distance = K.square(anchor - positive)
    negative_distance = K.square(anchor - negative)
    if dist == 'euclidean':
        positive_distance = K.sqrt(K.sum(positive_distance, axis=-1, keepdims=True))
        negative_distance = K.sqrt(K.sum(negative_distance, axis=-1, keepdims=True))
    elif dist == 'sqeuclidean':
        positive_distance = K.sum(positive_distance, axis=-1, keepdims=True)
        negative_distance = K.sum(negative_distance, axis=-1, keepdims=True)
    loss = positive_distance - negative_distance
    if margin == 'maxplus':
        loss = K.maximum(0.0, 2 + loss)
    elif margin == 'softplus':
        loss = K.log(1 + K.exp(loss))

    returned_loss = K.mean(loss)
    return returned_loss

And here is how I construct my model from start to end. I give the complete code to give the exact picture.

model = ResNet50(weights='imagenet')

# Remove the last layer (Needed to later be able to create the Siamese Network model)
model.layers.pop()

# First freeze all layers of ResNet50. Transfer Learning to be applied.
for layer in model.layers:
    layer.trainable = False

# All Batch Normalization layers still need to be trainable so that the "mean"
# and "standard deviation (std)" params can be updated with the new training data
model.get_layer('bn_conv1').trainable = True
model.get_layer('bn2a_branch2a').trainable = True
model.get_layer('bn2a_branch2b').trainable = True
model.get_layer('bn2a_branch2c').trainable = True
model.get_layer('bn2a_branch1').trainable = True
model.get_layer('bn2b_branch2a').trainable = True
model.get_layer('bn2b_branch2b').trainable = True
model.get_layer('bn2b_branch2c').trainable = True
model.get_layer('bn2c_branch2a').trainable = True
model.get_layer('bn2c_branch2b').trainable = True
model.get_layer('bn2c_branch2c').trainable = True
model.get_layer('bn3a_branch2a').trainable = True
model.get_layer('bn3a_branch2b').trainable = True
model.get_layer('bn3a_branch2c').trainable = True
model.get_layer('bn3a_branch1').trainable = True
model.get_layer('bn3b_branch2a').trainable = True
model.get_layer('bn3b_branch2b').trainable = True
model.get_layer('bn3b_branch2c').trainable = True
model.get_layer('bn3c_branch2a').trainable = True
model.get_layer('bn3c_branch2b').trainable = True
model.get_layer('bn3c_branch2c').trainable = True
model.get_layer('bn3d_branch2a').trainable = True
model.get_layer('bn3d_branch2b').trainable = True
model.get_layer('bn3d_branch2c').trainable = True
model.get_layer('bn4a_branch2a').trainable = True
model.get_layer('bn4a_branch2b').trainable = True
model.get_layer('bn4a_branch2c').trainable = True
model.get_layer('bn4a_branch1').trainable = True
model.get_layer('bn4b_branch2a').trainable = True
model.get_layer('bn4b_branch2b').trainable = True
model.get_layer('bn4b_branch2c').trainable = True
model.get_layer('bn4c_branch2a').trainable = True
model.get_layer('bn4c_branch2b').trainable = True
model.get_layer('bn4c_branch2c').trainable = True
model.get_layer('bn4d_branch2a').trainable = True
model.get_layer('bn4d_branch2b').trainable = True
model.get_layer('bn4d_branch2c').trainable = True
model.get_layer('bn4e_branch2a').trainable = True
model.get_layer('bn4e_branch2b').trainable = True
model.get_layer('bn4e_branch2c').trainable = True
model.get_layer('bn4f_branch2a').trainable = True
model.get_layer('bn4f_branch2b').trainable = True
model.get_layer('bn4f_branch2c').trainable = True
model.get_layer('bn5a_branch2a').trainable = True
model.get_layer('bn5a_branch2b').trainable = True
model.get_layer('bn5a_branch2c').trainable = True
model.get_layer('bn5a_branch1').trainable = True
model.get_layer('bn5b_branch2a').trainable = True
model.get_layer('bn5b_branch2b').trainable = True
model.get_layer('bn5b_branch2c').trainable = True
model.get_layer('bn5c_branch2a').trainable = True
model.get_layer('bn5c_branch2b').trainable = True
model.get_layer('bn5c_branch2c').trainable = True

# Used when compiling the siamese network
def identity_loss(y_true, y_pred):
    return K.mean(y_pred - 0 * y_true)  

# Create the siamese network

x = model.get_layer('flatten_1').output # layer 'flatten_1' is the last layer of the model
model_out = Dense(128, activation='relu',  name='model_out')(x)
model_out = Lambda(lambda  x: K.l2_normalize(x,axis=-1))(model_out)

new_model = Model(inputs=model.input, outputs=model_out)

anchor_input = Input(shape=(224, 224, 3), name='anchor_input')
pos_input = Input(shape=(224, 224, 3), name='pos_input')
neg_input = Input(shape=(224, 224, 3), name='neg_input')

encoding_anchor   = new_model(anchor_input)
encoding_pos      = new_model(pos_input)
encoding_neg      = new_model(neg_input)

loss = Lambda(triplet_loss)([encoding_anchor, encoding_pos, encoding_neg])

siamese_network = Model(inputs  = [anchor_input, pos_input, neg_input], 
                        outputs = loss) # Note that the output of the model is the 
                                        # return value from the triplet_loss function above

siamese_network.compile(optimizer=Adam(lr=.0001), loss=identity_loss)

One thing to notice is that I make all batch normalization layers "trainable" so that the BN-related parameters can be updated with my training data. This takes a lot of lines, but I could not find a shorter solution.
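
A loop over model.layers may be a shorter way to express the same thing. This is a sketch under assumptions: the helper name unfreeze_only_batchnorm is hypothetical, and it relies on the batch normalization layers having names that start with "bn", as in the listing above (other Keras versions name these layers differently). The stand-in objects only exist so that the sketch runs without Keras.

```python
from types import SimpleNamespace

def unfreeze_only_batchnorm(model, prefix="bn"):
    """Freeze every layer, then re-enable training for layers whose name starts with `prefix`."""
    unfrozen = []
    for layer in model.layers:
        layer.trainable = layer.name.startswith(prefix)
        if layer.trainable:
            unfrozen.append(layer.name)
    return unfrozen

# Stand-in objects so the sketch runs without Keras; a real model exposes
# the same `.layers`, `.name` and `.trainable` attributes.
fake_model = SimpleNamespace(layers=[
    SimpleNamespace(name="conv1", trainable=True),
    SimpleNamespace(name="bn_conv1", trainable=False),
    SimpleNamespace(name="res2a_branch2a", trainable=True),
    SimpleNamespace(name="bn2a_branch2a", trainable=False),
])
print(unfreeze_only_batchnorm(fake_model))
```

If the layer names are not reliable, testing isinstance(layer, BatchNormalization) (with BatchNormalization imported from keras.layers) instead of the name prefix is probably the more robust variant.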

Have you tried the json library? json.dumps and json.load let you write and read JSON to and from a file. I don't think that 'write' preserves the JSON data structure; it sees it as a string. — Neil, Jul 11 '18 at 19:28
I can try it. But it will take hours to see the result. Do you think you can suggest a more concrete solution? How should it really look when using json.dumps/load? — edn, Jul 11 '18 at 19:40
Try this: `import json`, `siamese_model_json = siamese_network.to_json()`, `json_string = json.dumps(siamese_model_json)`, then `with open("saved_model/siamese_network_arch.json", "w") as json_file: json_file.write(json_string)`. Then later: `import json`, `json_file = open('saved_model/siamese_network_arch.json', 'r')`, `loaded_model_json = json.loads(json_file.read())`. Alternatively, it looks like Keras has some functionality to just save the whole thing? https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model — Neil, Jul 11 '18 at 20:12
Sorry, I can't offer more help than that; it's been a while since I've used Keras. But if your weights are just an array, then I wouldn't debug by retraining the whole model. Just make a dummy array of the same length and see whether it loads. — Neil, Jul 11 '18 at 20:14
Have you looked into other ways of saving the model like `.save()` method? Or do you really need it to be in JSON format? And plus, what is the loss value during training when you train it for 18 epochs? Does it decrease? Or at some point during training, does it become `NaN` as well? — today, Jul 11 '18 at 21:00
Further, how do you do the transfer learning? Are you just extracting features from some of the layers and feeding them to another classifier, or are you tuning some of the layers of ResNet50 as well? — today, Jul 11 '18 at 21:04
Yes, the loss decreases consistently. I run the evaluate function on the dev set and even that shows good results. I guess I am doing the transfer learning part in the right way. In my face recognition project, I train it for 8 unique persons and I have in total 185 samples in my train set (each has 1 anchor img, 1 positive img, and 1 negative img). After 18 epochs, the loss decreases as desired. I have not tried the save method; it is one thing that I will try, apart from @Neil's suggestion. But again, if I try it after only 1 epoch, it goes smoothly. — edn, Jul 11 '18 at 21:12
I found a similar problem here: ( https://github.com/keras-team/keras/issues/2378 ). The problem there seems to be related to the learning rate, which is 0.0001 in my case, and I use the Adam optimizer. Have you ever seen this? I don't know if the original ResNet50 model (for imagenet) from Keras was really trained with Adam or another optimizer. — edn, Jul 11 '18 at 22:28
@edn Please mention the user you are replying to at the beginning of your comment (by @person_user_name) to notify him/her. I was not notified. When learning rate is high the training process may diverge and therefore produces NaN loss and values. I don't know the learning rate used for ResNet50 so I can't comment on that. Still, I have doubts about whether you are doing it right or not. For fine-tuning the network you must use a very small learning rate like `1e-4` or `3e-5` or `1e-5` depending on the optimizer you use. And I still advise you to use `model.save()` and `load_model()` methods. — today, Jul 11 '18 at 22:57
Very strange, because when I wrote @today as the first word, it did not allow me to do so, saying that you would be notified anyway. Anyhow, I added above how I create my network. I spent quite a lot of time figuring out how to implement it in Keras. If you have experience with triplet_loss-based models, your feedback would be worth gold to me. I am also trying to train my network with different learning rates now and will also try the option with model.save() and load_model(). — edn, Jul 11 '18 at 23:12
@edn Unfortunately, I have not worked with triplet loss models before. But I will read your code (and maybe about triplet loss) and if anything comes to my mind I will let you know. In the meantime I hope others who are more familiar with this could help you. And if the issue was resolved or you tried something new, please let us know by editing your post. — today, Jul 11 '18 at 23:42
Thank you for your comments, @today. I am trying different things now and will update the post accordingly. It has been difficult to find help for triplet loss so far, but I guess the problem is more related to the learning rate in my case. Looking for evidence now. (Andrew Ng has great videos on triplet loss, in case you want to check.) — edn, Jul 11 '18 at 23:51
If you have runaway or exploding weights, check out this post: https://stackoverflow.com/questions/42264567/keras-ml-library-how-to-do-weight-clipping-after-gradient-updates-tensorflow-b — swiftg, Jul 12 '18 at 00:15
@Gurmeet Singh, these are great tips! I just retrained my network once again for 7 epochs with the learning rate = 0.0005 (which is higher). And I see the same problem. The model seems to be saved as expected. But when I later check my weights, I see that the weights of the untrainable layers are there, while for all trainable layers the weights are NaN. I have ~54 trainable layers and ~125 untrainable layers. If I go for weight clipping, it will be a big thing to fix. I am now planning to go for the clipnorm option to normalize gradients and will also set a smaller learning rate, which hopefully will help. — edn, Jul 12 '18 at 00:48

score 0 · Accepted Answer · answered Jul 12 '18 at 04:50

The solution is inspired by @Gurmeet Singh's recommendation above.

Apparently, the weights of the trainable layers grew so large during training that they eventually all became NaN. This made me think that I was saving and reloading my model in the wrong way, but the actual problem was exploding gradients.
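
How a value that merely grows too large ends up as NaN can be shown with plain floating-point arithmetic. This is only a toy illustration, not a reproduction of what happened inside the network: the growth factor is a hypothetical, deliberately absurd number, and Python floats are 64-bit, whereas Keras typically uses 32-bit floats, which overflow much sooner.

```python
import math

weight = 1.0
growth_per_step = 1e60  # hypothetical growth factor, chosen only to overflow quickly

for step in range(8):
    weight = weight * growth_per_step  # the value blows up ...

overflowed = weight                    # ... and ends up as inf
poisoned = overflowed - overflowed     # inf - inf is NaN
still_nan = poisoned * 0.0 + 1.0       # NaN survives any further arithmetic

print(overflowed, poisoned, still_nan)
```

Once a single NaN enters a weight update, every value computed from it is NaN as well, which would be consistent with all trainable layers being affected while the frozen ones keep their values.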

I saw a similar issue in a GitHub discussion too, which can be checked out here: github.com/keras-team/keras/issues/2378. At the bottom of that thread, it is recommended to use lower learning rates to avoid the problem.

In this link (Keras ML library: how to do weight clipping after gradient updates? TensorFlow backend), 2 solutions are discussed:
- Using the clipvalue parameter in the optimizer, which simply cuts off each calculated gradient value at the configured limit. But this is not the recommended solution (as explained in the other thread).
- Using the clipnorm parameter, which scales the calculated gradients down when their L2 norm exceeds the value given by the user.
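
The difference between the two options can be sketched in plain Python on a single gradient vector. The function names clip_by_value and clip_by_norm are hypothetical, and this is not the Keras implementation; in particular, whether Keras computes the norm per tensor or across all gradients depends on the version.

```python
import math

def clip_by_value(grad, clip_value):
    """Clamp every component to [-clip_value, clip_value]; the direction may change."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector so its L2 norm is at most clip_norm; the direction is kept."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [30.0, 40.0]                  # L2 norm is 50
by_value = clip_by_value(grad, 1.0)  # [1.0, 1.0]: the 3:4 ratio is lost
by_norm = clip_by_norm(grad, 1.0)    # approximately [0.6, 0.8]: the 3:4 ratio is kept
print(by_value, by_norm)
```

Clipping by value can change the direction of the update, while clipping by norm only shortens it, which is presumably why the linked thread prefers clipnorm.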

I also thought about using input normalization (to avoid exploding gradients) but then figured out that it is already done in the preprocess_input(..) function. (Check this link for details: https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/preprocess_input) It is, however, possible to set the mode parameter to "tf" (the default is "caffe"), which might help further (because mode="tf" scales pixels to between -1 and 1), but I did not try it.
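
To make the difference between the two modes concrete, here is a sketch of what each one does to a single RGB pixel, written in plain Python. The function names preprocess_tf and preprocess_caffe are hypothetical; the real preprocess_input works on whole image arrays, and whether it accepts a mode argument depends on the Keras version.

```python
# Per-channel ImageNet means in BGR order, as used by the "caffe" mode.
IMAGENET_BGR_MEAN = (103.939, 116.779, 123.68)

def preprocess_tf(rgb_pixel):
    """'tf'-style: scale each 0..255 channel to the range [-1, 1]."""
    return tuple(c / 127.5 - 1.0 for c in rgb_pixel)

def preprocess_caffe(rgb_pixel):
    """'caffe'-style: reorder RGB to BGR and subtract the channel mean, without scaling."""
    r, g, b = rgb_pixel
    return tuple(c - m for c, m in zip((b, g, r), IMAGENET_BGR_MEAN))

white = (255.0, 255.0, 255.0)
print(preprocess_tf(white))     # every channel becomes 1.0
print(preprocess_caffe(white))  # roughly (151.06, 138.22, 131.32)
```

So in "caffe" mode the inputs are zero-centered but still span a range of a few hundred, whereas in "tf" mode they stay within -1 and 1.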

In summary, I changed 2 things when compiling the model to be trained:

1) Used a smaller learning rate to make the gradient updates a bit smaller.
2) Used the clipnorm parameter to scale the calculated gradients down when they get too large.

Both changes are in the compile call.

Before the change:

siamese_network.compile(optimizer=Adam(lr=.0001),
                        loss=identity_loss)

After the change:

siamese_network.compile(optimizer=Adam(lr=.00004, clipnorm=1.),
                        loss=identity_loss)

Then I trained my network again for 10 epochs. The loss decreases as desired, but more slowly now. And I no longer experience any problems when saving and reloading my model, at least after 10 epochs (training takes a long time on my computer).

Note that I set the value of clipnorm to 1. This means that the L2 norm of the gradients is calculated first, and if it exceeds 1, the gradients are scaled down so that their norm equals 1. I assume this is a hyperparameter that can be tuned; it affects the time needed to train the model while helping to avoid the exploding gradients problem.
