So I am basically using this Transformer implementation for my project: https://github.com/Kyubyong/transformer . It works great on the German-to-English translation it was originally written for, and I modified the preprocessing Python script to create vocabulary files for the languages that I want to translate. That part seems to work fine.
However, when it comes to training, I get the following error:
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [9796,512] rhs shape= [9786,512] [[{{node save/Assign_412}} = Assign[T=DT_FLOAT, _class=["loc:@encoder/enc_embed/lookup_table"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](encoder/enc_embed/lookup_table/Adam_1, save/RestoreV2:412)]]
Now I have no idea why I am getting the above error. I also reverted to the original German-to-English code and now I get the same error there too (except with different lhs and rhs tensor shapes, of course), when before it was working!
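As far as I can tell from the message, the variable being restored is the encoder's embedding lookup table, whose first dimension is the vocabulary size. Here is a minimal NumPy sketch (not the repo's actual code; the names `restore`, `saved_table`, and `new_table` are just illustrative) of what the Assign op seems to be complaining about:

```python
import numpy as np

hidden_units = 512

# rhs: the table as it was saved in the checkpoint (old vocab size)
saved_table = np.zeros((9786, hidden_units))
# lhs: the table in the freshly built graph (new vocab size)
new_table = np.zeros((9796, hidden_units))

def restore(dst, src):
    # Mimics TensorFlow's Assign op with validate_shape=True:
    # the destination variable and the restored value must match exactly.
    if dst.shape != src.shape:
        raise ValueError(
            "Assign requires shapes of both tensors to match. "
            "lhs shape= %s rhs shape= %s" % (list(dst.shape), list(src.shape)))
    dst[...] = src

try:
    restore(new_table, saved_table)
except ValueError as e:
    print(e)  # same kind of mismatch as the InvalidArgumentError above
```

So if the vocabulary changed between runs, a checkpoint from the old run could not be restored into the new graph. That would explain my error, but not why the original code also fails now.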
Any ideas on why this could be happening?
Thanks in advance
EDIT: The specific file in question is train.py, which is what is being run: https://github.com/Kyubyong/transformer/blob/master/train.py . Nothing has been modified other than the vocab files loaded for de and en being different (they are in fact vocab files with single letters as words). However, as I mentioned, even when reverting to the previous working example I get the same error, just with different lhs and rhs dimensions.