
I am trying to continue training from a saved checkpoint using the Colab setup for gpt-2-simple at:

https://colab.research.google.com/drive/1SvQne5O_7hSdmPvUXl5UzPeG5A6csvRA#scrollTo=aeXshJM-Cuaf

But I just can't get it to work. Loading the saved checkpoint from my Google Drive works fine, and I can use it to generate text, but I can't continue training from that checkpoint. In gpt2.finetune() I am passing restore_from='latest' and overwrite=True, and I have tried both the same run_name and a different one, with and without overwrite=True. I have also tried restarting the runtime in between, as was suggested, but it doesn't help; I keep getting the following error:

"ValueError: Variable model/wpe already exists, disallowed. Did you mean to set reuse=True 
or reuse=tf.AUTO_REUSE in VarScope?"

I assume that I need to run gpt2.load_gpt2(sess, run_name='myRun') before continuing training, but whenever I have run this first, gpt2.finetune() throws this error.


2 Answers


You don't need to (and can't) run load_gpt2() before finetuning. You instead simply need to give run_name to finetune(). I agree that this is confusing as hell; I had the same trouble.

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              file_name,                      # path to the training text file
              model_name=model_name,          # base model, e.g. '124M'
              checkpoint_dir=checkpoint_dir,  # folder holding all runs, usually 'checkpoint'
              run_name=run_name,              # same run_name as the saved run
              steps=25,
              )

This will automatically grab the latest checkpoint from your checkpoint/<run_name> folder, load its weights, and continue training where it left off. You can confirm this by checking the step counter in the training log: it doesn't start again from 0. E.g., if you'd previously trained for 25 steps, it'll start at 26:

Training...

[26 | 7.48] loss=0.49 avg=0.49

Also note that to run finetuning multiple times (or to load another model) you normally have to restart the Python runtime. You can instead run this before each finetune command:

tf.reset_default_graph()  # on TensorFlow 2.x, use tf.compat.v1.reset_default_graph()
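
For example, a minimal sketch of two finetuning passes in the same runtime might look like this (assuming gpt-2-simple's TF 1.x-style session API; file_name and run_name are placeholders):

import tensorflow as tf
import gpt_2_simple as gpt2

# first finetuning pass
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, file_name, model_name='124M', run_name=run_name, steps=25)

# clear the default graph, then resume in the same runtime
tf.reset_default_graph()
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, file_name, model_name='124M', run_name=run_name, steps=25)

If I remember the API correctly, gpt-2-simple also ships a gpt2.reset_session(sess) helper that wraps this reset for you.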
  • agreed on all, but my training started from scratch, I just changed the runtime version to v3 for fine-tuning, while v2 already existed. Are you sure this is the standard way it runs? – Itachi Nov 25 '20 at 15:37
  • In case it's useful for anyone who stumbles across this - jameslol is correct, but - you need overwrite=True if you want it to delete what's already in the folder, and - checkpoint_dir should be the directory that holds all the checkpoints (usually 'checkpoint'), NOT the directory of the specific checkpoint you want to load from. It will infer the right checkpoint from run_name. So should look something like: gpt2.finetune(sess, file_name, model_name='124M', checkpoint_dir='checkpoint', run_name=run_name, steps=25, overwrite=True) – Gabriel Feb 08 '21 at 22:19

I've tried the following and it works fine:

import tensorflow as tf
import gpt_2_simple as gpt2

tf.reset_default_graph()
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              steps=n,                 # number of training steps
              dataset=file_name,
              model_name='model',
              print_every=z,
              run_name='run_name',
              restore_from='latest',   # resume from the latest checkpoint
              sample_every=x,
              save_every=y,
              )

You must pass the same run_name as the run whose training you want to resume, along with the hyperparameter restore_from='latest'.
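
Since the question runs in Colab with the checkpoint stored in Google Drive, the checkpoint also needs to be copied into the local checkpoint folder before finetuning and copied back afterwards. A minimal sketch using gpt-2-simple's Drive helpers (run name 'myRun' taken from the question; treat the exact helper names as assumptions if your version differs):

import gpt_2_simple as gpt2

gpt2.mount_gdrive()                                 # authorize Google Drive in Colab
gpt2.copy_checkpoint_from_gdrive(run_name='myRun')  # pull checkpoint/myRun from Drive

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              run_name='myRun',        # same run as the saved checkpoint
              restore_from='latest',   # resume instead of starting fresh
              steps=25)

gpt2.copy_checkpoint_to_gdrive(run_name='myRun')    # push the updated checkpoint back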