When training a model with the Hugging Face Trainer object, e.g. as in this example from https://www.kaggle.com/code/alvations/neural-plasticity-bert2bert-on-wmt14:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

import os
os.environ["WANDB_DISABLED"] = "true"

batch_size = 2

# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=2,  # set to 1000 for full training
    save_steps=16,    # set to 500 for full training
    eval_steps=4,     # set to 8000 for full training
    warmup_steps=1,   # set to 2000 for full training
    max_steps=16,     # delete for full training
    # overwrite_output_dir=True,
    save_total_limit=1,
    # fp16=True,
)


# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)

trainer.train()

When training finishes, it outputs:

TrainOutput(global_step=16, training_loss=10.065429925918579, metrics={'train_runtime': 541.4209, 'train_samples_per_second': 0.059, 'train_steps_per_second': 0.03, 'total_flos': 19637939109888.0, 'train_loss': 10.065429925918579, 'epoch': 0.03})

If we want to continue training with more steps, e.g. max_steps=16 for the previous trainer.train() run followed by another run with max_steps=160, do we do something like this?

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

import os
os.environ["WANDB_DISABLED"] = "true"

batch_size = 2

# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=2,  # set to 1000 for full training
    save_steps=16,    # set to 500 for full training
    eval_steps=4,     # set to 8000 for full training
    warmup_steps=1,   # set to 2000 for full training
    max_steps=16,     # delete for full training
    # overwrite_output_dir=True,
    save_total_limit=1,
    # fp16=True,
)


# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)

# First 16 steps.
trainer.train()


# set training arguments - these params are not really tuned, feel free to change
training_args_2 = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=2,  # set to 1000 for full training
    save_steps=16,    # set to 500 for full training
    eval_steps=4,     # set to 8000 for full training
    warmup_steps=1,   # set to 2000 for full training
    max_steps=160,     # delete for full training
    # overwrite_output_dir=True,
    save_total_limit=1,
    # fp16=True,
)


# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args_2,
    train_dataset=train_data,
    eval_dataset=val_data,
)

# Continue training for 160 steps
trainer.train()

If the above is not the canonical way to continue training a model, how should one continue training with the Hugging Face Trainer?


Edit:

With transformers version 4.29.1, trying @maciej-skorski's answer with Seq2SeqTrainer,

trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    resume_from_checkpoint=True
)

it throws an error:

TypeError: Seq2SeqTrainer.__init__() got an unexpected keyword argument 'resume_from_checkpoint'
– alvas

1 Answer

If your use case is about adjusting a somewhat-trained model, then it can be solved the same way as fine-tuning: you pass the current model state along with a new parameter config to the Trainer object in the PyTorch API. I would say this is canonical :-)

The code you proposed matches the general fine-tuning pattern from the Hugging Face docs:

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=...,
    eval_dataset=...,
)

You may also resume training from existing checkpoints:

trainer.train(resume_from_checkpoint=True)
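
For the setup in the question, the full two-run flow would look roughly like this. This is a sketch, assuming the first run saved a checkpoint such as ./checkpoint-16 under output_dir via save_steps=16:

# First run: max_steps=16; with save_steps=16 this writes ./checkpoint-16.
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,       # max_steps=16
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()

# Second run: same output_dir, larger max_steps.
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args_2,     # max_steps=160
    train_dataset=train_data,
    eval_dataset=val_data,
)

# resume_from_checkpoint=True loads the latest checkpoint found in
# output_dir and restores the model weights, optimizer, scheduler, and
# global step, so training continues from step 16 up to step 160
# rather than starting over.
trainer.train(resume_from_checkpoint=True)

Note that the second run's max_steps must be larger than the step count stored in the checkpoint, otherwise there is nothing left to train.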
– Maciej Skorski
  • Will the checkpoint steps in the training report restart from 0? – alvas May 14 '23 at 12:15
  • You can enable resuming from a checkpoint in the trainer, as explained in the docs https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train.resume_from_checkpoint and on SO https://stackoverflow.com/questions/72672281/does-huggingfaces-resume-from-checkpoint-work; when not enabled, it seems the checkpoints start fresh. – Maciej Skorski May 14 '23 at 12:26
  • @alvas let me know if you have any other issues (if you share an example with data I could possibly tell more), or we can accept and close? – Maciej Skorski May 14 '23 at 17:54
  • I think the Seq2SeqTrainer didn't inherit that argument/feature; see the edited question. – alvas May 15 '23 at 17:50
  • @alvas, my bad, this should be passed to the `train` method, not `__init__`; see the edited post. – Maciej Skorski May 15 '23 at 20:51
  • `resume_from_checkpoint` is indeed a `trainer.train()` argument. But it seems there's an additional step/condition that needs to be met when training the model. – alvas May 16 '23 at 09:15
  • When I tried loading https://huggingface.co/alvations/mt5-aym-lex-try3, I got `ValueError: No valid checkpoint found in output directory (mt5-aym-lex-try3)`. There should be something in the original training that states the model should be saved, maybe `save_strategy`? (See the sketch after these comments.) – alvas May 16 '23 at 09:17
  • Or maybe disabling `load_best_model_at_end` ? – alvas May 16 '23 at 09:19
  • I think I'm getting it now: the first training run needs to complete successfully so that the checkpoints are pushed to the hub, and the model is saved at the end of the specified max_steps or num_train_epochs; then the new training's Trainer object needs a max_steps / num_train_epochs larger than the old one. – alvas May 16 '23 at 10:45
  • Well, the scope of your question was how to continue from the model at hand or freshly loaded, and this is what we answered. If you'd like to resume from checkpoints, then they need to exist in the first place (hence "existing" in my answer). What are you trying to achieve? – Maciej Skorski May 16 '23 at 16:02
  • I'll assign you the bounty when the date expires if there are no better answers. But I think there's a need for clearer documentation on how to train a model and then fine-tune from saved checkpoints. E.g. when we train and the checkpoints are pushed to the hub but the machine gets killed before reaching the save step, resuming from a checkpoint may or may not work properly. – alvas May 17 '23 at 04:23
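
As a footnote to the ValueError discussed in the comments above: trainer.train() also accepts an explicit checkpoint path instead of True, which helps when the checkpoint does not live in the current output_dir. A minimal sketch, where the local path ./checkpoint-16 is illustrative:

# Resume from a specific checkpoint directory rather than auto-detecting
# the latest checkpoint in output_dir; "./checkpoint-16" is illustrative.
trainer.train(resume_from_checkpoint="./checkpoint-16")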