
Following this Medium post, I understand how to save and load my model (or at least I think I do). They say the learning_rate is saved. However, looking at this person's code (it's a GitHub repo with lots of people watching and forking it, so I'm assuming it shouldn't be full of mistakes), the author writes:

def load_checkpoint(checkpoint_file, model, optimizer, lr):
    print("=> Loading checkpoint")
    checkpoint = torch.load(checkpoint_file, map_location=config.DEVICE)
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])

    # If we don't do this then it will just have learning rate of old checkpoint
    # and it will lead to many hours of debugging \:
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

Why doesn't optimizer.load_state_dict(checkpoint["optimizer"]) give the learning rate of the old checkpoint? Or if it does (and I believe it does), why do they say it's a problem that "if we don't do this then it will just have learning rate of old checkpoint and it will lead to many hours of debugging"?

There is no learning rate decay anyway in the code. So should it even matter?

1 Answer


Why doesn't optimizer.load_state_dict(checkpoint["optimizer"]) give the learning rate of the old checkpoint?

In PyTorch, the learning rate is a constant variable in the optimizer object (it is stored in each entry of optimizer.param_groups), and it can be adjusted via torch.optim.lr_scheduler.

If you want to resume training from the point where it stopped last time, the scheduler keeps all the information about the optimizer that you need to continue: the strategy for adjusting the learning rate, the last epoch, the step index the model was on, and the last learning rate (which should be the same as the optimizer's learning rate). With that restored, your model can keep training just as if it had never stopped.

>>> import torch
>>> model = torch.nn.Linear(5, 1, bias=False)
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0)
>>> scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
>>> print(scheduler.state_dict())
{'step_size': 1, 'gamma': 0.1, 'base_lrs': [0.1], 'last_epoch': 0, '_step_count': 1, 'verbose': False, '_get_lr_called_within_step': False, '_last_lr': [0.1]}
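
If you want to resume exactly where training stopped, save the scheduler state next to the model and optimizer state and restore all three. A minimal sketch, continuing the snippet above (the checkpoint keys and the filename here are just examples, not taken from the repo in the question):

# saving
checkpoint = {
    "state_dict": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}
torch.save(checkpoint, "checkpoint.pth.tar")

# resuming
checkpoint = torch.load("checkpoint.pth.tar")
model.load_state_dict(checkpoint["state_dict"])
optimizer.load_state_dict(checkpoint["optimizer"])  # restores the lr stored in param_groups
scheduler.load_state_dict(checkpoint["scheduler"])  # restores last_epoch, _last_lr, step count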

Why do they say it's a problem that "if we don't do this then it will just have learning rate of old checkpoint and it will lead to many hours of debugging"?

Normally, if you never touch the learning rate, the value restored from the checkpoint is the same as the initial one. My guess is that they changed it in some of their other projects and just want to make sure the learning rate has the value they expect this time.
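
To make the effect of those two lines concrete, here is a small self-contained sketch (the numbers are made up):

import torch

model = torch.nn.Linear(5, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# pretend the checkpoint was saved while the learning rate had already decayed to 1e-5
checkpoint = {"optimizer": optimizer.state_dict()}
checkpoint["optimizer"]["param_groups"][0]["lr"] = 1e-5

optimizer.load_state_dict(checkpoint["optimizer"])
print(optimizer.param_groups[0]["lr"])  # 1e-05, the old checkpoint's learning rate

# the loop in load_checkpoint simply overwrites it with whatever lr you pass in
lr = 0.1
for param_group in optimizer.param_groups:
    param_group["lr"] = lr
print(optimizer.param_groups[0]["lr"])  # 0.1 again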

The code you provided is from CycleGAN, but I also found it in ESRGAN, Pix2Pix, ProGAN, SRGAN, etc., so I think they reused the same utils across multiple projects.

There is no learning rate decay anyway in the code. So should it even matter?

I found no learning rate scheduler in the CycleGAN code, so I believe it doesn't matter if you remove those lines, but only in this case.
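
To see why it would matter if a scheduler were involved: continuing the console snippet above, stepping the scheduler writes the decayed value into the optimizer itself, and that decayed value is what ends up in the checkpoint (the printed values are approximate):

scheduler.step()  # with StepLR(step_size=1, gamma=0.1) this decays the lr once
print(optimizer.param_groups[0]["lr"])                  # ~0.01
print(optimizer.state_dict()["param_groups"][0]["lr"])  # ~0.01, this is what gets saved and later restored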

– CuCaRot
  • So if I use the scheduler with `scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)` for example, how would I go about getting the right `lr` value if I stopped at, let's say, epoch 3, saved my model and re-loaded it? If I understand correctly, you say that `the learning rate is a constant variable in the optimizer object`, so I need to get `scheduler.state_dict()` because `optimizer.load_state_dict(checkpoint["optimizer"])` is going to give me the initial (i.e. before epoch 1) learning rate? – FluidMechanics Potential Flows Apr 04 '22 at 09:05
  • I just tested it and it seems that the optimizer object stores the `last_lr` (so if the `lr` decreased from `0.1` to `0.01`, I get `0.01` when I print `optimizer.param_groups[0]['lr']`), which is what you were saying in your initial post, sorry. But I still can't get my head around why one would not want this behaviour? (because with those lines (`param_group["lr"] = lr`), it "resets" the learning rate to the initial one) – FluidMechanics Potential Flows Apr 04 '22 at 09:32
  • the `last_lr` should be stored in the scheduler, not the optimizer; the learning rate is reset just to make sure it has the value they want. As I said, I think they took the `load` function from another project which had a problem with the learning rate. – CuCaRot Apr 05 '22 at 01:51
  • But in my code the optimizer stores it and the scheduler changes it; isn't that the normal behaviour? – FluidMechanics Potential Flows Apr 05 '22 at 15:44
  • it's normal; I mean the `last_lr` should be stored in the scheduler, the optimizer `state_dict` should only have `lr`, and they should be equal while training. – CuCaRot Apr 06 '22 at 01:43
  • What would be the point of having `last_lr` different from `lr`? Sorry if it is a naive question, but I can't get my head around why one would want two variables there. – FluidMechanics Potential Flows Apr 06 '22 at 12:17
  • I think it could be there for more complex setups, for example, you want to train for 10 epochs where the first 5 are scheduled with `StepLR` and the later ones with `CosineLR`; then the `last_lr` and the `lr` are not the same anymore. – CuCaRot Apr 06 '22 at 15:00
  • I'm not sure I understand why they wouldn't be the same. Both are "refreshed", aren't they? It's just that `last_lr` is refreshed after the epoch and `lr` before the epoch? – FluidMechanics Potential Flows Apr 06 '22 at 15:08
  • I've read some issues and pull requests on the PyTorch GitHub; it looks like they just requested `get_last_lr` for unclear purposes. I admit that I also don't know in what case we would need to use them separately. – CuCaRot Apr 06 '22 at 15:14
  • makes sense. So I guess at a fairly amateur level, storing `lr` in the optimizer is enough? It seems that the scheduler changes its value, so even if I save my model and load it, it loads the right `lr` value (e.g. if it did decrease, it stays decreased) – FluidMechanics Potential Flows Apr 06 '22 at 15:17
  • yes, I think so – CuCaRot Apr 06 '22 at 15:22