
I want to train on CIFAR-10, say for 200 epochs. This is my optimizer:

optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001)

I want to use OneCycleLR as the scheduler. Now, according to the documentation, these are the parameters of OneCycleLR:

torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, total_steps=None, epochs=None, steps_per_epoch=None, pct_start=0.3, anneal_strategy='cos', cycle_momentum=True, base_momentum=0.85, max_momentum=0.95, div_factor=25.0, final_div_factor=10000.0, three_phase=False, last_epoch=- 1, verbose=False)

I have seen that the most commonly used parameters are max_lr, epochs and steps_per_epoch. The documentation says this:

  • max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group.
  • epochs (int) – The number of epochs to train for. This is used along with steps_per_epoch in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None
  • steps_per_epoch (int) – The number of steps per epoch to train for. This is used along with epochs in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None

About steps_per_epoch, I have seen in many GitHub repos that it is used as steps_per_epoch=len(data_loader), so if I have a batch size of 128, then this parameter is equal to 128. However, I do not understand the other two parameters. If I want to train for 200 epochs, is epochs=200? Or is this a parameter that runs the scheduler for one epoch only and then restarts it? For example, if I write epochs=10 inside the scheduler but train for 200 epochs in total, is that like 20 complete cycles of the scheduler?

Then there is max_lr: I have seen people using a value greater than the lr of the optimizer and other people using a smaller value. I think that max_lr must be greater than the lr (otherwise why would it be called max? :D). However, if I print the learning rate epoch by epoch, it takes strange values. For example, in this setting:

optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001)

scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr = 0.01, epochs=200, steps_per_epoch=128)

And this is the learning rate:

Epoch 1: TrL=1.7557, TrA=0.3846, VL=1.4136, VA=0.4917, TeL=1.4266, TeA=0.4852, LR=0.0004,
Epoch 2: TrL=1.3414, TrA=0.5123, VL=1.2347, VA=0.5615, TeL=1.2231, TeA=0.5614, LR=0.0004,
...
Epoch 118: TrL=0.0972, TrA=0.9655, VL=0.8445, VA=0.8161, TeL=0.8764, TeA=0.8081, LR=0.0005,
Epoch 119: TrL=0.0939, TrA=0.9677, VL=0.8443, VA=0.8166, TeL=0.9094, TeA=0.8128, LR=0.0005,

So the lr is increasing, but very slowly.

1 Answer

The documentation says that you should give either total_steps or both epochs and steps_per_epoch as arguments. The simple relation between them is total_steps = epochs * steps_per_epoch.

And total_steps is the total number of steps in the cycle; the "OneCycle" in the name means there is only one cycle over the whole training run.
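
To make the relation concrete, here is a minimal sketch (the EPOCHS and STEPS_PER_EPOCH values and the stand-in linear model are just for illustration) showing that epochs plus steps_per_epoch is simply another way of specifying total_steps:

import torch

EPOCHS = 200
STEPS_PER_EPOCH = 312  # in practice, len(train_loader)

model = torch.nn.Linear(10, 10)  # stand-in model
opt_a = torch.optim.Adam(model.parameters(), lr=0.001)
opt_b = torch.optim.Adam(model.parameters(), lr=0.001)

# Two equivalent ways to define the same single cycle
sched_a = torch.optim.lr_scheduler.OneCycleLR(
    opt_a, max_lr=0.01, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH)
sched_b = torch.optim.lr_scheduler.OneCycleLR(
    opt_b, max_lr=0.01, total_steps=EPOCHS * STEPS_PER_EPOCH)

print(sched_a.total_steps, sched_b.total_steps)  # both 62400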

max_lr is the maximum learning rate of OneCycleLR. To be exact, the learning rate will increase from max_lr / div_factor to max_lr in the first pct_start * total_steps steps, and then decrease smoothly to (max_lr / div_factor) / final_div_factor.
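
With the defaults div_factor=25 and final_div_factor=1e4 from the signature above, max_lr=0.01 means the cycle starts at 0.01 / 25 = 0.0004, which is exactly the LR=0.0004 printed at epoch 1 in the question. A small sketch of the arithmetic:

max_lr = 0.01
div_factor = 25.0        # default
final_div_factor = 1e4   # default

initial_lr = max_lr / div_factor        # 0.0004, the LR at the very first step
min_lr = initial_lr / final_div_factor  # 4e-08, the LR at the very end of the cycle
print(initial_lr, min_lr)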


Edit: For those who are not familiar with lr_scheduler, you can plot the learning rate curve, e.g.

import torch
import matplotlib.pyplot as plt

EPOCHS = 10
BATCHES = 10
steps = []
lrs = []
model = torch.nn.Linear(1, 1)  # stand-in for your model instance
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # wrapped optimizer
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.9,
                                                total_steps=EPOCHS * BATCHES)
for epoch in range(EPOCHS):
    for batch in range(BATCHES):
        optimizer.step()    # in real training this follows loss.backward()
        scheduler.step()    # OneCycleLR is stepped once per batch, not once per epoch
        lrs.append(scheduler.get_last_lr()[0])
        steps.append(epoch * BATCHES + batch)

plt.figure()
plt.plot(steps, lrs, label='OneCycle')
plt.legend()
plt.show()
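
For completeness, this is how the scheduler is usually driven in a real training loop: scheduler.step() is called once per batch right after optimizer.step(), and steps_per_epoch comes from len(train_loader), i.e. the number of batches per epoch, not the batch size. A minimal runnable sketch with a stand-in model and random data in place of CIFAR-10:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins so the loop runs on its own; replace with your CIFAR-10 model and DataLoader
model = torch.nn.Linear(32 * 32 * 3, 10)
dataset = TensorDataset(torch.randn(1024, 32 * 32 * 3), torch.randint(0, 10, (1024,)))
train_loader = DataLoader(dataset, batch_size=128)
criterion = torch.nn.CrossEntropyLoss()

EPOCHS = 200
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=EPOCHS, steps_per_epoch=len(train_loader))

for epoch in range(EPOCHS):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # once per batch: the cycle spans EPOCHS * len(train_loader) steps
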
– w568w
  • So `scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr = 0.01, epochs=200, steps_per_epoch=128)` basically means one cycle will be completed after 200*128 = 25600 epochs? That is why the learning rate is still increasing, because 200 epochs are not 25600 :D – CasellaJr Aug 24 '22 at 11:11
  • @CasellaJr Basically right. But not 25600 epochs, just 25600 steps. A *step* means a forward & backward, and an epoch can have a lot of steps, i.e. `step_in_a_epoch = your_data_set_size / batch_size`. – w568w Aug 24 '22 at 11:17
  • My batch_size is 128 and I split CIFAR-10 in 40k training, 10k val, 10k test. So, my step is 40k/128 = 312.5 – CasellaJr Aug 24 '22 at 11:20
  • I need to do something like this: `scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr = 0.01, epochs=1, steps_per_epoch=312)`? – CasellaJr Aug 24 '22 at 11:20
  • In this case 312 > 200, so what should I do? – CasellaJr Aug 24 '22 at 11:31
  • @CasellaJr Do you want the learning rate to follow the same variation pattern in each epoch, or over the entire training process? If the former, `OneCycleLR` is not for you; try `CyclicLR`. – w568w Aug 24 '22 at 11:48
  • In the entire training process. I want that the learning rate is equal to this picture: https://discuss.pytorch.org/t/optimizer-step-before-lr-scheduler-step-error-using-gradscaler/92930 so for example at epoch 1 it is 0.001, at epoch 80 it is 0.01 and then at epoch 200 is again 0.001 (or little bit lower) – CasellaJr Aug 24 '22 at 11:56
  • @CasellaJr Then `epochs=200, steps_per_epoch=312` or `total_steps=200 * 312` should work fine. – w568w Aug 24 '22 at 12:01
  • Ok, I am gonna try this setting and see how it changes during training. I will update you. Moreover, is max_lr set correctly? – CasellaJr Aug 24 '22 at 12:44
  • I think this setting is not correct: `Epoch 1: TrL=1.7318, TrA=0.3911, VL=1.3670, VA=0.5057, TeL=1.3560, TeA=0.5104, LR=0.00040, Epoch 2: TrL=1.3204, TrA=0.5240, VL=1.2049, VA=0.5626, TeL=1.2066, TeA=0.5702, LR=0.00040,`... `Epoch 66: TrL=0.1658, TrA=0.9428, VL=0.7534, VA=0.8093, TeL=0.8074, TeA=0.7954, LR=0.00040, Epoch 67: TrL=0.1718, TrA=0.9416, VL=0.7670, VA=0.8063, TeL=0.8020, TeA=0.8012, LR=0.00040,` so the lr is remaining stable at 0.0004 – CasellaJr Aug 24 '22 at 13:10
  • @CasellaJr Have you had a look at my latest edit on the answer? It might help you figure out the problem. If the plot looks normal, you may have some other problem in your code. – w568w Aug 24 '22 at 14:14
  • It is working, see the values of the lr in this pastebin: https://pastebin.com/SjNAULv9 As you can see it works with these: `optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001) scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr = 0.01, epochs=5, steps_per_epoch=312)` If I write epochs=200 like before, the lr remains stable for too many epochs: it goes from 0.0004 to 0.00041 only after roughly 60 epochs, so it needs a very large number of total training epochs to complete one cycle. P.S. thanks for the explanation of max_lr – CasellaJr Aug 24 '22 at 14:26
  • I want to improve the values in my last pastebin a little. Basically I want to reach the maximum (0.01) a few epochs earlier, in order to start decreasing earlier. I think I am gonna try with epochs=4 and steps 312. I think it will work, but I have not understood why – CasellaJr Aug 24 '22 at 14:27
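
Following up on the last comment: how early the LR reaches its maximum is governed by pct_start rather than by shrinking epochs, since the LR rises for the first pct_start * total_steps steps (see the answer above). A small sketch of the arithmetic, using the 200-epoch, 312-steps-per-epoch values from the comments:

EPOCHS = 200
STEPS_PER_EPOCH = 312  # len(train_loader): 40k training images at batch size 128
total_steps = EPOCHS * STEPS_PER_EPOCH

pct_start = 0.3  # OneCycleLR default
peak_epoch = pct_start * total_steps / STEPS_PER_EPOCH
print(peak_epoch)  # 60.0, so the LR peaks at max_lr around epoch 60

# To peak around epoch 80 instead (as described in an earlier comment), use pct_start = 80 / 200 = 0.4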