I want to train on CIFAR-10, say for 200 epochs.
This is my optimizer:
optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001)
I want to use OneCycleLR as scheduler. Now, according to the documentation, these are the parameters of OneCycleLR:
torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, total_steps=None, epochs=None, steps_per_epoch=None, pct_start=0.3, anneal_strategy='cos', cycle_momentum=True, base_momentum=0.85, max_momentum=0.95, div_factor=25.0, final_div_factor=10000.0, three_phase=False, last_epoch=-1, verbose=False)
I have seen that the most commonly used ones are max_lr, epochs and steps_per_epoch. The documentation says this:
- max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group.
- epochs (int) – The number of epochs to train for. This is used along with steps_per_epoch in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None
- steps_per_epoch (int) – The number of steps per epoch to train for. This is used along with epochs in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None
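If I read this correctly, when total_steps is not passed it is simply inferred as epochs * steps_per_epoch. As a sanity check (a small sketch, not my training code; I am assuming the scheduler exposes the inferred value as scheduler.total_steps):

import torch
import torch.optim as optim

# dummy parameter, just to build an optimizer and a scheduler for inspection
dummy = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.Adam([dummy], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.01, epochs=200, steps_per_epoch=128)
print(scheduler.total_steps)  # I would expect 200 * 128 = 25600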
About steps_per_epoch, I have seen in many GitHub repos that it is set to steps_per_epoch=len(data_loader), so if I have a batch size of 128, then this parameter is equal to 128.
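If I understand those repos correctly, the pattern looks roughly like this (a sketch under my assumptions: train_loader, model and criterion are defined elsewhere, and the scheduler is stepped once per batch, not once per epoch):

scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.01, epochs=200, steps_per_epoch=len(train_loader))
for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # one scheduler step per batch, so epochs * steps_per_epoch steps in total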
However, I do not understand the other two parameters. If I want to train for 200 epochs, should I set epochs=200? Or is this a parameter that runs the scheduler only for that number of epochs before it restarts? For example, if I write epochs=10 in the scheduler but train for 200 epochs in total, is that like 20 complete cycles of the scheduler?
Then there is max_lr. I have seen people use a value greater than the lr of the optimizer, and other people use a smaller value. I think that max_lr must be greater than the lr (otherwise why would it be called max :smiley: ?)
However, if I print the learning rate epoch by epoch, it takes strange values. For example, with this setting:
optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr = 0.01, epochs=200, steps_per_epoch=128)
And this is the learning rate:
Epoch 1: TrL=1.7557, TrA=0.3846, VL=1.4136, VA=0.4917, TeL=1.4266, TeA=0.4852, LR=0.0004,
Epoch 2: TrL=1.3414, TrA=0.5123, VL=1.2347, VA=0.5615, TeL=1.2231, TeA=0.5614, LR=0.0004,
...
Epoch 118: TrL=0.0972, TrA=0.9655, VL=0.8445, VA=0.8161, TeL=0.8764, TeA=0.8081, LR=0.0005,
Epoch 119: TrL=0.0939, TrA=0.9677, VL=0.8443, VA=0.8166, TeL=0.9094, TeA=0.8128, LR=0.0005,
So the lr is increasing, just very slowly.
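If it helps, I think the schedule produced by these settings can also be inspected on its own, without the model, with something like this (again a sketch with a dummy parameter; get_last_lr() returns the learning rate after the last scheduler step):

import torch
import torch.optim as optim

dummy = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.Adam([dummy], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.01, epochs=200, steps_per_epoch=128)
for step in range(200 * 128):
    optimizer.step()   # no gradients here, only to keep the optimizer/scheduler call order right
    scheduler.step()
    if (step + 1) % 128 == 0:  # print once per "epoch" of 128 scheduler steps
        print(f"epoch {(step + 1) // 128}: lr = {scheduler.get_last_lr()[0]:.6f}")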