I am fine-tuning BERT (from Hugging Face) for sentiment analysis, which is an NLP task.
My question is about the learning rate.
from torch import nn
from transformers import AdamW, get_linear_schedule_with_warmup

EPOCHS = 5

# Optimizer from transformers, initial learning rate set to 1e-3
optimizer = AdamW(model.parameters(), lr=1e-3, correct_bias=True)

# The scheduler is stepped once per batch over all epochs
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps,
)
loss_fn = nn.CrossEntropyLoss().to(device)
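For context, the optimizer and the scheduler are stepped once per batch. The loop looks roughly like this (simplified sketch, not my exact code; the batch field names are placeholders and I assume the model returns raw logits):

model.train()
for batch in train_data_loader:
    input_ids = batch["input_ids"].to(device)            # placeholder field names
    attention_mask = batch["attention_mask"].to(device)
    targets = batch["targets"].to(device)

    logits = model(input_ids=input_ids, attention_mask=attention_mask)
    loss = loss_fn(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # one scheduler step per batch, i.e. total_steps in all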
Can you please explain how to read 1e-3?
Is it the step size of the updates, or is it a value that gets decayed during training?
If the latter, is the decay linear?
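For reference, this is how I would inspect the schedule with a throwaway model (sketch only, not my real training setup; the dummy model and step count are just placeholders, and get_last_lr() is the standard PyTorch scheduler method):

import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

# Dummy model and optimizer, only to look at the schedule itself
dummy_model = nn.Linear(10, 2)
dummy_optimizer = torch.optim.AdamW(dummy_model.parameters(), lr=1e-3)

total_steps = 100  # stands in for len(train_data_loader) * EPOCHS
dummy_scheduler = get_linear_schedule_with_warmup(
    dummy_optimizer, num_warmup_steps=0, num_training_steps=total_steps
)

for step in range(total_steps):
    dummy_optimizer.step()   # in real training this follows loss.backward()
    dummy_scheduler.step()
    if step % 20 == 0:
        # falls linearly from 1e-3 toward 0 over total_steps
        print(step, dummy_scheduler.get_last_lr()[0])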
If I train with a value of 3e-5, which is a value recommended by Hugging Face for NLP tasks, my model overfits very quickly: the training loss decreases to a minimum, while the validation loss increases.
Learning rate 3e-5:
If I train with a value of 1e-2, I get a steady improvement in the validation loss, but the validation accuracy does not improve after the first epoch. See picture. Why does the validation accuracy not increase even though the loss keeps falling? Isn't that a contradiction? I thought these two metrics tracked each other.
Learning rate 1e-2:
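To convince myself that this is at least possible in principle, here is a toy check with made-up logits (not my data): if the model only becomes more confident on examples it already classifies correctly, the cross-entropy loss drops while the accuracy stays the same.

import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()
labels = torch.tensor([0, 1, 1, 0])

# "Earlier epoch": 3 of 4 predictions correct, all with low confidence
logits_early = torch.tensor([[ 0.6,  0.4],
                             [ 0.4,  0.6],
                             [ 0.3,  0.7],
                             [ 0.4,  0.6]])   # last one is wrong

# "Later epoch": same 3 correct / 1 wrong, but much more confident on the correct ones
logits_late = torch.tensor([[ 3.0, -3.0],
                            [-3.0,  3.0],
                            [-3.0,  3.0],
                            [-0.5,  0.5]])    # still wrong

for name, logits in [("early", logits_early), ("late", logits_late)]:
    acc = (logits.argmax(dim=1) == labels).float().mean().item()
    print(name, "loss:", loss_fn(logits, labels).item(), "accuracy:", acc)
# loss drops from ~0.63 to ~0.33 while accuracy stays at 0.75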
What would you recommend?