
I am training a small transformer encoder-decoder translation model on a small dataset.

My dataset size is less than 200k.

When training a Transformer on low-resource datasets, the two papers below suggest using a learning rate of 2 (reference 2) or 0.2 (reference 1), respectively, with Noam decay.

However, I don't know how to set a learning rate of 2 or 0.2 when I use the Noam decay scheduler.

As far as I know, with the Noam decay scheduler the learning rate is determined by the model dimension, the step number, and the warmup step size, so I don't understand how to set a learning rate of 2 or 0.2, or what such a value even means.
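
For reference, this is the standard Noam schedule from the original Transformer paper (Vaswani et al., 2017); note that it has no explicit base learning rate, only the model dimension and warmup step count:

```python
def noam_lrate(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Standard Noam schedule: d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```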

When I modified the Noam decay scheduler into a linear warmup followed by square-root decay with a peak of 0.2, the model did not converge.
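
To be concrete, below is roughly what I mean by linear warmup followed by square-root decay with a peak of 0.2 (the warmup step count is just an example value):

```python
def warmup_sqrt_decay(step: int, peak_lr: float = 0.2, warmup_steps: int = 4000) -> float:
    """Linear warmup to peak_lr, then inverse square-root decay."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warmup to the peak
    return peak_lr * (warmup_steps / step) ** 0.5  # square-root decay after the peak
```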

Thanks in advance.

Reference papers:

  1. https://aclanthology.org/C18-1054.pdf
  2. https://aclanthology.org/2021.mtsummit-research.5.pdf

1 Answer


I found the answer through more googling.

If you want to use the optimizer's learning rate together with the Noam decay scheduler, just multiply that learning rate by the scheduler's lrate function, which is itself determined by the three components mentioned above (model dimension, step number, and warmup steps).
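
As a minimal sketch (d_model and warmup_steps are just example values), the "learning rate" of 2 or 0.2 is simply a constant scale factor applied on top of the usual Noam curve; I believe this is also how toolkits such as OpenNMT-py interpret their learning_rate option when decay_method is set to noam:

```python
def scaled_noam_lrate(step: int, lr: float = 0.2,
                      d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Noam schedule scaled by a base learning rate (e.g. 2 or 0.2 from the papers)."""
    step = max(step, 1)  # guard against step 0
    # lr multiplies the rate given by the three components (d_model, step, warmup)
    return lr * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```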