I am training a small Transformer encoder-decoder translation model on a small dataset.
My dataset contains fewer than 200k examples.
For training a Transformer in this low-resource setting, the two papers below suggest using a learning rate of 2 (reference 2) or 0.2 (reference 1) with Noam decay.
However, I don't know how to set the learning rate to 2 or 0.2 when using the Noam decay scheduler. As far as I know, with the Noam scheduler the learning rate is determined by the model dimension, the current step number, and the number of warmup steps, so I don't understand what setting the learning rate to 2 or 0.2 means, or how to do it.
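For reference, this is the Noam schedule as I understand it (a minimal Python sketch; d_model and warmup_steps values are just placeholders, and the `factor` argument is only my guess at where the 2 or 0.2 might go):

```python
import math

def noam_lr(step, d_model=512, warmup_steps=4000, factor=1.0):
    """Noam schedule from 'Attention Is All You Need':
    lr = factor * d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
    `factor` is my assumption for where a "learning rate" of 2 or 0.2 would be applied.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
```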
When I modified the Noam scheduler into a linear warmup followed by square-root decay with a peak learning rate of 0.2, the model did not converge. What I tried looks roughly like the sketch below (the warmup value is just an example).
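```python
def linear_warmup_sqrt_decay_lr(step, peak_lr=0.2, warmup_steps=4000):
    """Linear warmup to peak_lr, then inverse-square-root decay.
    This is roughly what I tried; warmup_steps=4000 is just an illustrative value.
    """
    step = max(step, 1)
    if step < warmup_steps:
        # warm up linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # decay proportionally to 1/sqrt(step), continuous at step == warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)
```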
Thanks in advance.
Reference papers: