I am training a Transformer model on a time series dataset, but there is always a persistent gap between the training and validation curves in my loss plot. I have tried different learning rates, batch sizes, dropout rates, numbers of heads, dim_feedforward sizes, and numbers of layers, but none of them close the gap. Can anyone give me some ideas for reducing the gap between training and validation loss? A rough sketch of my setup is below, with the knobs I have already swept marked.
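This is not my exact code, just a minimal stand-in so it is clear which hyperparameters I mean; all sizes and values are illustrative:

```python
import torch
import torch.nn as nn

d_model, seq_len, n_features = 64, 50, 8

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=4,              # swept: number of heads
    dim_feedforward=128,  # swept: dim_feedforward
    dropout=0.1,          # swept: dropout
    batch_first=True,
)
model = nn.Sequential(
    nn.Linear(n_features, d_model),                      # input projection
    nn.TransformerEncoder(encoder_layer, num_layers=2),  # swept: layers
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # swept: learning rate
criterion = nn.MSELoss()

# Synthetic batch just to make the sketch runnable; batch size was also swept.
xb = torch.randn(32, seq_len, n_features)
yb = torch.randn(32, seq_len, d_model)

optimizer.zero_grad()
loss = criterion(model(xb), yb)
loss.backward()
optimizer.step()
```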
I also asked this question on the PyTorch forum but didn't get any reply. How should one design a decoder for time series regression with a Transformer?
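To make that question concrete: is an encoder-only setup where the "decoder" is just a linear regression head, like the sketch below, reasonable, or should I be using a full `nn.TransformerDecoder` with target queries? Again, all names and sizes here are illustrative, not my exact code:

```python
import torch
import torch.nn as nn

class TransformerRegressor(nn.Module):
    """Encoder-only variant: the 'decoder' is a linear head on the
    last time step. Sizes are placeholders."""
    def __init__(self, n_features=8, d_model=64, horizon=1):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=128,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, horizon)  # regression "decoder"

    def forward(self, x):                  # x: (batch, seq_len, n_features)
        h = self.encoder(self.proj(x))
        return self.head(h[:, -1])         # (batch, horizon)

# Quick shape check on a synthetic batch.
x = torch.randn(32, 50, 8)
print(TransformerRegressor()(x).shape)    # torch.Size([32, 1])
```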