
I'm training an LSTM model for time series forecasting. This is the training loss plot.


This is a one-step-ahead forecasting case, so I'm training the model using a rolling window. Here, we have 26 forecasting steps (for every step, I train the model again). As you can see, after epoch #25~27, the training loss suddenly becomes very noisy. Why do we get this behaviour?

P.S. I'm using an LSTM with tanh activation. I also tried L1 and L2 regularization, but the behaviour is the same. The layer after the LSTM is a Dense layer with linear activation, a MinMaxScaler is applied to the input data, and the optimizer is Adam. I also see the same behaviour on the validation dataset.
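
For reference, here is a minimal sketch of the setup described above in Keras. The layer size, window length, regularization strengths, and the placeholder data are assumptions for illustration, not the actual values from the question.

```python
# Sketch of the described setup: MinMaxScaler -> LSTM(tanh) -> Dense(linear), Adam.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.regularizers import L1L2
from tensorflow.keras.optimizers import Adam

window = 10       # assumed rolling-window length
n_features = 1

# Scale inputs to [0, 1] as in the question (placeholder data here).
scaler = MinMaxScaler()
series = scaler.fit_transform(np.random.rand(500, n_features))

# Build (X, y) pairs for one-step-ahead forecasting.
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = Sequential([
    LSTM(32, activation="tanh", input_shape=(window, n_features),
         kernel_regularizer=L1L2(l1=1e-4, l2=1e-3)),  # L1 + L2, as mentioned
    Dense(1, activation="linear"),
])
model.compile(optimizer=Adam(learning_rate=1e-3), loss="mse")
model.fit(X, y, epochs=40, batch_size=32, verbose=0)
```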

  • @user1269942. Yes. It postpones that behaviour (it doesn't remove it). But suppose that in the above figure, we select epoch #35 (before the noise) as the best model. When I plot the forecast values and compare them with the real values, the forecast values look like a horizontal line. This is not just related to the above case; it happens many times when I use early stopping. Now, by increasing the `L2` regularization to `0.09` and using `clipvalue = 0.5`, it behaves better. But I'm still wondering what is happening behind the scenes and how I can improve the performance. – Eghbal Oct 25 '19 at 16:53
  • do you have a corresponding plot for the validation loss? – user1269942 Oct 25 '19 at 18:00
  • @user1269942 To be more precise, my output data has a Lévy distribution. But at the same time, I want to have good performance for both normal and outlier samples. Based on this description, should I add the plots based on the output without normalization? – Eghbal Oct 27 '19 at 23:33
  • I feel like I'm stabbing in the dark...much like my day job! Have you tried a log or square-root scaling treatment for your input data? Perhaps that would rein in some of the outliers. – user1269942 Oct 28 '19 at 20:36
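
To illustrate the suggestion in the last comment, here is a rough sketch of a log or square-root pre-transform applied before the MinMaxScaler. The placeholder heavy-tailed data and the shift constant are assumptions; values may need shifting to be non-negative before either transform.

```python
# Compress heavy-tailed inputs before min-max scaling (suggested in the comments).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

raw = np.random.standard_cauchy(500).reshape(-1, 1)  # placeholder heavy-tailed data

# Shift so values are non-negative, then compress large outliers with log1p.
shift = max(0.0, -raw.min())
log_scaled = np.log1p(raw + shift)

# Square root is a milder alternative (also needs non-negative inputs).
sqrt_scaled = np.sqrt(raw + shift)

scaler = MinMaxScaler()
model_input = scaler.fit_transform(log_scaled)
```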

1 Answer


Are you using gradient clipping? If not, that could help, since gradient values can become really, really small or large, making it very difficult for the model to make further progress and learn better. The recurrent layer may have created a narrow valley in the loss surface that you keep missing because the gradient step is too large.
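
As a concrete illustration, gradient clipping can be enabled directly on the optimizer in Keras. The thresholds below are assumptions (the question's comments mention that `clipvalue = 0.5` helped), and the tiny model is only a stand-in for the one described in the question.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

# Stand-in model; substitute your own LSTM + Dense architecture.
model = Sequential([LSTM(32, input_shape=(10, 1)), Dense(1, activation="linear")])

# Clip each gradient element to [-0.5, 0.5] (value clipping)...
model.compile(optimizer=Adam(learning_rate=1e-3, clipvalue=0.5), loss="mse")

# ...or clip the global gradient norm instead:
# model.compile(optimizer=Adam(learning_rate=1e-3, clipnorm=1.0), loss="mse")
```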