Why is my neural network stuck at a high loss value after the first epochs?

Question

I'm doing regression with a neural network. It should be a simple task for a NN: I have 10 features and 1 output that I want to predict. I'm using PyTorch for my project, but my model is not learning well. The loss starts at a very high value (40000), decreases rapidly to 6000-7000 within the first 5-10 epochs, and then gets stuck there no matter what I do.

I even switched from plain PyTorch to skorch so that I could use its cross-validation functionality, but that didn't help either. I tried different optimizers and added layers and neurons to the network, but the loss still gets stuck at 6000, which is a very high value. This is a regression with 10 features and one continuous target, so it should be easy, which is why it confuses me even more.

Here is my network. I tried everything from more complex architectures (more layers and units) to batch normalization and different activations, but nothing has worked.

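import torch.nn as nn

class BearingNetwork(nn.Module):
    # X is the [39006, 10] input tensor; it must already exist when the class is defined
    def __init__(self, n_features=X.shape[1], n_out=1):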
        super().__init__()
        self.model = nn.Sequential(
             
            nn.Linear(n_features, 512), 
            nn.BatchNorm1d(512),
            nn.LeakyReLU(),
            nn.Linear(512, 64),
            nn.BatchNorm1d(64),
            nn.LeakyReLU(),
            nn.Linear(64, n_out),
#             nn.LeakyReLU(),
#             nn.Linear(256, 128),
#             nn.LeakyReLU(),
#             nn.Linear(128, 64),
#             nn.LeakyReLU(),
#             nn.Linear(64, n_out)
        )
        
    def forward(self, x):
        out = self.model(x)
        return out

And here are my settings. Using skorch is easier than plain PyTorch. I'm also monitoring the R2 metric, and I added RMSE as a custom metric to track the model's performance. I also tried amsgrad for Adam, but that didn't help.

R2 = EpochScoring(r2_score, lower_is_better=False, name='R2')
explained_var_score = EpochScoring(EVS, lower_is_better=False, name='EVS Metric')
custom_score = make_scorer(RMSE)
rmse = EpochScoring(custom_score, lower_is_better=True, name='rmse')

bearing_nn = NeuralNetRegressor(
    
    BearingNetwork,
    criterion=nn.MSELoss,
    optimizer=optim.Adam,
    optimizer__amsgrad=True,
    max_epochs=5000,
    batch_size=128,
    lr=0.001,
    train_split=skorch.dataset.CVSplit(10),
    callbacks=[R2, explained_var_score, rmse, Checkpoint(), EarlyStopping(patience=100)],
    device=device
    
)

I also standardize the input values.

My input has the shape:

torch.Size([39006, 10])

and the shape of the output is:

torch.Size([39006, 1])

I'm using a batch size of 128, but I also tried other values like 32, 64, 512 and even 1024. Although normalizing the output is not necessary, I tried that too and it didn't help: the loss is still high when I predict values. I'll also add a screenshot of my training and validation losses and metrics over the epochs, to show how the loss decreases in the first 5 epochs and then seemingly stays forever at 6000, which is a very high value for a loss.

Can you try adding a `nn.BatchNorm1d` layer as the very **first** layer of your model? Does this change make any difference in the training process? – Shai Dec 05 '19 at 15:21
@Shai thanks for the suggestion, I tried that but it didn't work – basilisk Dec 05 '19 at 16:08
Have you tried overfitting your model on a single example and seeing if it works (i.e. you get a loss of 0 or close to it)? – Zaccharie Ramzi Dec 08 '19 at 19:51
@ZaccharieRamzi what do you mean by a single example? If you mean a single feature, yes, I did that and it didn't work. The loss stays stuck at a high value – basilisk Dec 08 '19 at 19:56
No no, just a single observation with the 10 features; basically your `x` would be of size `(1, 10)`. This way you can see if your model is even able to overfit a single example. If it can't do that, it's very unlikely it will be able to predict for more. – Zaccharie Ramzi Dec 08 '19 at 19:58
@ZaccharieRamzi good hint, thank you. Yes, I tried that, and my model manages to fit the training data, but the loss on the validation data is very high; in other words, yes, it overfits the training data. I tried with 50 examples rather than one as you said, but it worked. What can I conclude from this? I think there is no problem with my implementation, so what should I do? – basilisk Dec 08 '19 at 20:44
So what it means is that your model has enough capacity to overfit, which means there is probably no logical error in the model (for example, you didn't put a ReLU as the last activation for a target that's in R). – Zaccharie Ramzi Dec 08 '19 at 21:37
Now you can try and see how the NN compares to simpler baselines, like the mean target or a mean square regression (use `scikit-learn` for example). This way you will know whether the network is actually learning something very useful or not. If the NN is better than the regression and the metric is still not where you want it to be, the problem might be too difficult for a number of reasons. If the NN is not better than the regression, given there is no implementation error (we saw that with overfitting), there is perhaps an optimization problem (learning rate, batch size, etc.). – Zaccharie Ramzi Dec 08 '19 at 21:40
@ZaccharieRamzi thanks, but I don't know what you mean by mean target or mean squared regression. Do you mean linear regression with sklearn? Or maybe a decision tree regressor model from sklearn? – basilisk Dec 08 '19 at 22:07
Mean target is just the average value of the target over the training dataset. Yes, I meant linear regression (least squares, not mean square, sorry), but you can definitely try any other simple model like a decision tree. It's just to see whether the NN is actually better than you think or not. – Zaccharie Ramzi Dec 08 '19 at 22:12
I tried the approaches that you suggested. Only the random forest regressor gave better performance, but it overfitted the data: it did awfully when I predicted on the test set or in cross-validation. It surprised me, though, because it wasn't computationally expensive and it did give better results than the NN, but only on the training data, so it overfitted faster. What should I do now, since the NN and the other machine learning approaches didn't work for my use case? Should I keep trying to optimize the NN? I think I tried everything; maybe I'll try initializing the weights with different approaches – basilisk Dec 08 '19 at 23:31
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203939/discussion-between-zaccharie-ramzi-and-basilisk). – Zaccharie Ramzi Dec 09 '19 at 21:55
Maybe try decreasing the learning rate as the epochs go on? Say, halve the learning rate every 1000 epochs? You might find that helps a lot – Recessive Dec 12 '19 at 02:37
@Recessive yes, I tried that but it didn't work. I also tried cyclic LR and warm restarts, but those didn't work either – basilisk Dec 12 '19 at 07:25
Did you standardize/normalize your data? In regression (and neural networks in general), you should always feed standardized/normalized features to your model. – amdex Dec 12 '19 at 08:00
@amdex of course! I already wrote that in the question description – basilisk Dec 12 '19 at 08:03
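
The two baselines suggested in the comments above (mean target and least squares) could be checked along these lines. This is only a minimal sketch with NumPy: the arrays `X` and `y` are synthetic stand-ins with the same number of features as in the question, and the 1600/400 split is an arbitrary assumption, not the asker's setup.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # R^2 relative to the variance of the evaluated targets
    return 1.0 - mse(y_true, y_pred) / float(np.var(y_true))

rng = np.random.default_rng(0)
# Synthetic stand-in data (10 features, 1 target); replace with the real arrays.
X = rng.normal(size=(2000, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=2000)
X_train, X_val, y_train, y_val = X[:1600], X[1600:], y[:1600], y[1600:]

# Baseline 1: always predict the mean of the training targets.
mean_pred = np.full_like(y_val, y_train.mean())

# Baseline 2: ordinary least squares with an intercept column.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
A_val = np.column_stack([X_val, np.ones(len(X_val))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
ols_pred = A_val @ coef

print("mean target   MSE / R2:", mse(y_val, mean_pred), r2(y_val, mean_pred))
print("least squares MSE / R2:", mse(y_val, ols_pred), r2(y_val, ols_pred))
```

If the network's validation loss is not clearly below both numbers on the real data, the problem is more likely in the data or the optimization than in the architecture.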

Brent Rohner · Answer 1 · 2019-12-03T10:33:46.430

3

Considering that your training and dev losses are decreasing over time, it seems like your model is training correctly. As for your worry about the training and dev loss values, they depend entirely on the scale of your target values (how big are your target values?) and on the metric used to compute the losses. If your target values are big and you want smaller train and dev loss values, you can normalise the target values.

From what I gather with respect to your experiments as well as your R2 scores, it seems that you are looking for a solution in the wrong area. To me, it seems like your features aren't strong enough considering that your R2 scores are low, which could mean that you have a data quality issue. This would also explain why your architecture tuning has not improved your model's performance as it is not your model that is the issue. So if I were you, I would think about what new useful features I could add and see if that helps. In machine learning, the general rule is that models are only as good as the data that they are trained on. I hope this helps!

edited Dec 03 '19 at 10:33

answered Dec 03 '19 at 10:26

The train and dev losses are decreasing, but when they reach 6000 they get stuck there even after 5000 epochs, and when they do improve it is very slow. Normalizing the target is unnecessary: I'll get the same loss values, just on another scale. That will improve nothing! My R2 scores are low because my model is not fitting the data. Maybe you are right and my data doesn't have a strong relationship, but as far as I know a neural network can also be fitted to such complex non-linearity, or am I wrong? I can't add features; this is the only dataset I have for this project and I must live with it – basilisk Dec 03 '19 at 10:55
You are right that normalisation won't improve your model. But that is not why I brought it up. I mentioned it because you stated several times that your training loss is high. This could be because of how big the target values are. By normalizing your target data, you would have smaller values to work with. What may be useful is to compare your model to others on the same task, or to use a baseline, just to see whether your model is actually doing poorly or not. – Brent Rohner Dec 03 '19 at 11:13
Yes, I know what you mean. My target values are in the range [0-360], so they are not that big. Normalizing the target will maybe give loss values in [0-1] or so, but in that case a value of 1 would be very bad. I don't know what else I can do; I think I tried everything I can. Maybe I can try weight initialization, otherwise I have no idea – basilisk Dec 03 '19 at 11:20
Are all of your features scaled similarly? If not, I would normalise all of your features between 0 and 1 and do the same for your target data. When it comes to evaluating the predictions, you can unnormalize them. – Brent Rohner Dec 03 '19 at 11:28
I tried that and it didn't work. After scaling the data and also the target, the loss starts at 1.04, decreases to 0.7, and then gets stuck there – basilisk Dec 04 '19 at 16:34
@BrentRohner can you point to some resource for understanding the size of the target value and normalizing the loss? I can't understand what you are referring to – A.B May 25 '20 at 09:15
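
To make the point about target scale concrete, here is a small NumPy sketch. It assumes standardization (zero mean, unit variance) rather than the 0-1 scaling mentioned above; the targets and predictions are synthetic, and `scale`/`unscale` are hypothetical helper names. The same predictions give a very different MSE depending on the units of the target, and the two values differ exactly by the squared standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical targets in degrees, standing in for the [0, 360] range from the comments.
y = rng.uniform(0.0, 360.0, size=1000)
# Hypothetical predictions: the truth plus noise, just to have some error to measure.
pred = y + rng.normal(0.0, 50.0, size=y.shape)

mu, sigma = y.mean(), y.std()

def scale(t):
    """Map values from original units to standardized units."""
    return (t - mu) / sigma

def unscale(t):
    """Map standardized values back to the original units."""
    return t * sigma + mu

mse_raw = np.mean((pred - y) ** 2)
mse_scaled = np.mean((scale(pred) - scale(y)) ** 2)

# Same predictions, same quality of fit; only the units of the loss differ.
print(mse_raw, mse_scaled, mse_scaled * sigma ** 2)
```

A network trained on the scaled target would output values in scaled units, so `unscale` is what maps its predictions back before reporting errors in the original units.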

score 2 · Answer 2 · answered Dec 10 '19 at 03:37

The metric you should be looking at is R^2, not the magnitude of the loss function. The purpose of a loss function is just to let the optimizer know if it's going in the right direction--it's not a measure of fit that's comparable across data sets and learning setups. That's what R^2 is for.
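
One way to turn an MSE loss into a comparable fit measure is the identity R^2 = 1 - MSE / Var(y), with both terms computed on the same data. The sketch below assumes (this is not stated in the question) that the target was standardized to unit variance, in which case the plateau at 0.7 reported in the comments under the first answer would correspond to an R^2 of roughly 0.3. `r2_from_mse` is a hypothetical helper, not a library function.

```python
def r2_from_mse(mse: float, target_variance: float) -> float:
    """R^2 = 1 - MSE / Var(y); always predicting the mean gives exactly 0."""
    return 1.0 - mse / target_variance

# Assumption: standardized target, so Var(y) = 1.
print(r2_from_mse(0.7, 1.0))  # plateau at 0.7 -> R^2 of about 0.3
print(r2_from_mse(1.0, 1.0))  # no better than the mean -> R^2 = 0
```

On the unscaled target, the same formula needs the variance of the target in its original units.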

Your R^2 scores show that you're explaining around a third of the total variance in the output, which is often a very good result for a data set with only 10 features. Actually, given the shape of your data, it's more likely that your hidden layers are considerably larger than necessary and risk overfitting.

To really evaluate this model, you'd need to know (1) how the R^2 score compares to simpler regression approaches like OLS and (2) why you should have any confidence that more than 30% of the output variance should be captured by the input variables.

For #1, at least the R^2 shouldn't be worse. As for #2, consider the canonical digit categorization example. We know that all the information necessary to recognize digits with very high accuracy (i.e. R^2 approaching 1) is present in the images, because humans can do it. That's not necessarily the case with other data sets, because there may be important sources of variance that aren't captured in the source data.

answered Dec 10 '19 at 03:37

kkoning

What is OLS? Is it the simple linear regression model in scikit-learn? I tried other ML approaches and only RandomForestRegressor gave better performance, but it also overfitted the data. On the training data it gave me a small loss and an R2 score of 88%, but on the validation data it gave a score of 48%, which is better than the neural network's score and also a better loss value. But I couldn't manage to prevent it from overfitting, even when using cross-validation – basilisk Dec 10 '19 at 08:16
Yes, OLS is [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares), and is typically what's used in simple linear regression models. It's a good baseline comparison and should avoid overfitting. An R^2 of .88 on training versus .48 on validation is very poor, indicating an extreme overfit. It's possible that a random forest or genetic programming approach would predict better than a NN, but with this limited dataset it would be useful to apply significant parsimony pressure to address the overfit. – kkoning Dec 10 '19 at 15:14
Thanks for your answer, can you please explain more? I tried a simple linear regression, but it performed poorly: R2 = 0.006 on the train data and 0.004 on the test data. I think linear regression won't help because my data is strongly non-linear. Random forest was better, but it overfitted the data as you said. What should I do next in your opinion? – basilisk Dec 10 '19 at 15:19
To use OLS with non-linear data, you need to apply transforms to the input features, e.g., log(x), sqrt(x), and then regress on those. Visualizing residuals is often useful to figure this out. However, while the resulting model is more understandable, this won't capture/model discontinuities and inter-feature interactions. – kkoning Dec 10 '19 at 16:30
If your data isn't just non-linear but also has discontinuities and complex interactions, the random forest and [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming) approaches may be more appropriate. To reduce overfitting, reduce the number of trees or apply "parsimony pressure". You'll also need to consider where to stop, as there's no guarantee that it's possible to extract more accurate predictions out of a data set, as there are often many causal factors that are not captured. However, it's impossible to comment on that without knowing what the data is. – kkoning Dec 10 '19 at 16:41
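
The idea of regressing on transformed inputs such as log(x) and sqrt(x) might look like the following. It is a minimal NumPy sketch on one synthetic, strictly positive feature whose true relationship is logarithmic; that relationship is an assumption chosen for illustration, and `fit_ols` and `r2` are hypothetical helpers.

```python
import numpy as np

def fit_ols(A, y):
    """Least-squares coefficients for the design matrix A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2(y_true, y_pred):
    return 1.0 - np.mean((y_true - y_pred) ** 2) / np.var(y_true)

rng = np.random.default_rng(0)
# Synthetic, strictly positive feature so that log and sqrt are defined.
x = rng.uniform(1.0, 100.0, size=500)
y = 4.0 * np.log(x) + rng.normal(scale=0.3, size=x.shape)

ones = np.ones_like(x)
A_raw = np.column_stack([x, ones])                         # plain linear regression
A_aug = np.column_stack([x, np.log(x), np.sqrt(x), ones])  # plus transformed copies

r2_raw = r2(y, A_raw @ fit_ols(A_raw, y))
r2_aug = r2(y, A_aug @ fit_ols(A_aug, y))
print(r2_raw, r2_aug)
```

Both scores here are computed on the training data; on a real dataset they should be compared on held-out data, and standardized features that take negative values would need shifting before a log or square root.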

score 0 · Answer 3 · answered Dec 12 '19 at 05:44

As your loss decreases from 40000 to 6000, your NN model has learnt the prevalent relationship but not all of them. You can aid this learning by transforming the predictor variables and feeding them to your model as derived features, to see if that helps. You can also try stepwise addition of features to your NN model, adding the most influential predictors first. At every iteration, evaluate the model performance (i.e. training loss).
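
The stepwise addition described above could be sketched as a greedy forward selection. To keep the example short and fast, a least-squares fit stands in for the neural network, and the score is validation MSE rather than training loss; both are assumptions of this sketch, as are the synthetic data and the helper names `val_mse` and `forward_selection`.

```python
import numpy as np

def val_mse(cols, X_tr, y_tr, X_va, y_va):
    """Validation MSE of a least-squares fit restricted to the feature columns in `cols`."""
    A_tr = np.column_stack([X_tr[:, cols], np.ones(len(X_tr))])
    A_va = np.column_stack([X_va[:, cols], np.ones(len(X_va))])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ coef - y_va) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va):
    """Greedily add the feature that lowers validation MSE the most."""
    remaining = list(range(X_tr.shape[1]))
    chosen, history = [], []
    while remaining:
        scores = {j: val_mse(chosen + [j], X_tr, y_tr, X_va, y_va) for j in remaining}
        best = min(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
        history.append((best, scores[best]))
    return history

rng = np.random.default_rng(0)
# Synthetic data in which only features 3 and 7 carry signal.
X = rng.normal(size=(1000, 10))
y = 5.0 * X[:, 3] + 2.0 * X[:, 7] + rng.normal(scale=0.5, size=1000)

history = forward_selection(X[:800], y[:800], X[800:], y[800:])
for feature, score in history:
    print(feature, round(score, 3))
```

The point at which the score stops improving indicates how many features actually carry usable signal. Swapping the least-squares fit for a short network training run gives the procedure as described, at a much higher compute cost.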

If the first step doesn't help, and as you are open to other approaches: depending on your data's dynamics, Gaussian process regression or quantile regression should help, as these methods are free from the assumptions of linear regression techniques. They should also help you explore different aspects of the relationship between your independent and dependent variables.
