
I followed this tutorial https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb (on a different dataset). The tutorial did not compute the mean squared error for the individual outputs, so I added the following line to the comparison function:

    mean_squared_error(signal_true,signal_pred)

However, the loss and MSE computed from the predictions were different from the loss and MSE reported by model.evaluate() on the test data. The errors from model.evaluate() (loss, MAE, MSE) on the test set:

    [0.013499056920409203, 0.07980187237262726, 0.013792216777801514]

The errors for the individual targets (outputs):

    Target0 0.167851388666284
    Target1 0.6068108648555771
    Target2 0.1710370357827747
    Target3 2.747463225418181
    Target4 1.7965991690103074
    Target5 0.9065426398192563 
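
These per-target numbers come from a comparison roughly along these lines (a simplified sketch; x_test, y_test, and the shapes are placeholders, and any scaling steps from the tutorial are omitted):

    # Simplified sketch of the comparison; x_test / y_test are placeholders.
    from sklearn.metrics import mean_squared_error

    print(model.evaluate(x_test, y_test))   # -> [loss, mae, mse]

    y_pred = model.predict(x_test)          # shape: (num_samples, num_targets)
    for target in range(y_test.shape[1]):
        signal_true = y_test[:, target]
        signal_pred = y_pred[:, target]
        print("Target%d" % target, mean_squared_error(signal_true, signal_pred))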

I think it might be a problem in training the model, but I could not find where it is exactly. I would really appreciate your help.

thanks

Manal

2 Answers


There are a number of reasons that you can have differences between the loss for training and evaluation.

  • Certain ops, such as batch normalization, behave differently at prediction time than during training; this can make a big difference with certain architectures, although it generally isn't supposed to if you're using batch norm correctly (see the sketch at the end of this answer).
  • MSE for training is averaged over the entire epoch, while evaluation only happens on the latest "best" version of the model.
  • It could be due to differences in the datasets if the split isn't random.
  • You may be using different metrics without realizing it.

I'm not sure exactly what problem you're running into, but it can be caused by a lot of different things and it's often difficult to debug.
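
To illustrate the first point, here is a minimal standalone sketch (not tied to your model) showing that a BatchNormalization layer alone produces different outputs in training mode and inference mode:

    # Minimal sketch: BatchNormalization output differs between training and
    # inference mode. All names here are illustrative, not from the question.
    import numpy as np
    import tensorflow as tf

    bn = tf.keras.layers.BatchNormalization()
    x = np.random.randn(32, 4).astype("float32")

    out_train = bn(x, training=True)    # normalized with this batch's statistics
    out_infer = bn(x, training=False)   # normalized with the moving averages

    print(np.allclose(out_train.numpy(), out_infer.numpy()))  # typically False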

markemus
  • Thanks for your reply. The difference in performance is between model.predict() and model.evaluate(). The evaluation was on the test set, and the prediction was on the same test set. I calculated the MSE manually and with the MSE from keras.metrics, and I got the same result. So I thought it was a problem in training the model, but I could not figure out where the bug is exactly. – Manal Mar 26 '20 at 11:51
  • The training was done with model.fit_generator() and the evaluation with model.evaluate() on the test data, and I got nearly the same results; but the prediction, which is also based on the test data, was done with model.predict(), and its results did not match those from the evaluation. – Manal Mar 26 '20 at 16:03
  • I don't know. Like I said, it can be caused by a lot of different things. If it's not causing a problem for training and the difference isn't too severe, you can just ignore it. If it is a serious problem, you can debug it by tweaking the architecture of the model and seeing how that impacts the results. You can also look into the evaluate() function and see what it's doing. Perhaps it's using a different MSE function that gives subtly different results; I've seen that before. Maybe someone else will be able to give you more specific advice, but that's all I've got :) – markemus Mar 26 '20 at 19:14

I had the same problem and found a solution. Hopefully this is the same problem you encountered.

It turns out that model.predict doesn't return predictions in the same order as generator.labels, and that is why the MSE was much larger when I attempted to calculate it manually (using the scikit-learn metric function).

    >>> model.evaluate(valid_generator, return_dict=True)['mean_squared_error']
    13.17293930053711
    >>> mean_squared_error(valid_generator.labels, model.predict(valid_generator)[:,0])
    91.1225401637833
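
A common reason for this reordering is a generator created with shuffle=True. If your generator supports that option, rebuilding it with shuffle=False should make the direct comparison line up; the call below is only a sketch of that idea, using the same valid_generator and model names as above.

    >>> # assuming valid_generator has been rebuilt with shuffle=False
    >>> mean_squared_error(valid_generator.labels, model.predict(valid_generator)[:, 0])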

My quick and dirty solution:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    valid_generator.reset()  # Necessary for starting from the first batch
    all_labels = []
    all_pred = []
    for i in range(len(valid_generator)):  # Necessary to avoid an infinite loop
        x, labels_i = next(valid_generator)          # one batch: (inputs, labels)
        pred_i = model.predict(x)[:, 0]              # predictions for this batch
        all_labels.append(labels_i)
        all_pred.append(pred_i)
        print(np.shape(pred_i), np.shape(labels_i))  # sanity check on batch shapes

    cat_labels = np.concatenate(all_labels)
    cat_pred = np.concatenate(all_pred)

The result:

    >>> mean_squared_error(cat_labels, cat_pred)
    13.172956865002352

This can be done much more elegantly, but it was enough for me to confirm my hypothesis about the problem and regain some sanity.
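
For example, the same batch-by-batch pairing can be written a bit more compactly (a sketch under the same assumptions about valid_generator and model as above):

    valid_generator.reset()
    batches = [next(valid_generator) for _ in range(len(valid_generator))]
    cat_labels = np.concatenate([y for _, y in batches])
    cat_pred = np.concatenate([model.predict(x)[:, 0] for x, _ in batches])
    mean_squared_error(cat_labels, cat_pred)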

Shovalt