
I stumbled across the definition of MSE in Keras and I can't seem to find an explanation for it.

def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)

I was expecting the mean to be taken across the batch dimension, which is axis=0, but instead it is axis=-1.

I also played around with it a little to see whether K.mean actually behaves like numpy.mean. I must have misunderstood something. Can somebody please clarify?

I can't actually take a look inside the cost function at run time, right? As far as I know, the function is called at compile time, which prevents me from evaluating concrete values.

I mean... imagine doing regression with a single output neuron and training with a batch size of ten.

>>> import numpy as np
>>> a = np.ones((10, 1))
>>> a
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
>>> np.mean(a, axis=-1)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

All it does is flatten the array instead of taking the mean of all the predictions.
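For comparison, taking the mean across the batch dimension (axis=0) would instead collapse the ten predictions into a single value:

>>> np.mean(a, axis=0)
array([ 1.])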

– Nima Mousavi

3 Answers


K.mean(a, axis=-1), like np.mean(a, axis=-1), just takes the mean across the final dimension. Here a is an array with shape (10, 1), and in this case taking the mean across the final dimension happens to be the same as flattening it to a 1-d array of shape (10,). Implementing it like this supports the more general case of, e.g., multiple linear regression.
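To make that concrete, here is a small numpy sketch (with made-up shapes) of the multi-output case where axis=-1 matters:

import numpy as np

# Hypothetical batch of 10 samples, each with 3 regression targets.
y_true = np.zeros((10, 3))
y_pred = np.ones((10, 3))

# Averaging over the last axis gives one MSE value per sample, not one per batch.
per_sample_mse = np.mean(np.square(y_pred - y_true), axis=-1)
print(per_sample_mse.shape)   # (10,)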

Also, you can inspect the value of nodes in the computation graph at run time using keras.backend.print_tensor. See this answer: Is there any way to debug a value inside a tensor while training on Keras?
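For example, a rough sketch of a debuggable version of the loss (assuming the usual backend import, from keras import backend as K); K.print_tensor returns its input unchanged, so it can be dropped into the graph:

from keras import backend as K

def mean_squared_error_debug(y_true, y_pred):
    diff = y_pred - y_true
    # printed each time the loss is evaluated during training
    diff = K.print_tensor(diff, message='y_pred - y_true = ')
    return K.mean(K.square(diff), axis=-1)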

Edit: Your question appears to be about why the loss doesn't return a single scalar value but instead returns a scalar value for each data point in the batch. To support sample weighting, Keras losses are expected to return a scalar for each data point in the batch. See the losses documentation and the sample_weight argument of fit for more information. Note specifically: "The actual optimized objective is the [weighted] mean of the output array across all data points."
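As a rough numpy sketch of that idea (the exact normalization Keras applies may differ):

import numpy as np

per_sample_loss = np.array([0.5, 1.0, 2.0, 4.0])   # one entry per data point in the batch
sample_weight   = np.array([1.0, 1.0, 0.0, 2.0])   # hypothetical weights passed to fit()

# the "[weighted] mean of the output array across all data points"
optimized_objective = np.mean(per_sample_loss * sample_weight)
print(optimized_objective)   # 2.375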

– tiao
    I know that it does what it does. My question is: Why does it do it? The first dimension would be the batch size... so why is it not taking the mean across axis=0. – Nima Mousavi Feb 05 '18 at 13:49

The code is as follows:

def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)

One application where choosing axis=-1 makes sense is image data. A colored picture has 3 channels (RGB); if each channel is 512 × 512 pixels, the image is stored in an array of shape 512 × 512 × 3.

Suppose your task is to reconstruct the picture, and you store the reconstruction in another array of shape 512 × 512 × 3.

Calling the MSE then lets you analyze how good your reconstruction is at each pixel: the output has shape 512 × 512, summarizing your performance per pixel.
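A quick numpy sketch, with random arrays standing in for the real image and its reconstruction:

import numpy as np

original      = np.random.rand(512, 512, 3)   # hypothetical ground-truth RGB image
reconstructed = np.random.rand(512, 512, 3)   # hypothetical model output

# Averaging only over the channel axis gives one error value per pixel.
per_pixel_mse = np.mean(np.square(reconstructed - original), axis=-1)
print(per_pixel_mse.shape)    # (512, 512)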

– Siong Thye Goh

I had the same question as you. After some experimenting, I believe it does not matter whether the loss returns a scalar or a tensor; the Keras (TensorFlow) framework can handle both. For instance, if you apply K.tf.reduce_mean() to get a scalar rather than a vector, the framework just adds one more node to the graph and computes the gradient through the reduce_mean(). By the gradient chain rule, the result is not affected.
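Here is a minimal sketch of that claim using TensorFlow 2 eager execution (plain tf rather than the older K.tf API mentioned above, and not the exact code path Keras uses internally): the gradient of an explicitly reduced scalar loss matches the averaged gradient of the per-sample loss vector.

import tensorflow as tf

w = tf.Variable(3.0)                      # toy model: y_pred = w * x
x = tf.constant([1.0, 2.0, 3.0, 4.0])
y = tf.constant([2.0, 4.0, 6.0, 8.0])

with tf.GradientTape(persistent=True) as tape:
    per_sample = tf.square(w * x - y)     # vector loss: one value per sample
    scalar = tf.reduce_mean(per_sample)   # explicitly reduced scalar loss

# Gradient of the explicit scalar loss:
g_scalar = tape.gradient(scalar, w)
# Gradient when the reduction is left to the framework
# (tape.gradient of a vector target sums the per-element gradients):
g_vector = tape.gradient(per_sample, w) / tf.cast(tf.size(per_sample), tf.float32)

print(float(g_scalar), float(g_vector))   # both 15.0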

– fandulu