0

I'm trying to plot the mean squared error for different theta values.

This is my df called tips:

tips = array([ 1.01,  1.66,  3.5 ,  3.31,  3.61,  4.71,  2.  ,  3.12,  1.96,
            3.23,  1.71,  5.  ,  1.57,  3.  ,  3.02,  3.92,  1.67,  3.71,
            3.5 ,  3.35,  4.08,  2.75,  2.23,  7.58,  3.18,  2.34,  2.  ,
            2.  ,  4.3 ,  3.  ,  1.45,  2.5 ,  3.  ,  2.45,  3.27,  3.6 ,
            2.  ,  3.07,  2.31,  5.  ,  2.24,  2.54,  3.06,  1.32,  5.6 ,
            3.  ,  5.  ,  6.  ,  2.05,  3.  ,  2.5 ,  2.6 ,  5.2 ,  1.56,
            4.34,  3.51,  3.  ,  1.5 ,  1.76,  6.73,  3.21,  2.  ,  1.98,
            3.76,  2.64,  3.15,  2.47,  1.  ,  2.01,  2.09,  1.97,  3.  ,
            3.14,  5.  ,  2.2 ,  1.25,  3.08,  4.  ,  3.  ,  2.71,  3.  ,
            3.4 ,  1.83,  5.  ,  2.03,  5.17,  2.  ,  4.  ,  5.85,  3.  ,
            3.  ,  3.5 ,  1.  ,  4.3 ,  3.25,  4.73,  4.  ,  1.5 ,  3.  ,
            1.5 ,  2.5 ,  3.  ,  2.5 ,  3.48,  4.08,  1.64,  4.06,  4.29,
            3.76,  4.  ,  3.  ,  1.  ,  4.  ,  2.55,  4.  ,  3.5 ,  5.07,
            1.5 ,  1.8 ,  2.92,  2.31,  1.68,  2.5 ,  2.  ,  2.52,  4.2 ,
            1.48,  2.  ,  2.  ,  2.18,  1.5 ,  2.83,  1.5 ,  2.  ,  3.25,
            1.25,  2.  ,  2.  ,  2.  ,  2.75,  3.5 ,  6.7 ,  5.  ,  5.  ,
            2.3 ,  1.5 ,  1.36,  1.63,  1.73,  2.  ,  2.5 ,  2.  ,  2.74,
            2.  ,  2.  ,  5.14,  5.  ,  3.75,  2.61,  2.  ,  3.5 ,  2.5 ,
            2.  ,  2.  ,  3.  ,  3.48,  2.24,  4.5 ,  1.61,  2.  , 10.  ,
            3.16,  5.15,  3.18,  4.  ,  3.11,  2.  ,  2.  ,  4.  ,  3.55,
            3.68,  5.65,  3.5 ,  6.5 ,  3.  ,  5.  ,  3.5 ,  2.  ,  3.5 ,
            4.  ,  1.5 ,  4.19,  2.56,  2.02,  4.  ,  1.44,  2.  ,  5.  ,
            2.  ,  2.  ,  4.  ,  2.01,  2.  ,  2.5 ,  4.  ,  3.23,  3.41,
            3.  ,  2.03,  2.23,  2.  ,  5.16,  9.  ,  2.5 ,  6.5 ,  1.1 ,
            3.  ,  1.5 ,  1.44,  3.09,  2.2 ,  3.48,  1.92,  3.  ,  1.58,
            2.5 ,  2.  ,  3.  ,  2.72,  2.88,  2.  ,  3.  ,  3.39,  1.47,
            3.  ,  1.25,  1.  ,  1.17,  4.67,  5.92,  2.  ,  2.  ,  1.75,
            3.  ])

This is my squared loss function:

def squared_loss(y_obs, theta):
"""
Calculate the squared loss of the observed data and a summary statistic.

Parameters
------------
y_obs: an observed value
theta : some constant representing a summary statistic

Returns
------------
The squared loss between the observation and the summary statistic.
"""
return (y_obs - theta) ** 2

This is my Mean Squared Error Function:

def mean_squared_error(theta, data):
    
    return sum(squared_loss(data, theta)) / len(data)

This is the problem: In the cell below plot the mean squared error for different theta values. Note that theta_values are given. Make sure to label the axes on your plot. Remember to use the tips variable we defined earlier.

theta_values = np.linspace(0, 6, 100)

plt.plot(mean_squared_error(theta_values, tips))

This gives me the following: ValueError: operands could not be broadcast together with shapes (244,) (100,)

If the plot is correct, the observed minimization point should be 3. Does anyone know what I can do to get my plot to show up? I was thinking something like a for loop but not really sure.

Thanks!

Edit: Trying 244 in theta_values, though theta_values should be a given and not be touched.

enter image description here

user3085496
  • 175
  • 1
  • 2
  • 10

1 Answers1

1

The error comes because you are trying to broadcast 2 arrays of different sizes using your mean squared error function. Your tips df has size 244, and when you created your theta values array you set it to 100 equally spaced values between 0 and 6, resulting in a size of 100.

By using

theta_values = np.linspace(0, 6, 244)

you will create a theta_values variable with 244 values which will properly map to the tips dataframe and not cause an issue when calculating your MSE.

EDIT: To accommodate OP update, assuming the plot is meant to be squared error (SE) vs theta. The entire code to compute is shown below; along with the output plot. Reminder what is plotted is the squared error (i.e. error between y_true (assumed to be tips) and y_pred (assumed to be theta) squared) versus theta. The output does seem to show less fluctuation around 3, (as suggested by OP) but more clarification from OP is required.

import numpy as np
import matplotlib.pyplot as plt

tips = np.array([ 1.01,  1.66,  3.5 ,  3.31,  3.61,  4.71,  2.  ,  3.12,  1.96,
               3.23,  1.71,  5.  ,  1.57,  3.  ,  3.02,  3.92,  1.67,  3.71,
               3.5 ,  3.35,  4.08,  2.75,  2.23,  7.58,  3.18,  2.34,  2.  ,
               2.  ,  4.3 ,  3.  ,  1.45,  2.5 ,  3.  ,  2.45,  3.27,  3.6 ,
               2.  ,  3.07,  2.31,  5.  ,  2.24,  2.54,  3.06,  1.32,  5.6 ,
               3.  ,  5.  ,  6.  ,  2.05,  3.  ,  2.5 ,  2.6 ,  5.2 ,  1.56,
               4.34,  3.51,  3.  ,  1.5 ,  1.76,  6.73,  3.21,  2.  ,  1.98,
               3.76,  2.64,  3.15,  2.47,  1.  ,  2.01,  2.09,  1.97,  3.  ,
               3.14,  5.  ,  2.2 ,  1.25,  3.08,  4.  ,  3.  ,  2.71,  3.  ,
               3.4 ,  1.83,  5.  ,  2.03,  5.17,  2.  ,  4.  ,  5.85,  3.  ,
               3.  ,  3.5 ,  1.  ,  4.3 ,  3.25,  4.73,  4.  ,  1.5 ,  3.  ,
               1.5 ,  2.5 ,  3.  ,  2.5 ,  3.48,  4.08,  1.64,  4.06,  4.29,
               3.76,  4.  ,  3.  ,  1.  ,  4.  ,  2.55,  4.  ,  3.5 ,  5.07,
               1.5 ,  1.8 ,  2.92,  2.31,  1.68,  2.5 ,  2.  ,  2.52,  4.2 ,
               1.48,  2.  ,  2.  ,  2.18,  1.5 ,  2.83,  1.5 ,  2.  ,  3.25,
               1.25,  2.  ,  2.  ,  2.  ,  2.75,  3.5 ,  6.7 ,  5.  ,  5.  ,
               2.3 ,  1.5 ,  1.36,  1.63,  1.73,  2.  ,  2.5 ,  2.  ,  2.74,
               2.  ,  2.  ,  5.14,  5.  ,  3.75,  2.61,  2.  ,  3.5 ,  2.5 ,
               2.  ,  2.  ,  3.  ,  3.48,  2.24,  4.5 ,  1.61,  2.  , 10.  ,
               3.16,  5.15,  3.18,  4.  ,  3.11,  2.  ,  2.  ,  4.  ,  3.55,
               3.68,  5.65,  3.5 ,  6.5 ,  3.  ,  5.  ,  3.5 ,  2.  ,  3.5 ,
               4.  ,  1.5 ,  4.19,  2.56,  2.02,  4.  ,  1.44,  2.  ,  5.  ,
               2.  ,  2.  ,  4.  ,  2.01,  2.  ,  2.5 ,  4.  ,  3.23,  3.41,
               3.  ,  2.03,  2.23,  2.  ,  5.16,  9.  ,  2.5 ,  6.5 ,  1.1 ,
               3.  ,  1.5 ,  1.44,  3.09,  2.2 ,  3.48,  1.92,  3.  ,  1.58,
               2.5 ,  2.  ,  3.  ,  2.72,  2.88,  2.  ,  3.  ,  3.39,  1.47,
               3.  ,  1.25,  1.  ,  1.17,  4.67,  5.92,  2.  ,  2.  ,  1.75,
               3.  ])

theta_values = np.linspace(0, 6, 244)


def sqr_err(y_true, y_pred):
    """

    :param y_true: true values of y
    :param y_pred: predicted values of y
    :return: array of lenght original data containing mean squared error for each predictions
    """
    if len(y_true) != len(y_pred):
        raise IndexError("Mismathced array sizes, you inputted arrays with sizes {} and {}".format(len(y_true),
                                                                                                  len(y_pred)))
    else:
        length = len(y_true)

    sqrerror_out = [(y_pred[i]-y_true[i])**2 for i in range(length)]

    return np.array(sqrerror_out)


theta_value = np.linspace(0, 6, 244)

Squared_error = sqr_err(tips, theta_value)

plt.figure()
plt.plot(theta_values, Squared_error)
plt.xlabel('Theta Values')
plt.ylabel('Squared Error')
plt.show()

Squared error plot of theta and tips

Taylr Cawte
  • 582
  • 4
  • 16
  • Please look at edit in question to see my output with 244 instead. Nothing shows up, also, "theta_values" is a given so it shouldn't be touched at all. Thanks! – user3085496 Sep 23 '20 at 20:41
  • I think there is something wrong with the question then.... Comparing arrays of different sizes is not normal... right now you have the tips array of size (244,) and the theta array of size (100,). If trying to compute the MSE between the "tips" and the "theta" array you have more observations in one dataset than the other so literally those extra observations cannot be compared... i'll continue to solve the plotting problem as if there were (244,) values in the "theta" array – Taylr Cawte Sep 24 '20 at 02:33
  • @user3085496 additionally, you are right now trying to plot the **"Mean Squared Error"**, which, for a dataset, is a single value.... so if you want to plot the **single value MSE** you can just do plt.bar() to get a bar graph... if you want to plot how **squared error** changes with **theta** or some other parameter then you will need to rewrite your code slightly, but please let me know which of these objectives you are trying to achieve. – Taylr Cawte Sep 24 '20 at 02:50
  • I figured out how to do it. It's supposed to be plt.plot(theta_values, mean_squared_error(t, tips) for t in theta_values) – user3085496 Sep 24 '20 at 13:09