I'm trying to graph the mean squared error (MSE) of my data, and I'm having a little difficulty figuring out how to do it. I know you need both the "true" values and the "predicted" values to get the MSE, but the way my project is laid out is quite confusing.
I have a method in which I generate a model like so:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def fit_curve(X, y, degree):
    # expand X into polynomial features of the given degree
    poly_features = PolynomialFeatures(degree=degree)
    x_poly = poly_features.fit_transform(X)
    # fit a linear regression on the expanded features
    linreg = LinearRegression()
    model = linreg.fit(x_poly, y)
    return model
This returns a model that's already trained.
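For context, this is roughly how I end up calling it (the toy X_train/y_train below are just placeholders to show the call, not my real data):

import numpy as np

# hypothetical toy data purely to illustrate the call
X_train = np.linspace(0, 1, 20).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train).ravel() + np.random.normal(scale=0.1, size=20)

model = fit_curve(X_train, y_train, 3)  # comes back already fitted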
Then I'm supposed to find the mean squared error for that model. I'm not sure how to do this, since the model has already been trained and fit_curve doesn't return the predicted values. Right now my method that calculates mse is:
from sklearn.metrics import mean_squared_error

def mse(X, y, degree, model):
    # rebuild the polynomial features and refit a fresh regression,
    # even though a trained model is already passed in
    poly_features = PolynomialFeatures(degree=degree)
    linreg = LinearRegression()
    x_poly = poly_features.fit_transform(X)
    linreg.fit(x_poly, y)
    y_predict = linreg.predict(x_poly)
    mse = mean_squared_error(y_predict, y)
    return mse
I feel like a lot of the code I use in mse is very redundant when compared to fit_curve. Unfortunately, the guidelines say that this is the way I need to do it (with mse taking X, y, degree, and model).
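For what it's worth, this is the kind of thing I was picturing to avoid the duplication: keep the required (X, y, degree, model) signature, but actually reuse the trained model instead of refitting. It's just a sketch of the idea (the name mse_reusing_model is mine), not necessarily what the guidelines intend:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

def mse_reusing_model(X, y, degree, model):
    # transform X the same way fit_curve did, then predict with the model that was passed in
    x_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    y_predict = model.predict(x_poly)
    return mean_squared_error(y, y_predict)

That would at least avoid building and training a second LinearRegression inside mse.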
I think it's also worth noting that my current mse works correctly up until about degree 13-14, where the value it produces on the graph no longer matches the solution I was given. I'm not sure why it isn't working perfectly, because I thought this was the right idea.