
While working with a linear regression model, I split the data into a training set and a test set, then calculated R^2, RMSE, and MAE using the following:

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)
R2 = lm.score(X, y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)

I thought that I was calculating R^2 for the entire data set (rather than comparing the training data with the original data). However, since you must fit the model before you can score it, I'm not sure whether I'm scoring the original data (the X and y I passed to .score) or the data that I used to fit the model (X_train and y_train). When I run:

lm.fit(X_train, y_train)
lm.score(X_train, y_train)

I get a different result than when I scored X and y. So my question is: are the inputs to .score evaluated against the model that was fitted (making lm.fit(X, y); lm.score(X, y) the R^2 value for the original data, and lm.fit(X_train, y_train); lm.score(X, y) the R^2 value for the original data as predicted by the model fitted in .fit), or is something else entirely happening?
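
To make the comparison concrete, here are the two calls side by side, continuing from the snippet above (so `lm` has already been fitted on `X_train`, `y_train`):

r2_full = lm.score(X, y)                # R^2 of the train-fitted model evaluated on the full data set
r2_train = lm.score(X_train, y_train)   # R^2 of the same model evaluated on the training split
print(r2_full, r2_train)                # these give different values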

  • IIRC `.score` is a shortcut to run `.predict` and then calculate the accuracy. So you should only hand it `X_test` and `y_test`. – L3viathan May 13 '16 at 18:56
  • @L3viathan Spot on: when I run `lm.score(X_test, y_pred)` the result is 1.0, which confirms your explanation. – Swede May 13 '16 at 19:08
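
For a scikit-learn regressor, the shortcut described in the comment above can be checked directly: `.score(X, y)` predicts on `X` and returns the R^2 against `y`. A quick sketch, assuming `lm` has already been fitted:

from sklearn import metrics

r2_via_score = lm.score(X_test, y_test)
r2_via_predict = metrics.r2_score(y_test, lm.predict(X_test))
# the two values match: .score is a shortcut for predict followed by the R^2 metric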

1 Answer


fit() only fits the model to the data, which is synonymous with training: fitting the data means training on it. score() is something like testing or predicting.

So you should use different data sets for training the classifier and for testing its accuracy. You can do it like this:

from sklearn import cross_validation, neighbors

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
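
The same pattern carries over to the linear regression in the question. A sketch, assuming the current scikit-learn API (where `train_test_split` lives in `sklearn.model_selection`), evaluating R^2, RMSE, and MAE on the test split only:

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lm = LinearRegression()
lm.fit(X_train, y_train)                                    # train on the training split only

R2 = lm.score(X_test, y_test)                               # R^2 of the fitted model on the test split
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))  # RMSE on the test split
MAE = metrics.mean_absolute_error(y_test, y_pred)           # MAE on the test split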