
I generated an R² value from cross_validation.cross_val_score, which is about 0.35. I then fitted the same model on the same training dataset and used the r2_score function to compute R², which is about 0.87. I wonder why I was given two results with such a large difference. Any help will be appreciated. The code is attached below.

# sklearn < 0.20 imports, matching the cross_validation API used below
from sklearn import cross_validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

seed = 7  # any fixed seed; the original value is not shown in the post

num_folds = 2
num_instances = len(X_train)
scoring = 'r2'

models = []
models.append(('RF', RandomForestRegressor()))
results = []
names = []
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train,
                                                  cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

model.fit(X_train, Y_train)
train_pred = model.predict(X_train)
print(r2_score(Y_train, train_pred))
  • In `cross_val_score`, the scores returned are calculated on the test data of each fold and then averaged. In the second part, you are calculating the score on the training data, which in most cases will be higher (because the model has been trained on that data). – Vivek Kumar Aug 13 '18 at 06:10
  • Thank you. But why is there so much difference, 0.35 vs. 0.87? – lionking19063 Aug 13 '18 at 13:45
  • Maybe your model is overfitting too much, and hence the training score is much higher than the test score. It depends on the data. – Vivek Kumar Aug 13 '18 at 13:48

1 Answer


Actually, they are the same metric. In both cases you used R². For the cross-validation score, you split the training set into 2 parts (num_folds = 2), computed R² on the held-out part of each fold, and averaged the two values (cv_results.mean()). To sum up, you used R² on held-out folds as a validation score, whereas r2_score evaluates the model on the whole training set it was fitted on, which is why it comes out much higher.
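For what it's worth, the gap is easy to reproduce on any dataset. Below is a minimal sketch using synthetic data (not your data) and the current `sklearn.model_selection` API, which replaced the `cross_validation` module in scikit-learn 0.20 — the two scores are computed exactly as in your post, and the training-set R² comes out higher than the cross-validated one:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import r2_score

# Synthetic regression data, purely for illustration
rng = np.random.RandomState(7)
X = rng.rand(200, 5)
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

model = RandomForestRegressor(random_state=7)

# Score 1: R^2 on held-out folds -- each fold is scored on data the
# model was NOT trained on, then the fold scores are averaged.
kfold = KFold(n_splits=2, shuffle=True, random_state=7)
cv_scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')

# Score 2: R^2 on the training data itself -- the model has already
# seen every row, so this number is almost always optimistically high.
model.fit(X, y)
train_r2 = r2_score(y, model.predict(X))

print("CV R^2 (held-out):", cv_scores.mean())
print("Train R^2:        ", train_r2)
```

The training-set score is the optimistic one; the cross-validated score is the better estimate of how the model will do on unseen data.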