
I have a question: I'm trying to use KFold and cross_val_score. My goal is to calculate the mean_squared_error, and for this purpose I used the following code:

from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

x = np.random.random((10000, 20))
y = np.random.random((10000, 1))

# train on the last 3,000 samples, test on the first 7,000
x_train = x[7000:]
y_train = y[7000:]

x_test = x[:7000]
y_test = y[:7000]

Model = linear_model.LinearRegression()
Model.fit(x_train,y_train)

y_predicted = Model.predict(x_test)

MSE = mean_squared_error(y_test,y_predicted)
print(MSE)

kfold = KFold(n_splits = 100, random_state = None, shuffle = False)

results = cross_val_score(Model,x,y,cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())

I think everything is fine here; I got the following results:

Results: 0.0828856459279 and -0.083069435946
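As far as I understand, both numbers make sense. The magnitude ~0.083 is expected: with a uniform target on [0, 1] and uninformative random features, the best prediction is the mean, giving MSE ≈ Var(y) = 1/12 ≈ 0.0833. The sign flip is also expected, since cross_val_score maximizes its score, so the neg_mean_squared_error scorer returns the negated MSE:

print(-results.mean())  # positive MSE, comparable to the hold-out MSE above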

But when I try to do the same thing on another example (data from the Kaggle House Prices competition), it does not work properly, or at least I think so:

import pandas as pd

train = pd.read_csv('train.csv')

# fill in missing values
# ...

train = pd.get_dummies(train)
y = train['SalePrice']
train = train.drop(['SalePrice'], axis = 1)

# first 1,000 rows for training, the rest for testing (339 columns after get_dummies)
x_train = train[:1000].values.reshape(-1,339)
y_train = y[:1000].values.reshape(-1,1)
y_train_normal = np.log(y_train)  # log-transform the target

x_test = train[1000:].values.reshape(-1,339)
y_test = y[1000:].values.reshape(-1,1)

Model = linear_model.LinearRegression()
Model.fit(x_train, y_train_normal)  # fit on the log-transformed target

y_predicted = Model.predict(x_test)
y_predicted_transform = np.exp(y_predicted)  # back-transform to the original scale

MSE = mean_squared_error(y_test, y_predicted_transform)
print(MSE)

kfold = KFold(n_splits = 10, random_state = None, shuffle = False)

results = cross_val_score(Model,train,y, cv = kfold, scoring = "neg_mean_squared_error")
print(results.mean())

Here I get the following results: 0.912874946869 and -6.16986926564e+16

Apparently, the mean_squared_error calculated 'manually' is not the same as the mean_squared_error calculated with the help of KFold.

Where did I make a mistake?


1 Answer


The discrepancy arises because, in contrast to your first approach (train/test split), in your CV approach you fit the regression on the unnormalized y data, hence the huge MSE. To get comparable results, you should do the following:

y_normal = np.log(y)            # log-transform the full target for CV
y_test_normal = np.log(y_test)

MSE = mean_squared_error(y_test_normal, y_predicted)  # NOT y_predicted_transform
results = cross_val_score(Model, train, y_normal, cv = kfold, scoring = "neg_mean_squared_error")
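To convince yourself that both numbers now measure the same quantity, here is a minimal sketch (assuming the train DataFrame and y from above) that reproduces cross_val_score with an explicit KFold loop:

import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

kfold = KFold(n_splits = 10, shuffle = False)
X = train.values                  # same feature format in both cases
y_arr = np.log(y.values).ravel()  # log-transformed, 1-D target

fold_mses = []
for train_idx, test_idx in kfold.split(X):
    model = linear_model.LinearRegression()
    model.fit(X[train_idx], y_arr[train_idx])
    pred = model.predict(X[test_idx])
    fold_mses.append(mean_squared_error(y_arr[test_idx], pred))

print(np.mean(fold_mses))  # should equal -results.mean() from cross_val_score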
– desertnaut
  • I tried it, but it still does not give the same results: `926.857139601` and `-7.68871709526e+13`. Thanks anyway. – Dejan Samardžija Feb 09 '18 at 18:07
  • @DejanSamardžija difficult to say exactly, since you don't show even a sample of your data; it may be the case that you need to perform some of the `reshape` operations in the CV case, too. In any case, ensure that the *format* of the data is the same in both cases – desertnaut Feb 09 '18 at 20:22
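As a follow-up to the format point above, one quick check is to print the shapes actually handed to each approach (a sketch, assuming the variables from the question):

print(x_train.shape, y_train_normal.shape)  # what the manual fit sees: (1000, 339) and (1000, 1)
print(train.shape, y.shape)                 # what cross_val_score sees: a DataFrame and a Series

# the manual fit uses 2-D NumPy arrays while cross_val_score is fed a
# DataFrame and a Series; passing train.values and np.log(y.values).ravel()
# to cross_val_score makes the two setups directly comparable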