
I have created a multiple linear regression model on some data (housing prices for the Seattle area) with GraphLab Create and one with Scikit-Learn. The test and training sets are chosen at random, but I've used the same split ratio (80/20) for both. However, the results are very different.
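
For context, the split on the scikit-learn side was done roughly like this (a minimal sketch; features and target are placeholder names for my feature columns and the price column):

from sklearn.model_selection import train_test_split

# 80/20 random split; no random_state is fixed, so each run draws a different sample
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.2)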

The mean error for the GraphLab model is 106254.49, while for the Scikit-Learn model it's 168980.44.

The code to create the GraphLab model is from an online course, so I assume it's correct. The one I wrote for the Scikit model is:

from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(train_features, train_target)
test_predictions = model.predict(test_features)
errors = abs(test_predictions - test_target)  # per-row absolute error on the test set
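
The mean error I'm quoting is simply the mean of those absolute errors; the same figure can be obtained with sklearn.metrics (a sketch using the arrays above):

mean_error = errors.mean()  # ~168980 on my run

# equivalent built-in helper:
from sklearn.metrics import mean_absolute_error
mean_error = mean_absolute_error(test_target, test_predictions)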

I understand that the data for the two models is not exactly the same because both samples were chosen at random, but with a training set of about 17k rows and a test set of about 4k rows, I wouldn't expect such a big difference.

Any suggestions? Am I doing something wrong with the Scikit linear regression?

In essence, I would like to be able to replicate the GraphLab model using Scikit and get very similar performance.

Thanks

CCSwift
  • Even with this sample size, it's better to specify random_state in your train_test_split call to make sure your comparisons are accurate. Is the difference still that significant after using the same random_state? – MaximeKan Jul 13 '19 at 16:14
  • @MaximeKan I've managed to rebuild the linear regression model using scikit-learn and exactly the same data. The average error is now 167245, so barely any lower than the first attempt using different data and still very different from the 106254 I get from GraphLab Create. Any idea? – CCSwift Jul 14 '19 at 18:02
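
A rough sketch of what the first comment suggests (the random_state value is arbitrary; it just needs to stay fixed so the split is reproducible between runs):

# same 80/20 split as before, but seeded so repeated runs see identical rows
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.2, random_state=0)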

0 Answers