I am trying to run a linear regression with Spark, but it gives me badly wrong predictions:

The data source: [image not included]

The program:

from pyspark.mllib.regression import LinearRegressionWithSGD


def linear_regression(data):
    """
    Run linear regression on the data and return the model with its predictions.

    `data` is expected to be an RDD of LabeledPoint.
    """
    # Build the model
    model = LinearRegressionWithSGD.train(data, iterations=100, step=0.1, intercept=True)
    # Pair each real label with the model's prediction for that point
    real_and_predicted = data.map(lambda p: (p.label, model.predict(p.features)))
    real_and_predicted = real_and_predicted.collect()

    return model, real_and_predicted

The result: [image not included]

The results are really wrong! Is there a problem in my code?
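
For illustration, here is a minimal sketch of one way plain gradient descent can produce wildly wrong numbers: a step that is too large for unscaled features. It does not use Spark, the data is made up, and `gradient_descent`, `standardize` and `predict_scaled` are hypothetical helpers, so treat it as a possible explanation rather than a diagnosis of the data above.

```python
def gradient_descent(xs, ys, step, iterations=100):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= step * grad_w
        b -= step * grad_b
    return w, b


def standardize(xs):
    """Rescale values to zero mean and unit variance; also return mean and std."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs], mean, std


# Made-up data (not the data from the question): y = 3x + 5 with large x values
xs = [100.0, 200.0, 300.0, 400.0, 500.0]
ys = [3 * x + 5 for x in xs]

# Same step as in the question, raw features: the updates overshoot and blow up
w_big, b_big = gradient_descent(xs, ys, step=0.1)

# Tiny step, raw features: the slope settles near 3, the intercept barely moves
w_small, b_small = gradient_descent(xs, ys, step=1e-6)

# Standardized features: step=0.1 now converges within 100 iterations
scaled_xs, mean, std = standardize(xs)
w_scaled, b_scaled = gradient_descent(scaled_xs, ys, step=0.1)


def predict_scaled(x):
    """Predict for a raw x using the model trained on standardized features."""
    return w_scaled * (x - mean) / std + b_scaled


print("step=0.1, raw features:   ", w_big, b_big)
print("step=1e-6, raw features:  ", w_small, b_small)
print("step=0.1, scaled features:", predict_scaled(100.0), "vs true", ys[0])
```

If the real features have a large range, rescaling them before training is usually more robust than hunting for a step small enough to avoid divergence.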

  • Thanks @zero323 for the link. I had to change the step to `step=0.0005` in my case. Higher steps give negative and high values, while lower steps give a lower `correlation coefficient`. Even with `step=0.0005`, the `correlation coefficient` is `0.67`, which is not really good :(. – rom Oct 20 '15 at 14:52
  • Well, if you can open your data in a spreadsheet, you can easily use a closed-form solution :) There is no reason to use SGD. – zero323 Oct 20 '15 at 22:23
  • I can't process it in a spreadsheet unfortunately :(. What is a closed form solution? If I don't use `SGD`, what should I use? – rom Oct 21 '15 at 08:46
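
As a rough sketch of what "closed form" means here: ordinary least squares can be computed directly by solving the normal equations (XᵀX)β = Xᵀy, with no step size and no iterations. The pure-Python example below uses made-up data, and `solve_linear_system` and `ols_closed_form` are hypothetical helper names, not part of Spark. It assumes the data fits in memory on one machine.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_closed_form(features, labels):
    """Return [intercept, w1, ..., wk] minimizing the squared error."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 models the intercept
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data (not the data from the question) following y = 2*x1 - x2 + 4
features = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
labels = [2 * x1 - x2 + 4 for x1, x2 in features]

coefficients = ols_closed_form(features, labels)
print(coefficients)  # approximately intercept 4, weights 2 and -1
```

In practice a library routine such as `numpy.linalg.lstsq` would be the usual choice over hand-written elimination; the version above only shows what is being computed.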

0 Answers