
I have been trying to build a regression model in Spark using some custom data, and the intercept and weights are always nan. This is my data:

data = [LabeledPoint(0.0, [27022.0]), LabeledPoint(1.0, [27077.0]), LabeledPoint(2.0, [27327.0]), LabeledPoint(3.0, [27127.0])]

Output:

(weights=[nan], intercept=nan)  

However, if I use this dataset (taken from the Spark examples), it returns a non-nan weight and intercept.

data = [LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0]), LabeledPoint(3.0, [2.0]),LabeledPoint(2.0, [3.0])]

Output:

(weights=[0.798729902914], intercept=0.3027117101297481) 

This is my current code:

model = LinearRegressionWithSGD.train(sc.parallelize(data), intercept=True)

Am I missing something? Is it because the numbers in my data are that big? It is my first time using MLlib, so I might be missing some details.

Thanks

zero323
user3276768

1 Answer


MLlib linear regression is SGD based, so you need to tune the number of iterations and the step size; see https://spark.apache.org/docs/latest/mllib-optimization.html.

I tried your custom data like this and got some results (in Scala):

val numIterations = 20
val model = LinearRegressionWithSGD.train(sc.parallelize(data), numIterations)
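To see why the raw feature values around 27,000 cause nan, here is a minimal sketch in plain Python (no Spark), assuming ordinary batch gradient descent on a 1-D least-squares model, which is the same kind of update SGD performs. The `gd` helper and the standardization step are illustrative, not part of the MLlib API:

```python
# Plain gradient descent on y = w*x + b (mean squared error),
# mimicking the per-step update that SGD-based linear regression performs.
def gd(xs, ys, step, iters=100):
    w = b = 0.0
    n = float(len(xs))
    for _ in range(iters):
        gw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= step * gw
        b -= step * gb
    return w, b

xs = [27022.0, 27077.0, 27327.0, 27127.0]  # the custom data from the question
ys = [0.0, 1.0, 2.0, 3.0]

# With the raw features, the gradient carries a factor of x**2 (~7e8),
# so a step size of 1.0 wildly overshoots and the weights blow up.
w, b = gd(xs, ys, step=1.0)

# Standardizing the feature (or shrinking the step size accordingly)
# keeps the updates stable, and the same data converges.
mean = sum(xs) / len(xs)
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
w2, b2 = gd([(x - mean) / std for x in xs], ys, step=0.5, iters=500)

print(w, b)    # not finite (nan/inf)
print(w2, b2)  # finite; intercept near the label mean 1.5
```

The same reasoning applies in Spark: either scale the features before training, or pass a much smaller `step` to `LinearRegressionWithSGD.train` when the raw feature values are large.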
selvinsource