
Starting from an example, I was trying to do linear regression with Spark MLlib. The problem is that I get the wrong result: the intercept should be 2.2.

I tried adding .optimizer.setStepSize(0.1), which I found in another post, but I still get a weird result. Any suggestions?
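For reference, this is how I wired in the step size through the optimizer, following that post (0.1 and 1000 are just the values I tried, not tuned):

// Configure SGD by hand instead of using the static train() helper
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
  .setStepSize(0.1)
  .setNumIterations(1000)
val model = algorithm.run(parsedData)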

This is the set of data

1,2
2,4
3,5
4,4
5,5
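
For reference, treating the first column as x and the second as y (which is how I get the expected 2.2), the closed-form least-squares fit is slope 0.6, intercept 2.2. A quick plain-Scala check:

// Closed-form simple linear regression on the five points above,
// assuming column 1 is x and column 2 is y
val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
val y = Array(2.0, 4.0, 5.0, 4.0, 5.0)
val mx = x.sum / x.length                                          // 3.0
val my = y.sum / y.length                                          // 4.0
val cov  = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum // 6.0
val varX = x.map(a => (a - mx) * (a - mx)).sum                     // 10.0
val slope = cov / varX          // 0.6
val intercept = my - slope * mx // 4.0 - 0.6 * 3.0 = 2.2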

Code:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

object linearReg {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("linearReg").setMaster("local")
    val sc=new SparkContext(sparkConf)
    // Load and parse the data
    val data = sc.textFile("/home/daniele/dati.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      // Prepend 1.0 so the first weight serves as the intercept term
      LabeledPoint(parts(0).toDouble, Vectors.dense(Array(1.0) ++ parts(1).split(' ').map(_.toDouble)))
    }.cache()
    parsedData.collect().foreach(println)
    // Building the model
    val numIterations = 1000
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)
    println("Interceptor:"+model.intercept)
    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    valuesAndPreds.collect().foreach(println)
    val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
    println("training Mean Squared Error = " + MSE)

    // Save and load model
    model.save(sc, "myModelPath")
    val sameModel = LinearRegressionModel.load(sc, "myModelPath")
  }
}

Result:

weights: [-4.062601003207354E25], intercept: -9.484399253945647E24

Update: I switched to the .train method and prepended 1.0 to the features for the intercept. The data now appear with the 1.0 prepended.
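
With the parsing above, the printed LabeledPoints should come out like this (reconstructed from the code, since each LabeledPoint prints as (label,[features])):

(1.0,[1.0,2.0])
(2.0,[1.0,4.0])
(3.0,[1.0,5.0])
(4.0,[1.0,4.0])
(5.0,[1.0,5.0])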


1 Answer


You are using run, which means the data you are passing in is being interpreted as configuration parameters rather than as features to be regressed.

The docs contain good examples of training then running your model:

//note the "train" instead of "run"
val numIterations = 1000
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

The result is a more accurate weight:

scala> model.weights
res4: org.apache.spark.mllib.linalg.Vector = [0.7674418604651163]

If you want to add an intercept, just place a 1.0 value as a feature in your dense Vector. Modify your example code:

...
LabeledPoint(parts(0).toDouble, Vectors.dense(Array(1.0) ++ parts(1).split(' ').map(_.toDouble)))
...

The first feature is then your intercept.
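
Alternatively, LinearRegressionWithSGD inherits setIntercept from GeneralizedLinearAlgorithm, so you can let MLlib fit the intercept itself instead of prepending the 1.0. A sketch (the step size 0.1 is the value from the question; you may still need to tune it):

val algorithm = new LinearRegressionWithSGD()
algorithm.setIntercept(true) // let MLlib fit the intercept directly
algorithm.optimizer
  .setStepSize(0.1)
  .setNumIterations(1000)
val model = algorithm.run(parsedData) // features without the extra 1.0 column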
