
Check the update at the bottom of the question

Summary: I have a dataset that does not behave linearly. I am trying to use Spark's MLlib (v1.5.2) to fit a model that behaves more like a polynomial function, but I always get a linear model as a result. I don't know whether it is possible to obtain a non-linear model using linear regression.

[TL;DR] I am trying to fit a model that represents the following data sufficiently well:

enter image description here

My code is very simple (pretty much like in every tutorial):

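import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegressionTest {

   def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local[2]", "Linear Regression")
      val data = sc.textFile("data2.csv")
      // Column 1 holds the label, column 2 the single input feature
      val parsedData = data.map { line =>
         val parts = line.split(',')
         LabeledPoint(parts(1).toDouble, Vectors.dense(parts(2).toDouble))
      }.cache()

      val numIterations = 1000
      val stepSize = 0.001

      val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
      sc.stop()
   }
}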

The results are in the right range, but they always lie on a monotonically increasing line. I am trying to wrap my head around it, but I cannot figure out why a better curve is not being fitted.

Any tips?

Thanks everyone

Update: The problem was caused by the version of the spark and spark-ml libraries that we were using. For some reason, version 1.5.2 did not fit a better curve even when I provided more features (squared or cubic versions of the input data). After upgrading to version 2.0.0 and switching from the deprecated LinearRegressionWithSGD to the LinearRegression of the main API (not the RDD API), the algorithm behaved as expected and the model fitted the right curve.
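
A minimal sketch of what the switch described in the update might look like with the DataFrame-based API in Spark 2.0. It is not the code that was actually used: the object and helper names are hypothetical, the CSV layout (no header, label in column 1, input in column 2) is assumed from the code earlier in the question, and PolynomialExpansion is only one way of producing the squared feature.

```scala
import org.apache.spark.ml.feature.{PolynomialExpansion, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{DataFrame, SparkSession}

object PolynomialRegressionSketch {

  // Hypothetical helper: expects numeric columns "label" and "x" and
  // adds a "features" vector column containing (x, x^2).
  def withQuadraticFeatures(df: DataFrame): DataFrame = {
    val assembler = new VectorAssembler()
      .setInputCols(Array("x"))
      .setOutputCol("xVec")
    val expansion = new PolynomialExpansion()
      .setInputCol("xVec")
      .setOutputCol("features")
      .setDegree(2)
    expansion.transform(assembler.transform(df))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("Polynomial Regression")
      .getOrCreate()

    // Assumed layout: no header row, so Spark names the columns _c0, _c1, _c2.
    val raw = spark.read.csv("data2.csv")
      .selectExpr("cast(_c1 as double) as label", "cast(_c2 as double) as x")

    // LinearRegression reads the "features" and "label" columns by default
    // and, unlike LinearRegressionWithSGD, fits an intercept by default.
    val model = new LinearRegression().fit(withQuadraticFeatures(raw))
    println(s"intercept = ${model.intercept}, coefficients = ${model.coefficients}")

    spark.stop()
  }
}
```

The model is still linear in its coefficients; the curvature comes entirely from the extra x^2 feature.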

omrsin

1 Answer

There is nothing unexpected here. You use a linear model of the form

Y = βx + ε

so the fitted result will always be a line going through the origin (unlike, for example, R's lm, LinearRegressionWithSGD doesn't fit an intercept by default), and as long as the model is at least marginally sane it should be increasing to approximate the distribution of the data.

While the details are probably off topic on StackOverflow, you should start by adding more features. It should be obvious that a decent approximation here has to be quadratic, so let's illustrate that step by step. We'll start with a very rough approximation of your data:

y <- c(0.6, 0.6, 0.6, 0.6, 0.575, 0.55, 0.525, 0.475, 0.45, 0.40, 0.35, 0.30)
df <- data.frame(y=c(y, rev(y)), x=0:23)
plot(df$x, df$y)

enter image description here

The model created in Spark is more or less equivalent to:

lm1 <- lm(y ~ x + 0, df)
lines(df$x, predict(lm1, df), col='red')

enter image description here

Since it is clear that a model passing through the origin is not a good fit, let's try adding an intercept:

lm2 <- lm(y ~ x, df)
lines(df$x, predict(lm2, df), col='blue')

enter image description here

Finally, we know we need to add some non-linearity:

df$x2 <- df$x ** 2
lm3 <- lm(y ~ x + x2, df)
lines(df$x, predict(lm3, df), col='green')

enter image description here
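
For readers who prefer Scala to R, the same three fits can be reproduced without Spark. This is only a sketch: the object name ToyFits and the leastSquares helper are hypothetical, and it solves the normal equations directly rather than using the SGD optimizer that LinearRegressionWithSGD relies on.

```scala
object ToyFits {
  // Rough approximation of the data, copied from the R snippet above.
  val half: Vector[Double] =
    Vector(0.6, 0.6, 0.6, 0.6, 0.575, 0.55, 0.525, 0.475, 0.45, 0.40, 0.35, 0.30)
  val ys: Vector[Double] = half ++ half.reverse
  val xs: Vector[Double] = ys.indices.map(_.toDouble).toVector

  // Ordinary least squares via the normal equations (A^T A) w = A^T y,
  // solved by Gauss-Jordan elimination with partial pivoting.
  def leastSquares(rows: Seq[Array[Double]], y: Seq[Double]): Array[Double] = {
    val k = rows.head.length
    // Augmented matrix [A^T A | A^T y]
    val m = Array.tabulate(k, k + 1) { (i, j) =>
      if (j < k) rows.map(r => r(i) * r(j)).sum
      else rows.zip(y).map { case (r, yi) => r(i) * yi }.sum
    }
    for (col <- 0 until k) {
      // Move the row with the largest entry in this column into pivot position.
      val pivot = (col until k).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      // Eliminate this column from every other row.
      for (r <- 0 until k if r != col) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col to k) m(r)(c) -= f * m(col)(c)
      }
    }
    Array.tabulate(k)(i => m(i)(k) / m(i)(i))
  }

  def main(args: Array[String]): Unit = {
    val throughOrigin = leastSquares(xs.map(x => Array(x)), ys)             // y ~ x + 0
    val withIntercept = leastSquares(xs.map(x => Array(1.0, x)), ys)        // y ~ x
    val quadratic     = leastSquares(xs.map(x => Array(1.0, x, x * x)), ys) // y ~ x + x2
    println(s"through origin: ${throughOrigin.mkString(", ")}")
    println(s"with intercept: ${withIntercept.mkString(", ")}")
    println(s"quadratic:      ${quadratic.mkString(", ")}")
  }
}
```

Because the toy data is symmetric around its midpoint, the slope of the intercept model comes out essentially zero (a flat line), while the through-origin model is forced into a positive slope. Only the quadratic model can follow the dip in the middle.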

The takeaway here is:

  • call setIntercept(true) on the LinearRegressionWithSGD instance when creating the model,
  • add some non-linear features to the model, for example:

    val x = parts(2).toDouble
    LabeledPoint(parts(1).toDouble, Vectors.dense(x, x*x))
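
Putting both points together with the RDD-based API could look roughly like the sketch below. The object and helper names are hypothetical, the file name and column positions are taken from the question, and the iteration count and step size are simply the question's values rather than tuned settings.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object QuadraticRegressionSketch {

  // Hypothetical helper: one CSV line becomes a labeled point with features (x, x^2).
  // Column positions follow the question: label in column 1, input in column 2.
  def toQuadraticPoint(line: String): LabeledPoint = {
    val parts = line.split(',')
    val x = parts(2).toDouble
    LabeledPoint(parts(1).toDouble, Vectors.dense(x, x * x))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "Quadratic Regression")
    val parsedData = sc.textFile("data2.csv").map(toQuadraticPoint).cache()

    // setIntercept is a method of the algorithm object, so build an instance
    // instead of calling the static LinearRegressionWithSGD.train(...).
    val algorithm = new LinearRegressionWithSGD()
    algorithm.setIntercept(true)
    algorithm.optimizer
      .setNumIterations(1000)
      .setStepSize(0.001)

    val model = algorithm.run(parsedData)
    println(s"intercept = ${model.intercept}, weights = ${model.weights}")

    sc.stop()
  }
}
```

SGD is sensitive to feature scale, and x and x^2 can differ by orders of magnitude, so the features may need to be standardized or the step size adjusted before this converges to a sensible curve.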
    
zero323
    Thanks @zero323, although we found out the source of the problem you pointed us in the right direction. For this reason I am going to mark your answer as the right one. I will post the real problem on an update to the question. – omrsin Aug 12 '16 at 07:04