Linear regression with Spark MLlib only returns monotonic predictions

Question

Check the update at the bottom of the question

Summary: I have a dataset that does not behave linearly. I am trying to use Spark's MLlib (v1.5.2) to fit a model that behaves more like a polynomial function, but I always get a linear model as a result. I don't know whether it is even possible to obtain a non-linear model using linear regression.

[TL;DR] I am trying to fit a model that represents the following data sufficiently well:

My code is very simple (pretty much like in every tutorial)

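import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegressionTest {

   def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local[2]", "Linear Regression")
      val data = sc.textFile("data2.csv")
      // Column 1 is the label, column 2 is the single input feature.
      val parsedData = data.map { line =>
         val parts = line.split(',')
         LabeledPoint(parts(1).toDouble, Vectors.dense(parts(2).toDouble))
       }.cache()

      val numIterations = 1000
      val stepSize = 0.001

      val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
      sc.stop()
   }
}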

The predictions are in the right range, but they always lie on a monotonically increasing line. I am trying to wrap my head around it, but I cannot figure out why a better curve is not being fitted.

Any tips?

Thanks everyone

Update: The problem was caused by the version of the Spark and spark-ml libraries that we were using. For some reason, version 1.5.2 was not fitting a better curve even though I provided more features (squared or cubic versions of the input data). After upgrading to version 2.0.0 and switching from the deprecated LinearRegressionWithSGD to LinearRegression from the main API (not the RDD API), the algorithm behaved as expected. With this new method the model fitted the right curve.
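For readers who want to try the same switch, here is a minimal sketch of what it might look like with the DataFrame-based API in Spark 2.0. It is not the code from the update, which was not posted. The object name, the column names `label`, `x` and `x2`, and the `featureColumns` value are assumptions made for illustration; the file name and column positions follow the question's code.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object LinearRegressionDataFrameSketch {

  // Hypothetical feature column names: the raw input and its square.
  val featureColumns: Array[String] = Array("x", "x2")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("Linear Regression (DataFrame API)")
      .getOrCreate()

    // Without a header, the CSV reader names the columns _c0, _c1, _c2, ...
    // As in the question, column 1 is the label and column 2 is the input.
    val prepared = spark.read.csv("data2.csv")
      .select(
        col("_c1").cast("double").as("label"),
        col("_c2").cast("double").as("x"))
      .withColumn("x2", col("x") * col("x"))

    // Pack the feature columns into the single vector column the estimator expects.
    val assembler = new VectorAssembler()
      .setInputCols(featureColumns)
      .setOutputCol("features")

    val lr = new LinearRegression()
      .setFitIntercept(true) // already the default, spelled out for clarity
      .setMaxIter(100)

    val model = lr.fit(assembler.transform(prepared))
    println(s"coefficients=${model.coefficients} intercept=${model.intercept}")

    spark.stop()
  }
}
```

Cubic or higher-order terms could be added the same way, by appending further derived columns to `featureColumns`.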

score 4 · Accepted Answer · answered Aug 05 '16 at 17:51

There is nothing unexpected here. You are using a linear model of the form

Y = βx + ε

so the fitted result will always form a line going through the origin (unlike, for example, R, Spark doesn't fit an intercept by default), and as long as the model is at least marginally sane it should be increasing to approximate the distribution of the data.
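To make the monotonicity explicit, it may help to write out the least-squares solution for this intercept-free, single-feature model. This is a worked sketch: the hats denote fitted quantities (notation not used above), and Spark's SGD only approximates this minimiser iteratively.

```latex
\hat{\beta}
  = \arg\min_{\beta} \sum_i (y_i - \beta x_i)^2
  = \frac{\sum_i x_i y_i}{\sum_i x_i^2},
\qquad
\hat{y}(x) = \hat{\beta}\, x,
\qquad
\frac{d\hat{y}}{dx} = \hat{\beta}
```

The slope is the same constant everywhere, so the predictions can only rise or only fall, and the prediction at x = 0 is always 0. When all inputs and labels are positive, the numerator and denominator are both positive, so the fitted slope is positive and the line is increasing.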

While the details are probably off topic on StackOverflow, you should start by adding more features. It should be obvious that a decent approximation here has to be quadratic, so let's illustrate that step by step. We'll start with a very rough approximation of your data:

y <- c(0.6, 0.6, 0.6, 0.6, 0.575, 0.55, 0.525, 0.475, 0.45, 0.40, 0.35, 0.30)
df <- data.frame(y=c(y, rev(y)), x=0:23)
plot(df$x, df$y)

The model created in Spark is more or less equivalent to:

lm1 <- lm(y ~ x + 0, df)
lines(df$x, predict(lm1, df), col='red')

Since it is clear that a model passing through the origin is not a good fit, let's try adding an intercept:

lm2 <- lm(y ~ x, df)
lines(df$x, predict(lm2, df), col='blue')

Finally, we know we need some non-linearity:

df$x2 <- df$x ** 2
lm3 <- lm(y ~ x + x2, df)
lines(df$x, predict(lm3, df), col='green')

The takeaway here is:

call setIntercept(true) on the LinearRegressionWithSGD instance used to train the model,

add some non-linear features to the model:

val x = parts(2).toDouble
LabeledPoint(parts(1).toDouble, Vectors.dense(x, x*x))
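Putting both suggestions together, a minimal sketch against the RDD-based MLlib API (Spark 1.x) could look like the following. The object name and the `quadraticFeatures` helper are hypothetical and introduced only for illustration; the file name, column positions, iteration count and step size are taken from the question.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object QuadraticRegressionSketch {

  // Hypothetical helper: expands one input value into the features (x, x^2).
  def quadraticFeatures(x: Double): Array[Double] = Array(x, x * x)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "Quadratic Regression Sketch")

    // Column 1 is the label, column 2 is the raw input, as in the question.
    val parsedData = sc.textFile("data2.csv").map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(1).toDouble, Vectors.dense(quadraticFeatures(parts(2).toDouble)))
    }.cache()

    // Instantiate the algorithm instead of calling the static train helper,
    // so that the intercept can be enabled before training.
    val algorithm = new LinearRegressionWithSGD()
    algorithm.setIntercept(true)
    algorithm.optimizer
      .setNumIterations(1000)
      .setStepSize(0.001)

    val model = algorithm.run(parsedData)
    println(s"weights=${model.weights} intercept=${model.intercept}")

    sc.stop()
  }
}
```

Gradient descent is sensitive to feature scale, and the squared feature is much larger than the raw one, so the step size and iteration count shown here may need tuning (or the features may need scaling) before the fit converges.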

Thanks @zero323, although we found out the source of the problem you pointed us in the right direction. For this reason I am going to mark your answer as the right one. I will post the real problem on an update to the question. — omrsin, Aug 12 '16 at 07:04
