
I was trying to predict a label for every row in a DataFrame, but without using the LogisticRegressionModel's transform method (for ulterior motives). Instead, I was computing the prediction manually with the classic formula hθ(x) = 1 / (1 + e^(-θᵀx)). Note that I copied the code from Apache Spark's repository, turning almost everything from the private object BLAS into a public version of it. P.S.: I don't use any regParam; I just fitted the model.

// Notice that I had to obtain the intercept and coefficients from my model
val intercept = model.intercept
val coefficients = model.coefficients

val margin: Vector => Double = (features) => {
  BLAS.dot(features, coefficients) + intercept
}

val score: Vector => Double = (features) => {
  val m = margin(features)
  1.0 / (1.0 + math.exp(-m))
}

After defining those functions and obtaining the model's parameters, I created a UDF to compute the prediction (it receives the same features, as a DenseVector). Later I compared my predictions to the real model's, and they are very different! So what did I miss? What am I doing wrong?

val predict = udf((v: DenseVector) => {
  val recency = v(0)
  val frequency = v(1)
  val tp = score(new DenseVector(Array(recency, frequency)))
  new DenseVector(Array(tp, 1 - tp))
})

// model's predictions
val xf = model.transform(df)

df.select(col("id"), predict(col("features")).as("myprediction"))
  .join(xf, df("id") === xf("id"), "inner")
  .select(df("id"), col("probability"), col("myprediction"))
  .show

+----+--------------------+--------------------+
|  id|         probability|        myprediction|
+----+--------------------+--------------------+
|  31|[0.97579780436514...|[0.98855386037790...|
| 231|[0.97579780436514...|[0.98855386037790...|
| 431|[0.69794428333266...|           [1.0,0.0]|
| 631|[0.97579780436514...|[0.98855386037790...|
| 831|[0.97579780436514...|[0.98855386037790...|
|1031|[0.96509616791398...|[0.99917463322937...|
|1231|[0.96509616791398...|[0.99917463322937...|
|1431|[0.96509616791398...|[0.99917463322937...|
|1631|[0.94231815700848...|[0.99999999999999...|
|1831|[0.96509616791398...|[0.99917463322937...|
|2031|[0.96509616791398...|[0.99917463322937...|
|2231|[0.96509616791398...|[0.99917463322937...|
|2431|[0.95353743438055...|           [1.0,0.0]|
|2631|[0.94646924057674...|           [1.0,0.0]|
|2831|[0.96509616791398...|[0.99917463322937...|
|3031|[0.96509616791398...|[0.99917463322937...|
|3231|[0.95971207153567...|[0.99999999999996...|
|3431|[0.96509616791398...|[0.99917463322937...|
|3631|[0.96509616791398...|[0.99917463322937...|
|3831|[0.96509616791398...|[0.99917463322937...|
+----+--------------------+--------------------+

EDIT

I even tried defining those functions inside the UDF, and it didn't work.

def predict(coefficients: Vector, intercept: Double) = {
  udf((v: DenseVector) => {
    def margin(features: Vector, coefficients: Vector, intercept: Double): Double = {
      BLAS.dot(features, coefficients) + intercept
    }

    def score(features: Vector, coefficients: Vector, intercept: Double): Double = {
      val m = margin(features, coefficients, intercept)
      1.0 / (1.0 + math.exp(-m))
    }

    val recency = v(0)
    val frequency = v(1)
    val tp = score(new DenseVector(Array(recency, frequency)), coefficients, intercept)
    new DenseVector(Array(tp, 1 - tp))
  })
}
Alberto Bonsanto
  • At first glance it doesn't look there is anything wrong with the formula and Spark seems to return expected results but your code smells. Since val evaluates when defined and you get coefficients from outer scope it is most likely using something else than you expect. – zero323 May 05 '16 at 11:16
  • @zero323 I computed the dot product using model's intercept and coefficients with some features and they are computed exactly as in my code, indeed the result is exactly the same as in my formula but differs from the spark's result. However, I will define those as functions, and pass other values as arguments, just in case. – Alberto Bonsanto May 05 '16 at 13:12

1 Answer


It's very embarrassing, but the problem was that I used a Pipeline with a MinMaxScaler as a stage, so the dataset was scaled before the model's training. Both parameters, coefficients and intercept, were therefore tied to that scaled data, so when I computed the prediction with them on the raw features, the result was completely biased. To solve this, I simply trained on the unnormalized dataset, so the coefficients and intercept would apply directly to the raw features. After re-executing the code, I got the same results as Spark. I also took @zero323's advice and moved the margin and score definitions inside the udf's lambda.
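To see why the results diverged, here is a minimal plain-Scala sketch (no Spark; all numbers are made up for illustration) of the mismatch: coefficients fitted on min-max scaled features only reproduce the model's probabilities if the same scaling is applied to the raw features before taking the dot product.

```scala
object ScaledPredictionSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Same transform a MinMaxScaler stage applies (default [0, 1] range)
  def minMaxScale(x: Double, min: Double, max: Double): Double =
    (x - min) / (max - min)

  def margin(features: Array[Double], coefficients: Array[Double], intercept: Double): Double =
    features.zip(coefficients).map { case (x, w) => x * w }.sum + intercept

  def main(args: Array[String]): Unit = {
    // Pretend these were fitted by logistic regression on SCALED data
    val coefficients = Array(2.0, -1.5)
    val intercept = 0.3

    // Per-feature min/max statistics the scaler learned on the training set
    val mins = Array(0.0, 1.0)
    val maxs = Array(10.0, 5.0)

    val rawFeatures = Array(4.0, 2.0)

    // Wrong: dot product directly on raw features, as in my original UDF
    val wrongP = sigmoid(margin(rawFeatures, coefficients, intercept))

    // Consistent: scale the features first, then apply the same formula
    val scaled = rawFeatures.zipWithIndex.map { case (x, i) =>
      minMaxScale(x, mins(i), maxs(i))
    }
    val rightP = sigmoid(margin(scaled, coefficients, intercept))

    println(f"raw features: p = $wrongP%.6f, scaled features: p = $rightP%.6f")
  }
}
```

The two probabilities differ substantially even though the coefficients and the sigmoid are identical, which is exactly the bias seen in the table above. Retraining on unnormalized data (as I did) and scaling inside the UDF are both valid fixes, as long as training and manual prediction see the same feature space.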
