
Here is my function that calculates root mean squared error. However, the last line does not compile, failing with a type mismatch error (expected: Double, actual: Unit). I have tried many different ways to solve this issue, but still without success. Any ideas?

  def calculateRMSE(output: DStream[(Double, Double)]): Double = {
    val summse = output.foreachRDD { rdd =>
      rdd.map { case pair: (Double, Double) =>
        val err = math.abs(pair._1 - pair._2)
        err * err
      }.reduce(_ + _)
    }
    // math.sqrt(summse)  HOW TO APPLY SQRT HERE?
  }
  • @Yuval Itzchakov: I want to calculate root mean squared error for the streaming data. Maybe the way I try to approach this task is incorrect. If so, I'd like to know the correct way, assuming the input data is of the type `DStream[(Double,Double)]`. – Klue May 02 '16 at 14:37
  • I'm still not sure what it is you are doing. – eliasah May 02 '16 at 14:45
  • @eliasah: DStream contains pairs of Double, e.g. ((5.0, 5.2), (5.1, 5.15)...) Assume that the first element in the pair is an actual value, while the second element is a predicted value. What I need to do is to calculate an error between actual and predicted values, using Root Mean Squared Error (RMSE) metric. When I get new data in streaming, RMSE should obviously change (i.e. it should be recalculated using my function `calculateRMSE`). Is it impossible to do this? – Klue May 02 '16 at 14:49
  • And what do you want to do with this RMSE? foreachRDD doesn't return any value – eliasah May 02 '16 at 15:02
  • @eliasah: I found the solution using `RegressionMetrics` library – Klue May 02 '16 at 15:08
  • This is basically the same as your earlier question http://stackoverflow.com/questions/36978409/how-to-use-math-sqrt-for-dstreamdouble-double. But as @eliasah says, you're iterating over each RDD, then throwing the result of the calculation away, as foreach doesn't return a value. – The Archetypal Paul May 02 '16 at 15:09

2 Answers


As eliasah pointed out, foreach (and foreachRDD) don't return a value; they are for side effects only. If you want to return something, you need map. Based on your second solution:

val rmse = output.map(rdd => new RegressionMetrics(rdd).rootMeanSquaredError)

It looks better if you make a little function for it:

val getRmse = (rdd: RDD[(Double, Double)]) => new RegressionMetrics(rdd).rootMeanSquaredError

val rmse = output.map(getRmse)

Ignoring empty RDDs,

val rmse = output.filter(_.nonEmpty).map(getRmse)

Here is the exact same sequence as a for-comprehension. It's just syntactic sugar for map, flatMap and filter, but I found it much easier to understand when I was first learning Scala:

val rmse = for {
  rdd <- output
  if (rdd.nonEmpty)
} yield new RegressionMetrics(rdd).rootMeanSquaredError
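
The foreach-vs-map point above can be seen with plain Scala collections, no Spark needed. This is a sketch with made-up sample pairs; foreach is evaluated purely for its side effects and returns Unit, while map produces a value you can keep computing with:

    // Made-up (actual, predicted) pairs for illustration.
    val pairs = Seq((5.0, 5.2), (5.1, 5.15))

    // foreach returns Unit: the computed values are thrown away.
    val fromForeach: Unit = pairs.foreach { case (a, p) => math.abs(a - p) }

    // map returns a new collection, so the result can feed further steps.
    val squaredErrors: Seq[Double] = pairs.map { case (a, p) =>
      val err = math.abs(a - p)
      err * err
    }
    val sumSquaredErrors = squaredErrors.sum

This is exactly why the question's calculateRMSE fails: the body ends with a foreachRDD call, so the function's last expression is Unit, not Double.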

And here's a function summing the errors, like your first attempt:

def calculateRmse(output: DStream[(Double, Double)]): Double = {
  val getRmse = (rdd: RDD[(Double, Double)]) => new RegressionMetrics(rdd).rootMeanSquaredError
  output.filter(_.nonEmpty).map(getRmse).reduce(_ + _)
}

The compiler's complaint about nonEmpty is actually an issue with DStream's filter method. Instead of operating on the RDDs in the DStream, filter is operating on the pairs of doubles (Double, Double) given by your DStream's type parameter.

I don't know enough about Spark to say whether it's a flaw, but it is very strange. filter and most other higher-order collection operations are conventionally defined over the collection's elements, and DStream doesn't follow one consistent convention: its deprecated foreach and its current foreachRDD both operate on the stream's RDDs, while its other higher-order methods operate on the element pairs instead.

So my method won't work. DStream probably has a good reason for being weird (performance-related?). Here's a probably-bad way to do it with foreach:

def calculateRmse(ds: DStream[(Double, Double)]): Double = {
  var totalError: Double = 0

  def getRmse(rdd: RDD[(Double, Double)]): Double =
    new RegressionMetrics(rdd).rootMeanSquaredError

  ds.foreachRDD((rdd: RDD[(Double, Double)]) => if (!rdd.isEmpty) totalError += getRmse(rdd))

  totalError
}

But it works!
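
An alternative without the mutable variable: DStream.transform, unlike filter and map, really does take an RDD => RDD function, so the per-batch RMSE can be exposed as a DStream[Double]. This is an untested sketch, assuming Spark Streaming and MLlib are on the classpath:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.mllib.evaluation.RegressionMetrics

    // transform gives RDD-level access, so RegressionMetrics can be
    // applied to each batch; empty batches yield an empty RDD.
    def rmsePerBatch(output: DStream[(Double, Double)]): DStream[Double] =
      output.transform { rdd: RDD[(Double, Double)] =>
        if (rdd.isEmpty) rdd.sparkContext.emptyRDD[Double]
        else rdd.sparkContext.parallelize(
          Seq(new RegressionMetrics(rdd).rootMeanSquaredError))
      }

The resulting stream can then be printed or stored with the usual output operations.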

  • Thanks. The function `calculateRMSE` does not compile - it says `Cannot resolve symbol rdd, + and nonEmpty`. Do you know why this happens? – Klue May 03 '16 at 08:17
  • In case of `nonEmpty` it also says `Type mismatch, expected ((Double,Double))=>Boolean, actual ((Double,Double))=>Any`. – Klue May 03 '16 at 09:38
  • Whoops. I made a mistake in `getRmse`, parentheses should have been around the left half of the function only. But the problems with `rdd` and `+` are slightly worse. I'll edit my answer. – David Prichard May 04 '16 at 16:15

I managed to do this task as follows:

import org.apache.spark.mllib.evaluation.RegressionMetrics

output.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val metrics = new RegressionMetrics(rdd)
    val rmse = metrics.rootMeanSquaredError
    println("RMSE: " + rmse)
  }
}
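
For reference, rootMeanSquaredError is just the square root of the mean of the squared errors. A plain-Scala sketch of the same arithmetic, using the made-up sample pairs from the comments above:

    // RMSE by hand: sqrt(mean((actual - predicted)^2)).
    val pairs = Seq((5.0, 5.2), (5.1, 5.15)) // (actual, predicted)

    val mse = pairs.map { case (actual, predicted) =>
      val err = actual - predicted
      err * err
    }.sum / pairs.size

    val rmse = math.sqrt(mse)
    // rmse is roughly 0.1458 for these two pairs

RegressionMetrics does the equivalent computation distributed over the RDD's partitions.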
  • Yes, I hope so:) `foreachRDD` does not return any value. I thought I should accumulate MSE over different RDDs and then estimate RMSE. But the approach is different, as shown in the answer. – Klue May 02 '16 at 15:11
  • you are still not doing anything with the RMSE except printing it – eliasah May 02 '16 at 15:12