
So I understand that Spark can perform iterative algorithms on a single RDD, for example logistic regression:

    // "spark" here is the SparkContext; parsePoint turns each input line
    // into a point with fields x (feature vector) and y (label)
    val points = spark.textFile(...).map(parsePoint).cache()
    var w = Vector.random(D) // current separating plane
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

The above example is iterative because it maintains a global state w that is updated after each iteration, and the updated value is used in the next iteration. Is this kind of computation possible in Spark Streaming? Consider the same example, except that points is now a DStream. In that case, you could create a new DStream that computes the gradient with:

    val gradient = points.map(p =>
      (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)

But how would you handle the global state w? It seems like w would have to be a DStream too (using updateStateByKey, maybe), but then its latest value would somehow need to be passed into the points map function, which I don't think is possible. I don't think DStreams can communicate in this way. Am I correct, or is it possible to have iterative computations like this in Spark Streaming?
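
To make this concrete, here is roughly the dead end I am imagining (a sketch only, not working code; the "w" key is a dummy so that updateStateByKey applies, and gradients stands for the gradient DStream above):

    // Sketch only -- this is the dead end I am describing, not working code.
    // The dummy key "w" exists only so updateStateByKey can be applied.
    val wState = gradients.map(g => ("w", g)).updateStateByKey[Vector] {
      (grads: Seq[Vector], prev: Option[Vector]) =>
        Some(grads.foldLeft(prev.getOrElse(Vector.random(D)))(_ - _))
    }
    // ...but there is no apparent way to feed the latest value in wState
    // back into the points.map(...) that produces gradients for the next
    // batch, which is the circularity I am asking about.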

– user1893354

2 Answers


I just found out that this is quite straightforward with the foreachRDD function. MLlib actually provides models that can be trained on DStreams, and I found the answer in the StreamingLinearAlgorithm code. It looks like you can just keep your global update variable locally on the driver and update it inside .foreachRDD, so there is actually no need to turn it into a DStream itself. You can apply this to the example I provided with something like:

    points.foreachRDD { (rdd, time) =>
      val gradient = rdd.map(p =>
        (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
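
For reference, this driver-side update is essentially what MLlib's streaming models do themselves. A minimal sketch using StreamingLinearRegressionWithSGD (assuming trainingData is a DStream[LabeledPoint] and D is the number of features):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

    // trainingData: DStream[LabeledPoint], parsed from the input stream
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(D))

    // trainOn is itself implemented with foreachRDD: each batch runs SGD
    // and the updated weights live in the model object on the driver
    model.trainOn(trainingData)
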
– user1893354
  • It seems like you are using points from MLlib. Do you know of any way to do such iterative programming using plain Spark Streaming? I have a similar requirement, where I would like to keep some global state by processing iteratively across micro-batches in an ordered DStream. – tsar2512 Jun 18 '15 at 10:50

Hmm... you can achieve something by folding over your iterations to update your gradient.

Also... I think you should keep Spark Streaming out of it, as nothing in this problem seems to tie it to any kind of streaming requirement.

    // So, assuming... points is somehow an RDD[Point]
    val points = sc.textFile(...).map(parsePoint).cache()
    val w0 = Vector.random(D)

    // foldLeft is (B)((B, A) => B): B. Fold over the iteration range
    // locally on the driver: each step launches a distributed map/reduce
    // over points. (Note that RDD operations cannot be nested, so the
    // iteration range has to stay a plain Scala collection rather than
    // an sc.parallelize(...)-style RDD.)
    val w = (1 to ITERATIONS).foldLeft(w0) { (acc, _) =>
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y*(acc dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      acc - gradient
    }
– sarveshseri