So I understand that Spark can run iterative algorithms on a single RDD, for example logistic regression:
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
The above example is iterative because it maintains a global state w that is updated after each iteration, and the updated value is used in the next iteration. Is this kind of computation possible in Spark Streaming? Consider the same example, except that points is now a DStream. In this case, you could create a new DStream that calculates the gradient:
val gradient = points.map(p =>
  (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
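Conceptually, what I want is for each batch to apply one gradient step to w on the driver, something like the sketch below. This is only a sketch of the behavior I'm after: foreachRDD is a real DStream output operation, but I doubt the mutated w would ever be visible to the map closure above on later batches, since that closure is captured when the DStream is defined.

// Sketch of the per-batch update I want (not claiming this works):
// gradient carries one reduced value per batch interval.
gradient.foreachRDD { rdd =>
  rdd.collect().foreach(g => w -= g) // at most one element per batch
}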
But how would you handle the global state w? It seems like w would have to be a DStream too, maybe using updateStateByKey.
But then the latest value of w would somehow need to be passed back into the points map function, which I don't think is possible; the state produced this way is itself just another DStream. I don't think DStreams can communicate in this way. Am I correct, or is it possible to have iterative computations like this in Spark Streaming?