
Is it possible to achieve stateful stream processing with the Spark DataFrame API? The first thing I'd like to try is deduplicating the stream. DStream has a mapWithState method, but to work with DataFrames I have to drop down to foreachRDD:

dStream.foreachRDD { rdd =>
  // Each batch arrives as an RDD of JSON strings; parse it into a DataFrame
  val df = spark.read.json(rdd)
  // Need to join with the state somehow
  val unique = deduplicate(df)
  val result = myFancyProcessingMethod(unique)
  publish(result)
}

But now we're in a realm that has no notion of the stream (and therefore of its state), and I'm stuck.

The only solution I can think of is to deduplicate the original DStream and only then convert it to DataFrames. That has several drawbacks: I'd have to parse the JSON twice, the algorithm could have been implemented more efficiently over DataFrames (if it were a one-off task), etc. Is there any other way?
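For what it's worth, the per-key logic that mapWithState would run can be modeled as a pure function outside Spark. This is only a sketch of the deduplication idea, not actual Spark code: the helper names (`dedupeUpdate`, `dedupe`), the use of a `Boolean` "seen" flag as the state, and the assumption that each record carries a key are my own choices for illustration.

```scala
// Pure model of the state-update function mapWithState would call per key:
// the state says whether the key was already seen; emit the record only once.
def dedupeUpdate(seen: Boolean, record: String): (Boolean, Option[String]) =
  if (seen) (true, None) else (true, Some(record))

// Folding a keyed sequence of records through the update function models
// how the stateful stream would drop duplicates across batches.
def dedupe(records: Seq[(String, String)]): Seq[String] =
  records.foldLeft((Map.empty[String, Boolean], Vector.empty[String])) {
    case ((state, out), (key, rec)) =>
      val (newSeen, emitted) = dedupeUpdate(state.getOrElse(key, false), rec)
      (state.updated(key, newSeen), out ++ emitted)
  }._2
```

In the real DStream version the `Map` would be Spark's managed `State`, keyed by whatever field identifies a duplicate.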

lizarisk

0 Answers