
I'm looking for the best way to accumulate the last N messages in a Spark DStream, where the number of messages to retain (N) is something I can specify.

For example, given the following stream, I'd like to retain the last 3 elements:

Iteration  New message  Downstream
1          A            [A]
2          B            [A, B]
3          C            [A, B, C]
4          D            [B, C, D]

So far I'm looking at the following methods on DStream:

  1. updateStateByKey: since all messages share the same key I could do this, but it seems odd that the method needs to know about a key at all (see the sketch after this list).
  2. mapWithState: the Scala API seems too tedious for such a simple thing.
  3. window: doesn't seem to do the job, and it also windows by a time duration rather than by the last number of elements.
  4. Accumulators: I haven't really used them yet (Accumulators in the Spark docs).
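
For reference, here's roughly what I have in mind for option 1, pushing every message under a single dummy key (a hypothetical sketch; ssc is the StreamingContext, stream is the input DStream[String], and the checkpoint path is a placeholder):

val lastN = 3
ssc.checkpoint("/tmp/checkpoint") // updateStateByKey requires checkpointing

val updateFunc = (newValues: Seq[String], state: Option[Seq[String]]) =>
  Some((state.getOrElse(Seq.empty) ++ newValues).takeRight(lastN))

stream
  .map(msg => (0, msg))          // same dummy key for every message
  .updateStateByKey(updateFunc)
  .map(_._2)                     // drop the dummy key downstream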

What's the best solution to achieve this?

user278530

1 Answer


mapWithState is exactly what you need, and it's definitely not too tedious:

import scala.collection.mutable

import org.apache.spark.streaming.State

case class Message(x: String)

def statefulTransformation(key: Int,
                           value: Option[Message],
                           state: State[mutable.MutableList[Message]]): Option[Message] = {
  def updateState(value: Message): Message = {
    // Append the new message, dropping the oldest once we already hold 3.
    val updatedList =
      state
        .getOption()
        .map(list => if (list.size >= 3) list.drop(1) :+ value else list :+ value)
        .getOrElse(mutable.MutableList(value))

    state.update(updatedList)
    value
  }

  value.map(updateState)
}

And now all you need is (note that mapWithState is only defined on key-value DStreams, so dStream here is a DStream[(Int, Message)]):

import org.apache.spark.streaming.StateSpec

val stateSpec = StateSpec.function(statefulTransformation _)
dStream.mapWithState(stateSpec)

Side note - I used mutable.MutableList for its constant-time append.
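
For completeness, here's a minimal sketch of wiring this into a job (the constant key 0, the socket source, and the checkpoint path are all assumptions; mapWithState requires checkpointing to be enabled):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StateSpec, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new SparkConf().setAppName("last-n-messages")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/checkpoint") // required for stateful transformations

// Assumed source: key every raw line with the same constant key
val raw: DStream[String] = ssc.socketTextStream("localhost", 9999)
val keyed: DStream[(Int, Message)] = raw.map(s => (0, Message(s)))

keyed.mapWithState(StateSpec.function(statefulTransformation _)).print()

ssc.start()
ssc.awaitTermination()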

Yuval Itzchakov