I use Spark Streaming to process imported data, which arrives in a DStream. Additionally, I have the classes Creation and Update, each of which wraps a Foo object.

One of the tasks I want to accomplish is change detection, so I join two RDDs: one holds the batch of data currently being processed, the other holds the current state. currentState is empty initially.

val stream: DStream[(Long, Foo)]    // batches keyed by the Foo id
var currentState: RDD[(Long, Foo)]  // keyed the same way; empty initially

val changes = stream
    .transform { batch =>
        batch.leftOuterJoin(currentState).map {
            case (id, (objectNew, Some(objectOld))) => Update(objectNew)
            case (id, (objectNew, None))            => Creation(objectNew)
        }
    }
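Outside Spark, the per-batch semantics of this join can be sketched with plain Scala collections. This is only an illustration of the intent; the `id` field used as the join key and the `value` field are assumptions, not part of my actual model:

```scala
// Plain-Scala sketch of the leftOuterJoin-based change detection above.
// Foo, Creation, and Update mirror the classes from the question;
// the id/value fields are assumed for illustration.
case class Foo(id: Long, value: String)
sealed trait Change { def foo: Foo }
case class Creation(foo: Foo) extends Change
case class Update(foo: Foo) extends Change

def detectChanges(batch: Seq[Foo], currentState: Map[Long, Foo]): Seq[Change] =
  batch.map { objectNew =>
    currentState.get(objectNew.id) match {
      case Some(_) => Update(objectNew)   // key already in the state
      case None    => Creation(objectNew) // key not seen before
    }
  }
```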

currentState = currentState.fullOuterJoin(changes.map(c => (c.foo.id, c))).map {
        case (id, (Some(foo), None))  => (id, foo)        // unchanged entry
        case (id, (_, Some(change))) => (id, change.foo)  // created or updated entry
    }.cache()

Afterwards I filter out the updates:

changes.filterNot(_.isInstanceOf[Update])

I now import the same data twice. Since the state is empty initially, the result set of the first import consists only of Creation objects, while the second import results only in Update objects, so the second result set changes is empty. In this case I notice a massive performance decrease; it works fine if I leave out the filter.
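To make the double-import scenario concrete, here is a minimal, Spark-free sketch of the whole pipeline (again with a hypothetical `id` field as key); the first pass yields only Creations, the second only Updates, so the second filtered set is empty:

```scala
// Spark-free sketch of the double-import scenario. The id/value fields
// are assumed for illustration; the merge mimics the fullOuterJoin step.
case class Foo(id: Long, value: String)
sealed trait Change { def foo: Foo }
case class Creation(foo: Foo) extends Change
case class Update(foo: Foo) extends Change

def processBatch(batch: Seq[Foo], state: Map[Long, Foo]): (Seq[Change], Map[Long, Foo]) = {
  val changes = batch.map { f =>
    if (state.contains(f.id)) Update(f) else Creation(f)
  }
  // fullOuterJoin-style merge: changed keys take the new Foo, the rest stay
  val newState = state ++ changes.map(c => c.foo.id -> c.foo)
  (changes, newState)
}
```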

I can't imagine this is intended behavior, but maybe it is a problem with Spark's computation internals. Can anyone explain why this happens?

  • 2
    Have you tried modelling your problem with the new `mapWithState` stateful transformation? I think it fits your problem space well and it will be much more performant than a leftOuterJoin. – maasg Apr 20 '16 at 17:33
  • @maasg I already tried `mapWithState`. Since I process a huge amount of entries, the state in `mapWithState` increases to the point that the periodical checkpointing takes too much time. [This question is about that problem](https://stackoverflow.com/questions/36042295/spark-streaming-mapwithstate-seems-to-rebuild-complete-state-periodically) I use Indexed-RDD because it supports incremental checkpointing. – Sebastian.Ernst Apr 21 '16 at 07:52
  • 1
    I don't see the part where you actually change the state of the currentState RDD. Could you add it? Given what you described, my only guess is that you forgot to .cache currentState, which would result in higher processing times as your stream adds more operations on top of it. – Michael Kopaniov Apr 21 '16 at 12:19
  • @MichaelKopaniov you are completely right. I have added the relevant code lines to my question. However, I used the `.cache()` method, so the anomaly must have another reason. – Sebastian.Ernst Apr 21 '16 at 14:19

0 Answers