I use Spark Streaming to process imported data. The imported data is stored in a DStream. Further, I have the classes Creation and Update, each of which holds a Foo object.
One of the tasks I want to accomplish is change detection.
To do this, I join two RDDs: one holds the batch of data currently being processed, the other holds the current state. currentState is empty initially.
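Roughly, the classes look like this (a simplified sketch; the id field and the payload are placeholders I introduce here so the RDDs below can be joined by key, not my actual fields):

```scala
// Simplified sketch of the data model. The id field is an assumption
// introduced as a join key; the real Foo may look different.
case class Foo(id: Long, payload: String)

// Creation and Update both wrap a Foo, as described above.
sealed trait Change { def foo: Foo }
case class Creation(foo: Foo) extends Change
case class Update(foo: Foo) extends Change
```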
// The RDDs are keyed by an id so they can be joined;
// currentState must be a var because it is reassigned below.
val stream: DStream[(Long, Foo)]
var currentState: RDD[(Long, Foo)]

val changes = stream.transform { batch =>
  val changes = batch.leftOuterJoin(currentState).mapValues {
    case (objectNew, Some(objectOld)) => Update(objectNew)
    case (objectNew, None)            => Creation(objectNew)
  }
  currentState = currentState.fullOuterJoin(changes).mapValues {
    case (Some(foo), None)    => foo
    case (_, Some(change))    => change.foo
  }
  changes
}.cache()
Afterwards, I filter out the updates:
changes.filter(!_.isInstanceOf[Update])
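To illustrate what this filter does, here is a local sketch on a plain Scala collection (placeholder classes, not the actual streaming code):

```scala
// Local stand-ins for the streaming classes.
sealed trait Change
case class Creation(name: String) extends Change
case class Update(name: String) extends Change

val changes: Seq[Change] = Seq(Creation("a"), Update("b"), Creation("c"))

// Keep everything that is not an Update, i.e. only the creations remain.
val withoutUpdates = changes.filter(!_.isInstanceOf[Update])
// withoutUpdates == Seq(Creation("a"), Creation("c"))
```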
I now import the same data twice. Since the state is empty initially, the result set of the first import consists of Creation objects, while the second import yields only Update objects. Consequently, the second result set, changes, is empty after filtering.
In this case I notice a massive performance decrease. It works fine if I leave out the filter.
I can't imagine this is intended behavior, but maybe it is a problem with Spark's computation internals. Can anyone explain why this happens?