I have had to implement an event-centric windowing batch job with a varying number of event names.
The rule is as follows: every time a given event occurs, we sum the occurrences of all other events within certain time windows around it.
action1 00:01
action2 00:02
action1 00:03
action3 00:04
action3 00:05
For the above dataset, the result for the action2 occurrence at 00:02 should be:
window_before: Map(action1 -> 1)
window_after: Map(action1 -> 1, action3 -> 2)
To achieve this, we use a WindowSpec and a custom UDAF that aggregates all counters into a map. The UDAF is necessary because the set of action names is completely arbitrary.
Of course, at first the UDAF went through Spark's Catalyst converters, which was horrendously slow.
I have now reached what I think is a decent optimum: I maintain parallel arrays of keys and values backed by immutable collections (lower GC pressure, lower iterator overhead), all serialized as binary so that the Scala runtime handles boxing/unboxing instead of Spark, with byte arrays instead of strings for the keys.
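For illustration, here is a minimal sketch of the kind of binary layout I mean; the encode/decode helpers are hypothetical names for this example, not the real code. The point is that Spark only ever sees a single BinaryType column, so no per-entry Catalyst conversion happens:

import java.nio.ByteBuffer

// Pack parallel arrays of key bytes and counts into one blob:
// [numEntries][keyLen key count][keyLen key count]...
def encode(keys: Array[Array[Byte]], values: Array[Long]): Array[Byte] = {
  val buf = ByteBuffer.allocate(4 + keys.map(4 + _.length).sum + 8 * values.length)
  buf.putInt(keys.length)
  keys.indices.foreach { i =>
    buf.putInt(keys(i).length)
    buf.put(keys(i))
    buf.putLong(values(i))
  }
  buf.array()
}

def decode(bytes: Array[Byte]): (Array[Array[Byte]], Array[Long]) = {
  val buf = ByteBuffer.wrap(bytes)
  val n = buf.getInt()
  val keys = Array.ofDim[Array[Byte]](n)
  val values = Array.ofDim[Long](n)
  var i = 0
  while (i < n) {
    val k = Array.ofDim[Byte](buf.getInt()); buf.get(k)
    keys(i) = k
    values(i) = buf.getLong()
    i += 1
  }
  (keys, values)
}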
The problem is the stragglers: this workload cannot be parallelized the way it could when we had a static number of columns and were just summing/counting numeric columns.
I also tried another technique, creating a number of columns equal to the maximum cardinality of events and then aggregating them back into a map, but the number of columns in the projection was simply killing Spark (easily a thousand columns); a sketch of the idea follows.
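Roughly, that variant looked like the following sketch (eventNames, the cnt_ prefix, and over30days are illustrative names; windowSpec is the one defined further down, and spark.implicits._ is assumed in scope):

import org.apache.spark.sql.functions._

// One windowed conditional sum per event name, then fold the columns back into a map.
// eventNames holds the full, arbitrary vocabulary -- easily ~1000 entries in practice.
val eventNames: Seq[String] = Seq("action1", "action2", "action3")
val wide = eventNames.foldLeft(df) { (acc, n) =>
  acc.withColumn(s"cnt_$n", sum(when($"name" === n, $"count").otherwise(0L)).over(windowSpec))
}
val asMap = wide.withColumn(
  "over30days",
  map(eventNames.flatMap(n => Seq(lit(n), col(s"cnt_$n"))): _*)
)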
One of the problems is the huge stragglers: most of the time a single partition (keyed by something like (userid, app)) will take 100 times longer than the median, even though everything is properly repartitioned.
Has anyone else run into a similar problem?
Example WindowSpec:
import org.apache.spark.sql.expressions.Window

val thirtyDaysInSeconds = 30L * 24 * 60 * 60
val windowSpec = Window
  .partitionBy($"id", $"product_id")
  .orderBy($"time".cast("long")) // assuming time is a timestamp; range frames need a numeric ordering column
  .rangeBetween(-thirtyDaysInSeconds, -1)
then
df.withColumn("over30days", myUdaf($"name", $"count").over(windowSpec))
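And, for the window_after side of the example at the top, a mirrored frame would look like this (windowAfter is my name for it, not from the actual job):

val windowAfter = Window
  .partitionBy($"id", $"product_id")
  .orderBy($"time".cast("long"))
  .rangeBetween(1, thirtyDaysInSeconds) // the 30 days strictly after the current row

df.withColumn("window_before", myUdaf($"name", $"count").over(windowSpec))
  .withColumn("window_after", myUdaf($"name", $"count").over(windowAfter))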
A naive version of the UDAF (specialized here to Long counts to keep it readable):
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CountByNameUDAF extends UserDefinedAggregateFunction {
  override def inputSchema = StructType(StructField("name", StringType) :: StructField("count", LongType) :: Nil)
  override def bufferSchema = StructType(StructField("actions", MapType(StringType, LongType)) :: Nil)
  override def dataType = MapType(StringType, LongType)
  override def deterministic = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer.update(0, Map.empty[String, Long])
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val acc = buffer.getAs[Map[String, Long]](0)
    val name = input.getString(0)
    buffer.update(0, acc + (name -> (acc.getOrElse(name, 0L) + input.getLong(1))))
  }
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val m1 = buffer1.getAs[Map[String, Long]](0)
    val m2 = buffer2.getAs[Map[String, Long]](0)
    buffer1.update(0, m2.foldLeft(m1) { case (m, (k, v)) => m + (k -> (m.getOrElse(k, 0L) + v)) })
  }
  override def evaluate(buffer: Row): Map[String, Long] = buffer.getAs[Map[String, Long]](0)
}
My current version is less readable than the naive one above but does effectively the same thing, using two binary arrays to circumvent the Catalyst converters.