
I have had to implement an event-centric windowing batch job with a varying number of event names.

The rule is as follows: every time a given event occurs, we count all the other events that fall within certain time windows around it.

action1 00:01
action2 00:02
action1 00:03
action3 00:04
action3 00:05

For the above dataset, taking action2 at 00:02 as the central event, the result should be:

window_before: Map(action1 -> 1)
window_after: Map(action1 -> 1, action3 -> 2)
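
Roughly, the two result columns come from two window frames over the same partition key. A minimal sketch (not the exact job: the -30 days / +24 hours bounds and the id/product_id keys follow the real schema described in the comments below, and myUdaf is the map-building UDAF shown further down), assuming `time` can be cast to epoch seconds:

import org.apache.spark.sql.expressions.Window
import spark.implicits._   // `spark` is the active SparkSession

val secs = $"time".cast("long")   // range frames need a numeric ordering column

val before = Window.partitionBy($"id", $"product_id").orderBy(secs)
    .rangeBetween(-30L * 24 * 3600, -1)   // the 30 days before each event
val after  = Window.partitionBy($"id", $"product_id").orderBy(secs)
    .rangeBetween(1, 24L * 3600)          // the 24 hours after each event

df.withColumn("window_before", myUdaf($"name", $"count").over(before))
  .withColumn("window_after",  myUdaf($"name", $"count").over(after))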

In order to achieve this, we use a WindowSpec and a custom UDAF that aggregates all the counters into a map. The UDAF is necessary because the set of action names is completely arbitrary.

Of course, at first the UDAF used Spark's Catalyst converters, which was horrendously slow.

Now I've reached what I think is a decent optimum: I maintain an array of keys and an array of values backed by immutable collections (lower GC pressure, lower iterator overhead), serialized as binary so that the Scala runtime handles boxing/unboxing instead of Spark, and using byte arrays instead of strings.
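
To give an idea of the shape of that approach, here is a minimal sketch (not my production code) of the binary-buffer trick: the whole running state lives in a single BinaryType cell that Spark treats as an opaque blob, so Catalyst never converts the map itself. The real version also replaces string keys with byte arrays and flat key/value arrays; the names `BinaryBufferUDAF`, `encode` and `decode` are purely illustrative.

import java.io._
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class BinaryBufferUDAF extends UserDefinedAggregateFunction {

    private def encode(m: Map[String, Long]): Array[Byte] = {
        val bos = new ByteArrayOutputStream()
        val oos = new ObjectOutputStream(bos)
        oos.writeObject(m.toArray)   // Array[(String, Long)] is Java-serializable
        oos.close()
        bos.toByteArray
    }

    private def decode(bytes: Array[Byte]): Map[String, Long] =
        new ObjectInputStream(new ByteArrayInputStream(bytes))
            .readObject().asInstanceOf[Array[(String, Long)]].toMap

    override def inputSchema: StructType =
        StructType(StructField("name", StringType) :: StructField("count", LongType) :: Nil)

    override def bufferSchema: StructType =
        StructType(StructField("state", BinaryType) :: Nil)   // opaque blob, no Catalyst map

    override def dataType: DataType = MapType(StringType, LongType)
    override def deterministic: Boolean = true

    override def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer.update(0, encode(Map.empty))

    override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val state = decode(buffer.getAs[Array[Byte]](0))
        val name  = input.getString(0)
        val cnt   = input.getLong(1)
        buffer.update(0, encode(state + (name -> (state.getOrElse(name, 0L) + cnt))))
    }

    override def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
        val m1 = decode(b1.getAs[Array[Byte]](0))
        val m2 = decode(b2.getAs[Array[Byte]](0))
        b1.update(0, encode(m2.foldLeft(m1) { case (acc, (k, v)) =>
            acc + (k -> (acc.getOrElse(k, 0L) + v))
        }))
    }

    override def evaluate(buffer: Row): Any = decode(buffer.getAs[Array[Byte]](0))
}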

The problem is that some stragglers are very problematic, and the workload cannot be parallelized the way it could when we had a static number of columns and were just summing/counting numeric columns.

I also tried another technique: creating a number of columns equal to the maximum cardinality of events and then aggregating them back into a map, but the sheer number of columns in the projection was simply killing Spark (easily a thousand columns).
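
For reference, that wide-projection attempt looked roughly like the sketch below (a sketch, not the exact code: it assumes `eventNames` is the collected list of distinct names, `windowSpec` is the window shown further down, and `spark` is the active SparkSession):

import org.apache.spark.sql.functions._
import spark.implicits._

// one conditional sum per distinct event name -- this is the projection that blew up
val eventNames: Seq[String] = df.select("name").distinct().as[String].collect().toSeq

val perNameCols = eventNames.map { n =>
    sum(when($"name" === n, $"count").otherwise(0L)).over(windowSpec).as(s"cnt_$n")
}

// fold the per-name columns back into a single map column: map(name1, cnt_1, name2, cnt_2, ...)
val asMap = map(eventNames.flatMap(n => Seq(lit(n), col(s"cnt_$n"))): _*)

val wide   = df.select(df.columns.map(col) ++ perNameCols: _*)
val result = wide.withColumn("over30days", asMap).drop(eventNames.map(n => s"cnt_$n"): _*)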

One of the problems is huge stragglers: most of the time a single partition (keyed by something like user id and app) takes 100 times longer than the median, even though everything is properly repartitioned.

Has anyone else run into a similar problem?

Example WindowSpec:

import org.apache.spark.sql.expressions.Window

// `time` ordered as epoch seconds so the range frame can be expressed in seconds
val windowSpec = Window
    .partitionBy($"id", $"product_id")
    .orderBy($"time".cast("long"))
    .rangeBetween(-30L * 24 * 3600, -1)   // the preceding 30 days, excluding the current row

then

df.withColumn("over30days", myUdaf($"name", $"count").over(windowSpec))

A naive version of the UDAF:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection.schemaFor
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.reflect.runtime.universe.TypeTag

// naive: initialize/merge/evaluate/deterministic omitted for brevity
class UDAF[A: Numeric: TypeTag] extends UserDefinedAggregateFunction {
    private val num = implicitly[Numeric[A]]
    private val dt  = schemaFor[A].dataType

    override def inputSchema: StructType =
        StructType(StructField("name", StringType) :: StructField("count", dt) :: Nil)

    override def bufferSchema: StructType =
        StructType(StructField("actions", MapType(StringType, dt)) :: Nil)

    override def dataType: DataType = MapType(StringType, dt)

    override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val name  = input.getString(0)
        val count = input.getAs[A](1)
        val acc   = buffer.getMap[String, A](0)
        buffer.update(0, acc + (name -> num.plus(acc.getOrElse(name, num.zero), count)))
    }
}

My current version is less readable than the naive version above but effectively does the same thing, using two binary arrays to circumvent the Catalyst converters.

  • Thanks. You said _"One of the problems, is the huge stragglers"_ How do you recognize huge stragglers? How many rows are there for a single straggler? Could this be a problem with a memory? Have you seen https://github.com/apache/spark/commit/94439997d57875838a8283c543f9b44705d3a503 that's part of Spark since 2.3.0? – Jacek Laskowski Sep 05 '18 at 10:08
  • Is your case a 2-level aggregation with window aggregation first (the highest level) and `groupBy` at the lower level? If I'm not mistaken, you want to calculate actions before and after the current action group by name? Correct? – Jacek Laskowski Sep 05 '18 at 10:19
  • @JacekLaskowski Yeah that's exactly it. Of course in order to save some computing time, records are pre-aggregated before windowing when lower than the time window where we are actually considering events that are 'central' (e.g.: you want to know all activity of people who reached action b after originally doing action a within a certain time window, say 24h) – Gepsens Sep 12 '18 at 10:25
  • Window partitions here have high granularity and high variance, they may contain from 10 events to a million. – Gepsens Sep 12 '18 at 10:26
  • The output schema will be something like : id, product_id, window_-30d (a -> 1, b -> 1), window_+24h (c -> 2),... – Gepsens Sep 12 '18 at 10:27
