
I have a data set that lends itself to window functions: 3M+ rows that, once ranked, can be partitioned into groups of roughly 20 or fewer rows. Here is a simplified example:

id    date1    date2    type          rank
171   20090601 20090601 attempt       1
171   20090701 20100331 trial_fail    2
171   20090901 20091101 attempt       3
171   20091101 20100201 attempt       4
171   20091201 20100401 attempt       5
171   20090601 20090601 fail          6
188   20100701 20100715 trial_fail    1
188   20100716 20100730 trial_success 2
188   20100731 20100814 trial_fail    3
188   20100901 20100901 attempt       4
188   20101001 20101001 success       5

The data is ranked by id and date1, and the window is created with:

Window.partitionBy("id").orderBy("rank")

In this example the data has already been ranked by (id, date1). I could also work on the unranked data and rank it within Spark.
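
If I rank within Spark, I believe it would look something like this (just a sketch, assuming the unranked DataFrame is called df and has the columns shown above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumes an unranked DataFrame `df` with columns id, date1, date2, type.
val byDate = Window.partitionBy("id").orderBy("date1")
val ranked = df.withColumn("rank", row_number().over(byDate))

// The window used for the per-id logic, as above:
val w = Window.partitionBy("id").orderBy("rank")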

I need to implement some logic on these rows, for example, within a window:

1) Identify all rows that end during a failed trial (i.e. a row's date2 is between date1 and date2 of any previous row within the same window of type "trial_fail").

2) Identify all trials after a failed trial (i.e. any row with type "trial_fail" or "trial_success" after a row within the same window of type "trial_fail").

3) Identify all attempts before a successful attempt (i.e. any row with type "attempt" with date1 earlier than date1 of another later row of type "success").

The exact logic of these conditions is not important to my question (and there will be other, different conditions); what's important is that the logic depends on values from many rows in the window at once. This can't be handled by the simple Spark SQL window functions like first, last, lag, lead, etc., and it isn't as simple as the typical example of finding the largest/smallest 1 or n rows in the window.
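
To make that concrete, here is condition (1) written out as plain Scala over one window's worth of rows (the case class is just for illustration, and I'm assuming date1/date2 are yyyymmdd strings so ordinary comparisons work):

case class Ev(date1: String, date2: String, `type`: String, rank: Int)

// Condition (1): flag a row if its date2 falls inside the [date1, date2]
// range of ANY earlier trial_fail row in the same window.
def endsDuringFailedTrial(window: Seq[Ev], current: Ev): Boolean =
  window.exists { prev =>
    prev.rank < current.rank &&
      prev.`type` == "trial_fail" &&
      current.date2 >= prev.date1 && current.date2 <= prev.date2
  }

Evaluating this for each row needs the whole window, not a fixed lag/lead offset.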

What's also important is that the partitions don't depend on one another, so this seems like a great candidate for Spark to do in parallel: 3 million rows with 150,000 partitions of about 20 rows each (in fact I wonder if this is too many partitions).

I can implement this with a loop, something like this (in pseudocode):

for i in 1..20:
    for j in 1..20:
        // compare window[j]'s type and dates to window[i]'s etc
        // add a Y/N flag to the DF to identify target rows

This would require 400+ iterations (the choice of 20 for the max i and j is an educated guess based on the data set and could actually be larger), which seems needlessly brute force.
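
For concreteness, condition (1) done this way would look roughly like the following; the outer loop over i disappears because the expression is evaluated once per row, but it still compares against lag() at every offset up to the guessed maximum window size:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("id").orderBy("rank")
val maxOffset = 19  // same educated guess of ~20 rows per window

// Condition (1) as one flag column: OR together a comparison against each
// preceding row, reached through lag() at offsets 1..maxOffset.
val endsDuringTrialFail = (1 to maxOffset)
  .map { k =>
    lag(col("type"), k).over(w) === "trial_fail" &&
      col("date2").between(lag(col("date1"), k).over(w), lag(col("date2"), k).over(w))
  }
  .reduce(_ || _)

val flagged = df.withColumn("ends_during_trial_fail",
  coalesce(endsDuringTrialFail, lit(false)))  // lag() past the window start is null

This avoids a literal double loop, but it still compares every row against every possible preceding offset, which is the brute force I'd like to avoid.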

However, I am at a loss for a better way to implement it. I think this will essentially collect() in the driver, which I suppose might be OK if it is not much data. I have thought of implementing the logic as sub-queries, or by creating a series of sub-DFs, each with a subset or reduction of the data.
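
To illustrate the kind of per-group processing I have in mind, here is an untested sketch using the typed Dataset API (groupByKey/flatMapGroups), which I believe would keep each id's ~20 rows together on the executors rather than collecting to the driver; the case classes and column types are assumptions about my schema:

case class Event(id: Long, date1: String, date2: String, `type`: String, rank: Int)
case class Flag(id: Long, rank: Int, endsDuringTrialFail: Boolean)

import spark.implicits._

val flags = df.as[Event]
  .groupByKey(_.id)
  .flatMapGroups { (id, it) =>
    val rows = it.toSeq.sortBy(_.rank)   // at most ~20 rows per id
    rows.map { r =>
      val hit = rows.exists { p =>
        p.rank < r.rank && p.`type` == "trial_fail" &&
          r.date2 >= p.date1 && r.date2 <= p.date2
      }
      Flag(id, r.rank, hit)
    }
  }

The flags could then be joined back to the original DF on (id, rank), but I don't know whether this approach makes sense at this scale.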

If anyone is aware of any APIs or techniques that I am missing, any info would be appreciated.

Edit: This is somewhat related:

Spark SQL window function with complex condition

Uncle Long Hair
