
I have a problem that should be solved outside of SQL, but due to business constraints needs to be solved within SQL.

  • So, please don't tell me to do this at data ingestion, outside of SQL; I want to, but it's not an option...


I have a stream of events with 4 principal properties....

  • The source device
  • The event's timestamp
  • The event's "type"
  • The event's "payload" (a dreaded VARCHAR representing various data-types)


What I need to do is break the stream up into pieces (that I will refer to as "sessions").

  • Each session is specific to a device (effectively, PARTITION BY device_id)
  • No one session may contain more than one event of the same type


To shorten the examples, I'll limit them to include just the timestamp and the event_type...

 timestamp | event_type | desired_session_id
-----------+------------+--------------------
     0     |     1      |          0
     1     |     4      |          0
     2     |     2      |          0
     3     |     3      |          0

     4     |     2      |          1
     5     |     1      |          1
     6     |     3      |          1
     7     |     4      |          1

     8     |     4      |          2

     9     |     4      |          3
    10     |     1      |          3

    11     |     1      |          4
    12     |     2      |          4

An idealised final output may be to pivot the final results...

device_id | session_id | event_type_1_timestamp | event_type_1_payload |  event_type_2_timestamp | event_type_2_payload ...

(That is not yet set in stone, but I will need to "know" which events make up a session, what their timestamps are, and what their payloads are. It is possible that just appending the session_id column to the input is sufficient, as long as I don't "lose" the other properties.)


There are:

  • 12 discrete event types
  • hundreds of thousands of devices
  • hundreds of thousands of events per device
  • a "norm" of around 6-8 events per "session"
  • but sometimes a session may have just 1 or all 12

These factors mean that half-cartesian products and the like are, umm, less than desirable, but possibly may be "the only way".


I've played (in my head) with analytic functions and gaps-and-islands type processes, but can never quite get there. I always fall back to a place where I "want" some flags that I can carry forward from row to row and reset them as needed...

Pseudo-code that doesn't work in SQL...

flags = [0,0,0,0,0,0,0,0,0,0,0,0]   -- one flag per event type (12 types)
session_id = 0
for each row in stream
   if flags[row.event_type] == 0 then
       flags[row.event_type] = 1
   else
       session_id++
       flags = [0,0,0,0,0,0,0,0,0,0,0,0]
       flags[row.event_type] = 1    -- the colliding event belongs to the new session
   row.session_id = session_id

Any SQL solution to that is appreciated, but you get "bonus points" if you can also take account of events "happening at the same time"...

If multiple events happen at the same timestamp
  If ANY of those events are in the "current" session
    ALL of those events go in to a new session
  Else
    ALL of those events go in to the "current" session

If such a group of events includes the same event type multiple times
  Do whatever you like
  I'll have had enough by that point...
  But set the session as "ambiguous" or "corrupt" with some kind of flag?
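
For reference, here's a minimal plain-Python sketch of the behaviour I'm after (including the same-timestamp rule and a crude "ambiguous" flag), run against the sample rows above. It is only here to pin down the spec, not a proposed solution, and the names are purely illustrative.

from itertools import groupby

# Sample rows from the table above: (timestamp, event_type)
events = [(0, 1), (1, 4), (2, 2), (3, 3),
          (4, 2), (5, 1), (6, 3), (7, 4),
          (8, 4),
          (9, 4), (10, 1),
          (11, 1), (12, 2)]

session_id = 0
seen = set()                    # event types already in the current session
rows = []

# events sharing a timestamp are handled as one group
for ts, group in groupby(events, key=lambda e: e[0]):
    group = list(group)
    types = [t for _, t in group]
    if seen and any(t in seen for t in types):
        session_id += 1         # any collision pushes the whole group into a new session
        seen = set()
    ambiguous = len(types) != len(set(types))   # same type twice at one timestamp
    seen.update(types)
    for ts_, t in group:
        rows.append((ts_, t, session_id, ambiguous))

for r in rows:
    print(r)    # (timestamp, event_type, session_id, ambiguous) matches desired_session_id above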
MatBailie

3 Answers


I'm not 100% sure this can be done in SQL. But I have an idea for an algorithm that might work:

  • enumerate the counts for each event
  • take the maximum count up to each point as the "grouping" for the events (this is the session)

So:

select t.*,
       (max(seqnum) over (partition by device order by timestamp) - 1) as desired_session_id
from (select t.*,
             row_number() over (partition by device, event_type order by timestamp) as seqnum
      from t
     ) t;
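
For what it's worth, here's a quick plain-Python mock-up of that idea against the question's sample (not SQL, just a sanity check); as the comments below point out, it drifts from the desired numbering once a repeat no longer pushes the overall maximum.

# event types in timestamp order, plus the desired session ids, copied from the question
events  = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
desired = [0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4]

counts, running_max, computed = {}, 0, []
for e in events:
    counts[e] = counts.get(e, 0) + 1           # row_number() per event type
    running_max = max(running_max, counts[e])  # max(seqnum) so far
    computed.append(running_max - 1)

print(computed)   # [0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 3, 3]
print(desired)    # differs on the last two rows: the repeated type 1 should open session 4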

EDIT:

This is too long for a comment. I have a sense that this requires a recursive CTE (RBAR). This is because you cannot land at a single row and look at the cumulative information or neighboring information to determine if the row should start a new session.

Of course, there are some situations where it is obvious (say, the previous row has the same event). And, it is also possible that there is some clever method of aggregating the previous data that makes it possible.

EDIT II:

I don't think this is possible without recursive CTEs (RBAR). This isn't quite a mathematical proof, but this is where my intuition comes from.

Imagine you are looking back 4 rows from the current row and you have:

1
2
1
2
1  <-- current row

What is the session for this? It is not determinate. Consider:

event  session            vs        event  session
  1       1                           2       1      <-- row not in look-back
  1       2                           1       1
  2       2                           2       2
  1       3                           1       2
  2       3                           2       3
  1       4                           1       3

The value depends on going further back. Obviously, this example can be extended all the way back to the first event. I don't think there is a way to "aggregate" the earlier values to distinguish between these two cases.
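
To make that concrete, here's a small plain-Python check of the two scenarios above, using the reset-and-carry logic from the question's pseudo-code: both sequences end with the same five events, yet the current row lands in session 4 in one case and session 3 in the other.

def sessions(events):
    seen, sid, out = set(), 1, []
    for e in events:
        if e in seen:        # repeat of a type already in the session
            sid += 1
            seen = set()
        seen.add(e)
        out.append(sid)
    return out

# only the row outside the look-back window differs
print(sessions([1, 1, 2, 1, 2, 1]))   # [1, 2, 2, 3, 3, 4]
print(sessions([2, 1, 2, 1, 2, 1]))   # [1, 1, 2, 2, 3, 3]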

The problem is solvable if you can deterministically say that a given event is the start of a new session. That seems to require complete prior knowledge, at least in some cases. There are obviously cases where this is easy -- such as two events of the same type in a row. I suspect, though, that these are the "minority" of such sequences.

That said, you are not quite stuck with RBAR through the entire table, because you have device_id for parallelization. I'm not sure if your environment can do this, but in BQ or Postgres, I would:

  • Aggregate along each device to create an array of structs with the time and event information.
  • Loop through the arrays once, perhaps using custom code.
  • Reassign the sessions by joining back to the original table or unnesting the logic.
Gordon Linoff
  • Yeah, that's the type of approach I've mentally worked through. It doesn't work. For example ; `1, 2, 1, 3, 3` That second `3` needs to start a new session, but wouldn't with this approach. It's infuriating that this "must" be done in SQL. I might be reduced to RBAR loops *(and waiting forever)*. – MatBailie Oct 30 '18 at 11:30
  • @MatBailie . . . It is a shame that you can't just define sessions as a 30 minute lag (as they often are). – Gordon Linoff Oct 30 '18 at 11:44
  • Some time that works. It falls down when a user makes multiple attempts to log-in in quick succession (each login attempt = new session), and similar cases. My current head-wrenching thought it that *(as I know there are 12 event types)* to use `LAG()` and enumerate every possible eventuality over this and the preceding 11 records. *(`x` being the current event_type, and `1`, `2`, etc, being the lagged values: `CASE WHEN x = 1 THEN true WHEN x = 2 AND {1,2} are distinct THEN true WHEN x = 3 AND {1,2,3} are distinct THEN true ... ELSE false END AS start_of_new_session`)* – MatBailie Oct 30 '18 at 11:50
  • If Redshift had recursive CTEs, that would be nice :( – MatBailie Oct 30 '18 at 12:09
  • I'll have to read this tomorrow at work, but you may have a point that looking back n rows isn't actually sufficient, one may need to look back to the start, returning to the roll-forward RBAR loop based approaches... – MatBailie Oct 30 '18 at 19:24

UPD based on discussion (not checked/tested, rough idea):

WITH
trailing_events as (
    select *, listagg(event_type::varchar,',') over (partition by device_id order by ts rows between 11 preceding and current row) as events
    from tbl
)
,session_flags as (
    select *, f_get_session_flag(events) as session_flag
    from trailing_events
)
SELECT
 *
,sum(session_flag::int) over (partition by device_id order by ts) as session_id
FROM session_flags

where f_get_session_flag is

create or replace function f_get_session_flag(arr varchar(max))
returns boolean
stable as $$
    # arr is the comma-delimited list of trailing event types, oldest first
    stream = [int(x) for x in arr.split(',')]
    flags = [0] * 13                 # indexed by event type 1..12
    is_new_session = False
    for event_type in stream:
        if flags[event_type] == 0:
            flags[event_type] = 1
            is_new_session = False
        else:
            flags = [0] * 13
            flags[event_type] = 1    # the colliding event opens the new session
            is_new_session = True
    return is_new_session
$$ language plpythonu;

prev answer:

The flags can be replicated as the remainder of dividing the running count of each event type by 2:

1 -> 1%2 = 1
2 -> 2%2 = 0
3 -> 3%2 = 1
4 -> 4%2 = 0
5 -> 5%2 = 1
6 -> 6%2 = 0

and concatenated into a bit mask (similar to the flags array in the pseudo-code). The only tricky point is exactly when to reset all flags to zero and start a new session ID, but I could get quite close. If your sample table is called t and has ts and type columns, the script could look like this:

with
-- running count of the events
t1 as (
    select
     *
    ,sum(case when type=1 then 1 else 0 end) over (order by ts) as type_1_cnt
    ,sum(case when type=2 then 1 else 0 end) over (order by ts) as type_2_cnt
    ,sum(case when type=3 then 1 else 0 end) over (order by ts) as type_3_cnt
    ,sum(case when type=4 then 1 else 0 end) over (order by ts) as type_4_cnt
    from t
)
-- mask
,t2 as (
    select
     *
    ,case when type_1_cnt%2=0 then '0' else '1' end ||
     case when type_2_cnt%2=0 then '0' else '1' end ||
     case when type_3_cnt%2=0 then '0' else '1' end ||
     case when type_4_cnt%2=0 then '0' else '1' end as flags
    from t1
)
-- previous row's mask
,t3 as (
    select
     *
    ,lag(flags) over (order by ts) as flags_prev
    from t2
)
-- reset the mask if there is a switch from 1 to 0 at any position
,t4 as (
    select *
    ,case
        when (substring(flags from 1 for 1)='0' and substring(flags_prev from 1 for 1)='1')
        or (substring(flags from 2 for 1)='0' and substring(flags_prev from 2 for 1)='1')
        or (substring(flags from 3 for 1)='0' and substring(flags_prev from 3 for 1)='1')
        or (substring(flags from 4 for 1)='0' and substring(flags_prev from 4 for 1)='1')
        then '0000'
        else flags
     end as flags_override
    from t3
)
-- get the previous value of the reset mask and same event type flag for corner case 
,t5 as (
    select *
    ,lag(flags_override) over (order by ts) as flags_override_prev
    ,type=lag(type) over (order by ts) as same_event_type
    from t4
)
-- a new session starts on a switch from 1 to 0 at any position OR on a repeated event type (which can show up as a switch from 0 to 1)
select
 ts
,type
,sum(case
 when (substring(flags_override from 1 for 1)='0' and substring(flags_override_prev from 1 for 1)='1')
        or (substring(flags_override from 2 for 1)='0' and substring(flags_override_prev from 2 for 1)='1')
        or (substring(flags_override from 3 for 1)='0' and substring(flags_override_prev from 3 for 1)='1')
        or (substring(flags_override from 4 for 1)='0' and substring(flags_override_prev from 4 for 1)='1')
        or same_event_type
        then 1
        else 0 end
 ) over (order by ts) as session_id
from t5
order by ts
;

You can add the necessary partitions and extend this to 12 event types; the code as written is intended to work on the sample table you provided. It's not perfect: if you run the subqueries you'll see that the flags are reset more often than strictly needed, but overall it works, except for the corner case of session ID 2, where a single event of type=4 immediately follows the end of another session that also ended with type=4. For that I added the simple same_event_type lookup and used it as another condition for a new session ID. Hope this will work on a bigger dataset.
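
For orientation, this is roughly what the flags mask built in t1/t2 looks like on the question's sample (a plain-Python mock-up of the parity part only, without the reset logic):

# 4 event types, rows in timestamp order
events = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
counts = {1: 0, 2: 0, 3: 0, 4: 0}
for e in events:
    counts[e] += 1
    mask = ''.join('1' if counts[t] % 2 else '0' for t in (1, 2, 3, 4))
    print(e, mask)   # a position flipping from 1 to 0 is a candidate session boundary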

AlexYes
  • Thanks. I'll have a thorough read of this tomorrow when I'm back in work :) – MatBailie Oct 30 '18 at 19:28
  • Unfortunately ideas such as `reset the mask if there is a switch from 1 to 0 at any position` don't work. Some times the values might line up such that every time you see a 1 you need to start a new session. Conversely, the same scenario could mean the event 2 never causes a new session. Here's a crude example : https://dbfiddle.uk/?rdbms=postgres_8.4&fiddle=2beabdc49a93a4225bd1c3c1eb687699 – MatBailie Oct 31 '18 at 14:18
  • @MatBailie true. Well, if the max N of events per session is 12 and you're on Redshift you can try to store the last 12 types for every row using `listagg` window function and process the given array using Python UDF, returning true/false for starting the new session. Do you think that would work? – AlexYes Oct 31 '18 at 16:14
  • @MatBailie I've updated the answer with this option (roughly what can be done) – AlexYes Oct 31 '18 at 16:27
  • See Gordon's answer about looking back 12 rows. He demonstrated a case by which you can't work out the correct boundaries without looking back to the very start of the stream. My deleted answer was my starting attempt at something similar. It didn't work, for lots of reasons :) – MatBailie Oct 31 '18 at 17:17
  • I copied and pasted your code into the dbfiddle I linked earlier on *(plus the one line to pull the 'expected' column that I added)*. It's still off for similar reasons. https://dbfiddle.uk/?rdbms=postgres_8.4&fiddle=2beabdc49a93a4225bd1c3c1eb687699 – MatBailie Oct 31 '18 at 17:19
  • @MatBailie yeah I understand, I saw your answer. You pasted the first attempt again, please refer to the update on top with a Python UDF. It won't work in dbfiddle because it's specific for Redshift environment which is not available there. As I understand by being limited to SQL you meant you're limited to Redshift and want to process this without offloading the data that is too big. So a Python UDF might work for you. – AlexYes Nov 01 '18 at 13:15
  • Yeah, I suspect the "best" / "least worst" option will be to use ListAgg to make a long string of events and use python (in a UDF) to return which events are the boundaries of the sessions. I -suspect- that I can find natural gaps in the event stream to ensure list is no more than a VARCHAR(MAX) :) – MatBailie Nov 01 '18 at 13:45
  • @MatBailie but if one session can't be longer than 12 events (if you have 12 types and reset on repeating occurrence of a type its impossible to have more than 12 unique events within one session), then why should you make a very long string? – AlexYes Nov 01 '18 at 16:47
  • Gordon's answer explains that. He demonstrates why you always need to look back to the start. Not doing so gives different, incorrect, results. – MatBailie Nov 01 '18 at 21:26

The solution I decided to live with is effectively "don't do it in SQL", by deferring the actual sessionising to a scalar function written in Python.

--
-- The input parameter should be a comma delimited list of identifiers
-- Each identifier should be a "power of 2" value, no lower than 1
-- (1, 2, 4, 8, 16, 32, 64, 128, etc, etc)
--
-- The input '1,2,4,2,1,1,4' will give the output '0001010'
--
CREATE OR REPLACE FUNCTION public.f_indentify_collision_indexes(arr varchar(max))
RETURNS VARCHAR(MAX)
STABLE AS
$$
    stream = map(int, arr.split(','))
    state = 0                       # bitmask of event types seen in the current session
    collisions = []
    for item in stream:
        if (state & item) == (item):
            collisions.append('1')  # collision: this event starts a new session
            state = item            # the new session begins with just this event
        else:
            state |= item
            collisions.append('0')

    return ''.join(collisions)
$$
LANGUAGE plpythonu;

NOTE : I wouldn't use this if there are hundreds of event types ;)


Effectively I pass in a data structure of events in sequence, and the return is a data structure of where the new sessions start.

I chose the actual data structures to make the SQL side of things as simple as I could. (Might not be the best, very open to other ideas.)

INSERT INTO
    sessionised_event_stream
SELECT
    device_id,
    REGEXP_COUNT(
        LEFT(
            public.f_indentify_collision_indexes(
                LISTAGG(event_type_id, ',')
                    WITHIN GROUP (ORDER BY session_event_sequence_id)
                    OVER (PARTITION BY device_id)
            ),
            session_event_sequence_id::INT
        ),
        '1',
        1
    ) + 1
        AS session_login_attempt_id,
    session_event_sequence_id,
    event_timestamp,
    event_type_id,
    event_data
FROM
(
    SELECT
        *,
        ROW_NUMBER()
            OVER (PARTITION BY device_id
                      ORDER BY event_timestamp, event_type_id, event_data)
                AS session_event_sequence_id
    FROM
        event_stream
) AS ordered_events  -- (any alias will do; the derived table needs one)
  1. Assert a deterministic order to the events (in case of events happening at the same time, etc.)
    ROW_NUMBER() OVER (stuff) AS session_event_sequence_id

  2. Create a comma delimited list of event_type_id's
    LISTAGG(event_type_id, ',') => '1,2,4,8,2,1,4,1,4,4,1,1'

  3. Use python to work out the boundaries
    public.f_indentify_collision_indexes('1,2,4,8,2,1,4,1,4,4,1,1') => '000010010101'

  4. For the first event in the sequence, count the number of 1's up to and including the first character in the 'boundaries'. For the second event in the sequence, count the number of 1's up to and including the second character in the boundaries, etc, etc.
    event 01 = 1 => boundaries = '0' => session_id = 0
    event 02 = 2 => boundaries = '00' => session_id = 0
    event 03 = 4 => boundaries = '000' => session_id = 0
    event 04 = 8 => boundaries = '0000' => session_id = 0
    event 05 = 2 => boundaries = '00001' => session_id = 1
    event 06 = 1 => boundaries = '000010' => session_id = 1
    event 07 = 4 => boundaries = '0000100' => session_id = 1
    event 08 = 1 => boundaries = '00001001' => session_id = 2
    event 09 = 4 => boundaries = '000010010' => session_id = 2
    event 10 = 4 => boundaries = '0000100101' => session_id = 3
    event 11 = 1 => boundaries = '00001001010' => session_id = 3
    event 12 = 1 => boundaries = '000010010101' => session_id = 4

    REGEXP_COUNT( LEFT('000010010101', session_event_sequence_id), '1', 1 )
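
For anyone who wants to sanity-check steps 3 and 4 outside Redshift, here's the same logic mocked up in plain Python (the helper name here is just shorthand; the real UDF is the one above):

def identify_collisions(arr):
    # state is a bitmask of the event types already seen in the current session
    state = 0
    out = []
    for item in map(int, arr.split(',')):
        if (state & item) == item:
            out.append('1')     # collision: this event starts a new session
            state = item
        else:
            state |= item
            out.append('0')
    return ''.join(out)

events = '1,2,4,8,2,1,4,1,4,4,1,1'
boundaries = identify_collisions(events)   # '000010010101'

# step 4: session_id for the Nth event = number of '1's in the first N characters
for n, event_type in enumerate(events.split(','), start=1):
    print(n, event_type, boundaries[:n].count('1'))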

The result is something that's not very speedy, but robust and still better than other options I've tried. What it "feels like" is that (perhaps, maybe, I'm not sure, caveat, caveat) if there are 100 items in a stream then LISTAGG() is called once and the Python UDF is called 100 times. I might be wrong. I've seen Redshift do worse things ;)


Pseudo code for what turns out to be a worse option.

Write some SQL that can find "the next session" from any given stream.

Run that SQL once storing the results in a temp table.
=> Now have the first session from every stream

Run it again using the temp table as an input
=> We now also have the second session from every stream

Keep repeating this until the SQL inserts 0 rows in to the temp table
=> We now have all the sessions from every stream

The time taken to calculate each session was relatively low, and was actually dominated by the overhead of making repeated requests to Redshift. It also meant that the dominant factor was "how many sessions are in the longest stream" (in my case, 0.0000001% of the streams were 1000x longer than the average).

The Python version is actually slower in most individual cases, but is not dominated by those annoying outliers. This meant that overall the Python version completed about 10x sooner than the "external loop" version described here. It also used a bucket load more CPU resources in total, but elapsed time is the more important factor right now :)

MatBailie