I have a problem that should be solved outside of SQL, but due to business constraints needs to be solved within SQL.
- So, please don't tell me to do this at data ingestion, outside of SQL, I want to, but it's not an option...
I have a stream of events, with 4 principle properties....
- The source device
- The event's timestamp
- The event's "type"
- The event's "payload" (a dreaded VARCHAR representing various data-types)
What I need to do is break the stream up in to pieces (that I will refer to as "sessions").
- Each session is specific to a device (effectively,
PARTITION BY device_id
) - No one session may contain more than one event of the same type
To shorten the examples, I'll limit them to include just the timestamp and the event_type...
timestamp | event_type desired_session_id
-----------+------------ --------------------
0 | 1 0
1 | 4 0
2 | 2 0
3 | 3 0
4 | 2 1
5 | 1 1
6 | 3 1
7 | 4 1
8 | 4 2
9 | 4 3
10 | 1 3
11 | 1 4
12 | 2 4
An idealised final output may be to pivot the final results...
device_id | session_id | event_type_1_timestamp | event_type_1_payload | event_type_2_timestamp | event_type_2_payload ...
(But that is not yet set in stone, but I will need to "know" which events make up a session, that their timestamps are, and what their payloads are. It is possible that just appending the session_id column to the input is sufficient, as long as I don't "lose" the other properties.)
There are:
- 12 discrete event types
- hundreds of thousands of devices
- hundred of thousands of events per device
- a "norm" of around 6-8 events per "session"
- but sometimes a session may have just 1 or all 12
These factors mean that half-cartesian products and the like are, umm, less than desirable, but possibly may be "the only way".
I've played (in my head) with analytic functions and gaps-and-islands type processes, but can never quite get there. I always fall back to a place where I "want" some flags that I can carry forward from row to row and reset them as needed...
Pseduo-code that doesn't work in SQL...
flags = [0,0,0,0,0,0,0,0,0]
session_id = 0
for each row in stream
if flags[row.event_id] == 0 then
flags[row.event_id] = 1
else
session_id++
flags = [0,0,0,0,0,0,0,0,0]
row.session_id = session_id
Any SQL solution to that is appreciated, but you get "bonus points" if you can also take account of events "happening at the same time"...
If multiple events happen at the same timestamp
If ANY of those events are in the "current" session
ALL of those events go in to a new session
Else
ALL of those events go in to the "current" session
If such a group of event include the same event type multiple times
Do whatever you like
I'll have had enough by that point...
But set the session as "ambiguous" or "corrupt" with some kind of flag?