4

I have a list of activities(A-start to H-end) for certain events. These can occur in any order, any number of times and can restart as well. I need to identify the blocks of activities within an event.

Eg: A B C D E F G H B C D H C D E F H E F G H

It starts only once (A) but ends multiple times

Need to number these activities to identify sets (How many times it ended)

output: 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 3 4 4 4 4 5

This helps me identify that the event ended (H) 5-1 = 4 times

ScaisEdge
  • 131,976
  • 10
  • 91
  • 107
Neha
  • 233
  • 1
  • 3
  • 11
  • What is the database your are working on (MSSQL, Oracle, MYSQL, ...)? – etsa Mar 16 '17 at 10:22
  • Input data you wrote as example are different records or is a single string column? Similarly, the output can be different records or a single string column? – etsa Mar 16 '17 at 10:24
  • You need to have a column that specifies the ordering. SQL tables represent *unordered* sets. There is no inherent ordering of the values. I also assume that the values are actually on separate rows. – Gordon Linoff Mar 16 '17 at 10:24
  • Database : Vertica – Neha Mar 16 '17 at 10:31
  • The activities (A - H) are records and output needs to be corresponding to the records – Neha Mar 16 '17 at 10:32

2 Answers2

2

It looks like you want to count the number of "H"s and "A"s before a given value. This requires having a column that specifies the ordering. Let me assume this column is called id.

Then, you can do this with window functions:

select t.*,
       sum(case when col = 'H' then 1 else 0 end) over (partition by grp order by id) + 1 as output
from (select t.*,
             sum(case when col = 'A' then 1 else 0 end) over (order by id) as grp
      from t
     ) t;

The subquery defines the "activity" groups, by doing a cumulative sum of "A"s. The outer query then defines the "event" groups by doing a cumulative sum of "E"s.

To be honest, I cannot tell if the "H" is part of the preceding value or the next value. If the next value then the query can use a window clause or a slight tweak to the logic:

       (sum(case when col = 'H' then 1 else 0 end) over (partition by grp order by id) +
        (case when col = 'H' then 0 else 1 end)
       ) as output
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
2

If your events are series of events in time - try to play with the MATCH() clause and its dependent functions event_name(), pattern_id() and match_id() .

I just created a timeseries out of your input letters, spaced by one-hour intervals, and applied a MATCH() clause. If the PATTERN pat AS () clause uncannily reminds you of a grep expression, that's the way it works.

Just look at the query's output - and imagine how many interesting things you could do with the pattern_id-s and the match_id-s that you get- grouping by them, for example, in subsequent SELECT-s ...

WITH 
s(tm,event) AS (
          SELECT TIME '00:00:00','A'
UNION ALL SELECT TIME '01:00:00','B'
UNION ALL SELECT TIME '02:00:00','C'
UNION ALL SELECT TIME '03:00:00','D'
UNION ALL SELECT TIME '04:00:00','E'
UNION ALL SELECT TIME '05:00:00','F'
UNION ALL SELECT TIME '06:00:00','G'
UNION ALL SELECT TIME '07:00:00','H'
UNION ALL SELECT TIME '08:00:00','B'
UNION ALL SELECT TIME '09:00:00','C'
UNION ALL SELECT TIME '10:00:00','D'
UNION ALL SELECT TIME '11:00:00','H'
UNION ALL SELECT TIME '12:00:00','C'
UNION ALL SELECT TIME '13:00:00','D'
UNION ALL SELECT TIME '14:00:00','E'
UNION ALL SELECT TIME '15:00:00','F'
UNION ALL SELECT TIME '16:00:00','H'
UNION ALL SELECT TIME '17:00:00','E'
UNION ALL SELECT TIME '18:00:00','F'
UNION ALL SELECT TIME '19:00:00','G'
UNION ALL SELECT TIME '20:00:00','H'
)
SELECT
  *
, event_name()
, pattern_id()
, match_id()
FROM s
MATCH(
  PARTITION BY 1 -- nothing to partition by
  ORDER BY tm
  DEFINE  
    START_ev AS (event='A')
  , any_ev   AS (event NOT IN ('A','H'))
  , END_ev   AS (event='H')
  PATTERN pat AS (start_ev* any_ev+ end_ev)
  ROWS MATCH FIRST EVENT
);

tm      |event|event_name|pattern_id|match_id
00:00:00|A    |START_ev  |         1|       1
01:00:00|B    |any_ev    |         1|       2
02:00:00|C    |any_ev    |         1|       3
03:00:00|D    |any_ev    |         1|       4
04:00:00|E    |any_ev    |         1|       5
05:00:00|F    |any_ev    |         1|       6
06:00:00|G    |any_ev    |         1|       7
07:00:00|H    |END_ev    |         1|       8
08:00:00|B    |any_ev    |         2|       1
09:00:00|C    |any_ev    |         2|       2
10:00:00|D    |any_ev    |         2|       3
11:00:00|H    |END_ev    |         2|       4
12:00:00|C    |any_ev    |         3|       1
13:00:00|D    |any_ev    |         3|       2
14:00:00|E    |any_ev    |         3|       3
15:00:00|F    |any_ev    |         3|       4
16:00:00|H    |END_ev    |         3|       5
17:00:00|E    |any_ev    |         4|       1
18:00:00|F    |any_ev    |         4|       2
19:00:00|G    |any_ev    |         4|       3
20:00:00|H    |END_ev    |         4|       4
marcothesane
  • 6,192
  • 1
  • 11
  • 21
  • I have to admit this is a better solution for Vertica, assuming the `MATCH()` clause is optimized appropriately. – Gordon Linoff Mar 17 '17 at 12:23
  • Hi Gordon - I recently began to use the MATCH() clause in car sensor data to identify suspicious patterns in the measure curves of air mass, temperature and fuel throughput. The plan created is just like any plan including OLAP functions ( LEAD() , LAG(), SUM() OVER() and the like ), and is therefore absolutely within the parameters of an OLAP query. – marcothesane Mar 17 '17 at 14:02
  • . . That sounds like an interesting application. – Gordon Linoff Mar 18 '17 at 01:58