sql, big query: aggregate all entries between two strings in a variable

Question

I have to solve this problem within bigQuery. I have this column in my table:

event            | time
_________________|____________________
start            | 1
end              | 2
random_event_X   | 3
start            | 4 
error_X          | 5 
error_Y          | 6
end              | 7
start            | 8
error_F          | 9
start            | 10
random_event_Y   | 11
error_z          | 12
end              | 13

I would like to, from the end event record everything until start appear and then count it. Everything can happen between start and end and outside of it. If there is an end, there is a start, but if there is a start, there is not necessarily an end.

The desire output would be like:

string_agg                            | count
"start, end"                          |  1
"start, error_X, error_Y, end"        |  1
"start, random_event_Y error_Z, end"  |  1

So everything between each start and end if start has an end. So without the random_event_X at time 3, the start at time 8 or the error_F at time 9.

I was not able to find the solution and have struggle understanding how to approach this problem. Any help or advice is welcome.

score 2 · Accepted Answer · answered Feb 18 '20 at 18:42

Below is for BigQuery Standard SQL

#standardSQL
SELECT agg_events, COUNT(1) cnt 
FROM (
  SELECT STRING_AGG(event ORDER BY time) agg_events, COUNTIF(event IN ('start', 'end')) flag   
  FROM (
    SELECT *, COUNTIF(event = 'start') OVER(PARTITION BY grp1 ORDER BY time) grp2     
    FROM (
      SELECT *, COUNTIF(event = 'end') OVER(ORDER BY time DESC) grp1 
      FROM `project.dataset.table`
    )
  )
  GROUP BY grp1, grp2
)
WHERE flag = 2
GROUP BY agg_events

If to apply to sample data from your question - result is

Row agg_events                          cnt  
1   start,random_event_Y,error_z,end    1    
2   start,error_X,error_Y,end           1    
3   start,end                           1

Gordon Linoff · Answer 2 · 2020-02-18T13:38:41.160

SQL tables represent unordered sets -- this is particularly true in massively parallel, columnar databases such as BigQuery.

So, I have to assume that you have some other column that specifies the ordering. If so, you can use a cumulative sum to identify the groups and then aggregation:

select grp,
       string_agg(event, ',' order by time)
from (select t.*,
             countif(event = 'start') over (order by time) as grp
      from t
     ) t
group by grp
order by min(time);

Note: I would also advise you to use array_agg() instead of string_agg(). Arrays are generally easier to work with than strings.

EDIT:

I see, you only want up to end. In that case, another level of window funtions:

select grp,
       string_agg(event, ',' order by <ordering col>)
from (select t.*,
             max(case when event = 'end' then time end) over (partition by grp) as max_end_time
      from (select t.*,
                   countif(event = 'start') over (order by <ordering col>) as grp
            from t
           ) t
     ) t
where max_end_time is null or time <= max_end_time
group by grp
order by min(<ordering col>);

thank you for answering. There is indeed an ordering column which is an UNIX timestamp, I will edit my problem so it includes it. Concerning the proposed query, it does not record everything between each `start` and `end` of a column. When I apply it, I have an output like `"start, end, event_x, event_A, event_B"` and so on. As precised, a start does not necessarily an end, which makes this problem even harder for me to solve. — Nine, Feb 18 '20 at 12:06

sql, big query: aggregate all entries between two strings in a variable

2 Answers2