-1

I have a table that records user transactions like this (simplified version) in BigQuery

user transaction_date label cost
a 2021-10-31 10:30:00 y1 10
b 2021-10-31 10:30:00 y2 10
c 2021-10-31 10:30:00 y1 10
a 2021-11-31 10:30:00 y1 10
a 2021-12-31 10:30:00 y2 10
b 2021-11-31 10:30:00 y3 10
c 2021-11-31 10:30:00 y1 10
b 2021-12-31 10:30:00 y2 10
c 2021-12-31 10:30:00 y1 10

I am interested in information related to cost and current/next label. How can I use LEAD() to return the next different label as label_next?

i.e

user transaction_date label cost label_next
a 2021-10-31 10:30:00 y1 10 y2
b 2021-10-31 10:30:00 y2 10 y3
c 2021-10-31 10:30:00 y1 10 y3
a 2021-11-31 10:30:00 y1 10 y2
a 2021-12-31 10:30:00 y2 10 y5
b 2021-11-31 10:30:00 y3 10 y2
c 2021-11-31 10:30:00 y1 10 y3
b 2021-12-31 10:30:00 y2 10 null
c 2021-12-31 10:30:00 y3 10 null
a 2021-12-31 18:30:00 y5 10 null

standard LEAD() would return just the next label i.e. for row 1 (user a) would return y1 as the same user is registered with the same label once before being seen with a different label.

I think I have one solution that involves: grouping by user, label and calculate min and max transaction_date then use LEAD() to get label_next per grouping and join that table on the initial table on user and transaction_date inside min and max transaction_date

But is there a way to do it differently?

forpas
  • 160,666
  • 10
  • 38
  • 76
henrycmcjo
  • 93
  • 5

3 Answers3

0

We can use a conditional aggregation approach here:

WITH cte AS (
    SELECT *, COUNT(CASE WHEN LEAD(label) OVER (PARTITION BY user
                                                ORDER BY transaction_date) != label
                         THEN 1 END) AS cnt
    FROM yourTable
),
cte2 AS (
    SELECT DISTINCT user, cnt, label
    FROM cte
)

SELECT t1.user, t1.transaction_date, t1.label, t2.label AS label_next
FROM cte t1
LEFT JOIN cte2 t2
    ON t2.user = t1.user AND
       t2.cnt = t1.cnt + 1
ORDER BY t1.user, t1.transaction_date;

The idea here is to generate a running count, across each record for each user, which increments by 1 every time the label changes. We use this result to generate (in cte2) an intermediate table where, given a count value and user, we can lookup the next different label. The final left join is where this lookup is taking place.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
0

Another simple approach we can take that doesn't involve lead function is to find the next label for a user where the label value is not matching current label and the transaction date is greater than current date.

edit: edited to fetch next label based on date not on min label.

    with temp as (select 'a' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-12-31 10:30:00   ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'a' user, '2021-13-31 20:30:00   ' as transaction_date, 'y5' as label, '10' as cost
union all
select 'a' user, '2021-14-31 20:30:00   ' as transaction_date, 'y0' as label, '10' as cost
union all
select 'b' user, '2021-10-31 10:30:00' as transaction_date, 'y2' as label, '10' as cost
union all
select 'b' user, '2021-11-31 10:30:00' as transaction_date, 'y3' as label, '10' as cost
union all
select 'b' user, '2021-12-31 10:30:00   ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'c' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-12-31 10:30:00   ' as transaction_date, 'y3' as label, '10' as cost
union all
select 'c' user, '2021-12-31 20:30:00   ' as transaction_date, 'y4' as label, '10' as cost
)
select user, transaction_date, label, cost,  (select label from temp b where b.user = a.user and b.transaction_date = (select min(transaction_date) from temp c where a.label <> c.label and a.user = c.user and c.transaction_date > a.transaction_date)
 ) as label_next
from temp a
bilalgazge
  • 38
  • 1
  • 7
  • The problem with your subquery is that the minimum label is not necessarily the next one on the timeline. What _might_ work would be to use a limit subquery ordered by time. – Tim Biegeleisen Oct 31 '22 at 13:23
  • Sorry you are right. I had in mind only the scenarios you mentioned in the question. So, now I have edited the query to fetch the next label based on next upcoming date of changed label, not minimum label itself. Please highlight if any other scenarios are failing – bilalgazge Oct 31 '22 at 13:58
0

Below might be an another option.

SELECT * EXCEPT(labels),
       (SELECT l FROM t.labels WHERE t.label <> l ORDER BY td LIMIT 1) label_next
  FROM (
    SELECT *, ARRAY_AGG(STRUCT(label AS l, transaction_date AS td)) OVER w AS labels
      FROM sample_table
    WINDOW w AS (PARTITION BY user ORDER BY transaction_date DESC)
  ) t;
Query results

enter image description here

Jaytiger
  • 11,626
  • 2
  • 5
  • 15