How to use lead() efficiently and return next different value?

Question

I have a table that records user transactions like this (simplified version) in BigQuery

user	transaction_date	label	cost
a	2021-10-31 10:30:00	y1	10
b	2021-10-31 10:30:00	y2	10
c	2021-10-31 10:30:00	y1	10
a	2021-11-31 10:30:00	y1	10
a	2021-12-31 10:30:00	y2	10
b	2021-11-31 10:30:00	y3	10
c	2021-11-31 10:30:00	y1	10
b	2021-12-31 10:30:00	y2	10
c	2021-12-31 10:30:00	y1	10

I am interested in information related to cost and current/next label. How can I use LEAD() to return the next different label as label_next?

i.e

user	transaction_date	label	cost	label_next
a	2021-10-31 10:30:00	y1	10	y2
b	2021-10-31 10:30:00	y2	10	y3
c	2021-10-31 10:30:00	y1	10	y3
a	2021-11-31 10:30:00	y1	10	y2
a	2021-12-31 10:30:00	y2	10	y5
b	2021-11-31 10:30:00	y3	10	y2
c	2021-11-31 10:30:00	y1	10	y3
b	2021-12-31 10:30:00	y2	10	null
c	2021-12-31 10:30:00	y3	10	null
a	2021-12-31 18:30:00	y5	10	null

standard LEAD() would return just the next label i.e. for row 1 (user a) would return y1 as the same user is registered with the same label once before being seen with a different label.

I think I have one solution that involves: grouping by user, label and calculate min and max transaction_date then use LEAD() to get label_next per grouping and join that table on the initial table on user and transaction_date inside min and max transaction_date

But is there a way to do it differently?

Tag only the database that you use. – forpas Oct 31 '22 at 16:36 — forpas, Oct 31 '22 at 16:36

Tim Biegeleisen · Answer 1 · 2022-10-31T12:11:26.843

We can use a conditional aggregation approach here:

WITH cte AS (
    SELECT *, COUNT(CASE WHEN LEAD(label) OVER (PARTITION BY user
                                                ORDER BY transaction_date) != label
                         THEN 1 END) AS cnt
    FROM yourTable
),
cte2 AS (
    SELECT DISTINCT user, cnt, label
    FROM cte
)

SELECT t1.user, t1.transaction_date, t1.label, t2.label AS label_next
FROM cte t1
LEFT JOIN cte2 t2
    ON t2.user = t1.user AND
       t2.cnt = t1.cnt + 1
ORDER BY t1.user, t1.transaction_date;

The idea here is to generate a running count, across each record for each user, which increments by 1 every time the label changes. We use this result to generate (in cte2) an intermediate table where, given a count value and user, we can lookup the next different label. The final left join is where this lookup is taking place.

bilalgazge · Answer 2 · 2022-10-31T13:56:22.237

Another simple approach we can take that doesn't involve lead function is to find the next label for a user where the label value is not matching current label and the transaction date is greater than current date.

edit: edited to fetch next label based on date not on min label.

    with temp as (select 'a' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-12-31 10:30:00   ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'a' user, '2021-13-31 20:30:00   ' as transaction_date, 'y5' as label, '10' as cost
union all
select 'a' user, '2021-14-31 20:30:00   ' as transaction_date, 'y0' as label, '10' as cost
union all
select 'b' user, '2021-10-31 10:30:00' as transaction_date, 'y2' as label, '10' as cost
union all
select 'b' user, '2021-11-31 10:30:00' as transaction_date, 'y3' as label, '10' as cost
union all
select 'b' user, '2021-12-31 10:30:00   ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'c' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-12-31 10:30:00   ' as transaction_date, 'y3' as label, '10' as cost
union all
select 'c' user, '2021-12-31 20:30:00   ' as transaction_date, 'y4' as label, '10' as cost
)
select user, transaction_date, label, cost,  (select label from temp b where b.user = a.user and b.transaction_date = (select min(transaction_date) from temp c where a.label <> c.label and a.user = c.user and c.transaction_date > a.transaction_date)
 ) as label_next
from temp a

The problem with your subquery is that the minimum label is not necessarily the next one on the timeline. What _might_ work would be to use a limit subquery ordered by time. — Tim Biegeleisen, Oct 31 '22 at 13:23
Sorry you are right. I had in mind only the scenarios you mentioned in the question. So, now I have edited the query to fetch the next label based on next upcoming date of changed label, not minimum label itself. Please highlight if any other scenarios are failing — bilalgazge, Oct 31 '22 at 13:58

score 0 · Answer 3 · answered Oct 31 '22 at 16:07

Below might be an another option.

SELECT * EXCEPT(labels),
       (SELECT l FROM t.labels WHERE t.label <> l ORDER BY td LIMIT 1) label_next
  FROM (
    SELECT *, ARRAY_AGG(STRUCT(label AS l, transaction_date AS td)) OVER w AS labels
      FROM sample_table
    WINDOW w AS (PARTITION BY user ORDER BY transaction_date DESC)
  ) t;

How to use lead() efficiently and return next different value?

3 Answers3

Query results