1

I have a table of data about a user's flight booking patterns on a website. Let's assume the following data is all the historical data I have about my user.

The session_date is the day that the user came onto the website and searched a specific route, while the flight_date is the departure date of the flight. I have ordered the table by the session_date. The outcome is recorded in booked.

+---------+--------------+----------------+--------------+-------------+--------+
| user_id | session_date | departure_code | arrival_code | flight_date | booked |
+---------+--------------+----------------+--------------+-------------+--------+
| user1   | 7 Jan        | CA             | MY           | 8 Mar       |      1 |
| user1   | 8 Jan        | US             | MY           | 18 May      |      0 |
| user1   | 8 Jan        | US             | MY           | 18 May      |      1 |
| user1   | 8 Jan        | CA             | MY           | 19 Mar      |      0 |
| user1   | 9 Jan        | US             | MY           | 18 May      |      1 |
+---------+--------------+----------------+--------------+-------------+--------+

I would like to output a new column in my table, called previous_flight_date. The new column will state with each search, the previous booked flight_date for that specific route. Even if a user has searched that same route multiple times but never booked, the value in this column will be empty.


+-------+--------------+----------------+--------------+-------------+--------+----------------------+
|  _id  | session_date | departure_code | arrival_code | flight_date | booked | previous_flight_date |
+-------+--------------+----------------+--------------+-------------+--------+----------------------+
| user1 | 7 Jan        | CA             | SG           | 8 Mar       |      1 | null                 |
| user1 | 8 Jan        | US             | MY           | 18 May      |      0 | null                 |
| user1 | 8 Jan        | US             | MY           | 18 May      |      1 | null                 |
| user1 | 8 Jan        | CA             | SG           | 19 Mar      |      0 | 8 Mar                |
| user1 | 2 Feb        | US             | MY           | 2 Jul       |      1 | 18 May               |
+-------+--------------+----------------+--------------+-------------+--------+----------------------+

so for example, the column will be null until the 4th row which reflects '8 Mar', as the user had booked a flight from CA-->SG, to depart on that day.

I have tried using LAST_VALUE but it didn't work. I also do not know how I can use LAG() when I have multiple different types of routes, and I want to find previous rows on a condition. Would be great if a solution was suggested! thank you.

GMB
  • 216,147
  • 25
  • 84
  • 135
awks987
  • 45
  • 6

4 Answers4

5

I think you can do this with first_value(). The trick is to put a condition within the window function, turn on the ignore nulls option, and then use a window frame specifications that looks back on the previous rows having the same departure/arrival, not including the current row:

select
    t.*,
    first_value(case when booked = 1 then flight_date end ignore nulls) over(
        partition by departure_code, arrival code
        order by flight_date desc
        rows between unbounded preceding and 1 preceding
    ) previous_flight_date
from mytable t

Actually a window max() would work too (and then, no need for ignore nulls):

select
    t.*,
    max(case when booked = 1 then flight_date end) over(
        partition by departure_code, arrival code
        order by flight_date desc
        rows between unbounded preceding and 1 preceding
    ) previous_flight_date
from mytable t
GMB
  • 216,147
  • 25
  • 84
  • 135
2

I started off going with your suggestion to use LAG, but then found it to be rather difficult to phrase a query. For one approach which does not use analytic functions, we can try just using a correlated subquery to identify the most recent booked flight date on the same route.

SELECT user_id, session_date, departure_code, arrival_code, flight_date, booked,
       (SELECT t2.flight_date FROM yourTable t2
        WHERE t2.departure_code = t1.departure_code AND
              t2.arrival_code = t1.arrival_code AND
              t2.booked = 1 AND
              t2.flight_date < t1.flight_date
        ORDER BY t2.flight_date DESC LIMIT 1) AS previous_flight_date
FROM yourTable t1
ORDER BY flight_date;

Demo

Demo shown for MariaDB, but the same query should actually run on BigQuery without any issues.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • How could I then select the flight_date on the condition that the user had booked? Sorry if I didn't understand your solution immediately – awks987 Jul 28 '20 at 07:27
  • @awks987 I gave up on `LAG`. Check my edit for another way to go here (admittedly not the most performant). – Tim Biegeleisen Jul 28 '20 at 07:42
  • ok thank you! unfortunately BigQuery does not seem to support correlated queries so I might have to find a way to rewrite it into a JOIN. would that be possible? – awks987 Jul 28 '20 at 08:05
  • Bummer, because you can't really rewrite this using regular joins. The only other option I can think of would be `LAG`, but I don't know how to do it. I will leave this answer up to maybe help someone else give you the solution you need here. – Tim Biegeleisen Jul 28 '20 at 08:06
  • ah shucks. ok I'll leave this to fate then. thanks for your input! – awks987 Jul 28 '20 at 08:08
  • see my answer for BigQuery :o) – Mikhail Berlyant Jul 28 '20 at 08:35
  • @TimBiegeleisen: the problem is with the inequality join, that BQ dodes not support... – GMB Jul 28 '20 at 08:36
  • @TimBiegeleisen: I am afraid this BQ limitation rules out the correlated subquery solution... Window functions seem like the most relevant option here. Or maybe work something out with arrays? I don't have a BQ fiddle, which makes it more difficult to test around! – GMB Jul 28 '20 at 08:41
  • 1
    I have tried @GMB 's answer and it works great. thanks everyone for the help! – awks987 Jul 28 '20 at 10:36
2

The following is a SQL Server based solution using windowing functions. Big Query solution should be similar as windowing functions are standard

SELECT
    *
    , Previous_Flight_Date = MAX(CASE booked = 1 THEN flight_date ELSE NULL END ) 
                             OVER (
                                    PARTITION BY user_id, departure_code, arrival_code
                                    ORDER BY flight_date
                                    ROWS UNBOUNDED PRECEDING AND 1 PRECEDING
                             )
FROM historicTable t
K4M
  • 1,030
  • 3
  • 11
1

Below is for BigQuery Standard SQL

#standardSQL
SELECT user_id, session_date, departure_code, arrival_code, flight_date, booked,
  MAX(IF(booked = 1, flight_date, NULL)) OVER(previous_flights) AS previous_flight_date
FROM `project.dataset.table` 
WINDOW previous_flights AS (
  PARTITION BY user_id, departure_code, arrival_code 
  ORDER BY flight_date 
  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)

If to apply to sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'user1' AS user_id, DATE '2020-01-07' AS session_date, 'CA' AS departure_code, 'SG' AS arrival_code, DATE '2020-03-08' AS flight_date, 1 AS booked UNION ALL
  SELECT 'user1', '2020-01-08', 'US', 'MY', '2020-05-18', 0 UNION ALL
  SELECT 'user1', '2020-01-08', 'US', 'MY', '2020-05-18', 1 UNION ALL
  SELECT 'user1', '2020-01-08', 'CA', 'SG', '2020-03-19', 0 UNION ALL
  SELECT 'user1', '2020-02-09', 'US', 'MY', '2020-07-02', 1
)
SELECT user_id, session_date, departure_code, arrival_code, flight_date, booked,
  MAX(IF(booked = 1, flight_date, NULL)) OVER(previous_flights) AS previous_flight_date
FROM `project.dataset.table` 
WINDOW previous_flights AS (
  PARTITION BY user_id, departure_code, arrival_code 
  ORDER BY flight_date 
  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
-- ORDER BY flight_date

output is

Row user_id session_date    departure_code  arrival_code    flight_date booked  previous_flight_date     
1   user1   2020-01-07      CA              SG              2020-03-08  1       null     
2   user1   2020-01-08      CA              SG              2020-03-19  0       2020-03-08   
3   user1   2020-01-08      US              MY              2020-05-18  0       null     
4   user1   2020-01-08      US              MY              2020-05-18  1       null     
5   user1   2020-02-09      US              MY              2020-07-02  1       2020-05-18   
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230