1

I have a BigQuery table as below:

date  hits_eventInfo_Category  hits_eventInfo_Action  session_id  user_id  hits_time  hits_eventInfo_Label

20151021  Air  Search  1445001  A232  1952  City1
20151021  Air  Select  1445001  A232  2300  Vendor1
20151021  Air  Search  1445001  A111  1000  City2
20151021  Air  Search  1445001  A111  1900  City3
20151021  Air  Select  1445001  A111  7380  Vendor2
20151021  Air  Search  1445001  A580  1000  City4
20151021  Air  Search  1445001  A580  1900  City5
20151021  Air  Search  1445001  A580  1900  City6
20151021  Air  Select  1445001  A580  7380  Vendor3

The table shows user activity for 3 users - A232, A111 and A580 such that:

i) A232 - Made 1 Search at 'City1' and chose 'Vendor1' from 'City1'
ii) A111 - Made the 1st search at 'City2' and did not choose any vendor from there. Made a 2nd search at 'City3' and then ultimately chose a 'Vendor2' from here.
iii) A580 - 1st search at 'City4', no vendor chosen. 2nd search at 'City5', no vendor chosen. 3rd search at 'City6', 'Vendor3' chosen from City6.

I am interested in only retrieving the city from which the user actually chose a vendor, that is, not interested in previous searches made by the user that did not result into choosing a vendor.

Required output table:

date  hits_eventInfo_Category  hits_eventInfo_Action  session_id  user_id  hits_time  city  vendor

20151021  Air  Search  1445001  A232  1952  City1  Vendor1
20151021  Air  Search  1445001  A111  1900  City3  Vendor2
20151021  Air  Search  1445001  A580  1900  City6  Vendor3

I have been trying to do this using the LAG function over the hits_eventInfo_eventLabel field after partitioning on user_id and ordering by hits_time i.e LAG(hits_eventInfo_eventLabel,1) OVER( PARTITION BY user_id ORDER BY hits_time)

However, since I am using my lag offset as 1, the above expression helps me get the desired output for user A232 only (as he made only 1 search which means the previous record from before selecting the vendor is for sure a search record).

Is there a way I can make this lag expression more dynamic such that it retrieves only the immediate location searched upon before making a selection - irrespective of how many searches were made before a selection was made?

OR

Is there an alternative function/route that I can take to achieve this?

activelearner
  • 7,055
  • 20
  • 53
  • 94

1 Answers1

1
select 
  date, 
  hits_eventInfo_Category, 
  hits_eventInfo_Action, 
  session_id, 
  user_id, 
  hits_time, 
  prev as city, 
  hits_eventInfo_Label as vendor
from (
  select *, 
    lag(hits_eventInfo_Label, 1) over(partition by user_id order by hits_time) as prev
  from dataset.table
)
where hits_eventInfo_Action = 'Select'
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230