
This must be simple, but it's got me stumped in BQ.

Simple log file:

mysql> select * from wow;
+------+---------------+------------+
| id   | string        | epoch      |
+------+---------------+------------+
| id3  | a new user    | 1400783480 |
| id1  | a log event 1 | 1400783492 |
| id1  | a log event 2 | 1400783499 |  
| id1  | a log event 3 | 1400783503 |
| id2  | a log event 1 | 1400783510 |
| id2  | a log event 2 | 1400783516 |
+------+---------------+------------+

I want to do the equivalent in BQ to find the last event for each id:

mysql> select * from (select * from wow as B order by epoch desc) as A group by id;
+------+---------------+------------+
| id   | string        | epoch      |
+------+---------------+------------+
| id1  | a log event 3 | 1400783503 |
| id2  | a log event 2 | 1400783516 |
| id3  | a new user    | 1400783480 |
+------+---------------+------------+
3 rows in set (0.00 sec)

Any tips/helps much appreciated.

  • possible duplicate of [Google BigQuery: How do I get a distinct row for a value in query results](http://stackoverflow.com/questions/20110710/google-bigquery-how-do-i-get-a-distinct-row-for-a-value-in-query-results) – Pentium10 May 23 '14 at 07:35

2 Answers


The simplest way is probably to do a query with a GROUP BY and MAX to find the (id, epoch) pairs you want, and then join it back to the original table:

SELECT t.*
FROM 
    (SELECT id, MAX(epoch) AS max_epoch FROM [tmp.so1] GROUP BY id) AS keys 
  JOIN 
    [tmp.so1] AS t 
  ON keys.id = t.id AND keys.max_epoch = t.epoch 
ORDER BY t.id

For your data, this indeed gives back:

+------+---------------+------------+
| t_id |     t_msg     |  t_epoch   |
+------+---------------+------------+
| id1  | a log event 3 | 1400783503 |
| id2  | a log event 2 | 1400783516 |
| id3  | a new user    | 1400783480 |
+------+---------------+------------+

Three notes:

  • I used the handy t.* shorthand for the results, but you can map the columns through explicitly if you want the original names (see the sketch after these notes).
  • If your table grows large, you'll probably want to switch to a JOIN EACH.
  • If you have multiple entries with the maximum epoch for a given id, you'll get duplicate rows.
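For example, a sketch combining both notes (the explicit column aliases and the JOIN EACH are illustrative, against the same [tmp.so1] table):

SELECT t.id AS id, t.string AS string, t.epoch AS epoch
FROM
    (SELECT id, MAX(epoch) AS max_epoch FROM [tmp.so1] GROUP BY id) AS keys
  JOIN EACH
    [tmp.so1] AS t
  ON keys.id = t.id AND keys.max_epoch = t.epoch
ORDER BY t.id
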
Craig Citro

I'd rather avoid Joins, as there are more elegant ways to do this:

  1. If your data set is not too big, use ROW_NUMBER():

    SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY epoch DESC) AS rnb FROM t) WHERE rnb = 1

  2. For larger data sets, you can combine MAX(epoch) with the log string to find the last log entry per user (see the sketch below the query):

    SELECT id, SUBSTR(MAX(CONCAT(STRING(epoch), string)), 11) FROM t GROUP EACH BY id
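
A sketch of the same trick that also returns the timestamp (the last_epoch/last_string aliases are just illustrative names; column names follow the question's table):

    -- The lexicographic MAX of the concatenation matches the numeric MAX of epoch
    -- because 10-digit Unix timestamps are all the same width.
    SELECT
      id,
      MAX(epoch) AS last_epoch,
      SUBSTR(MAX(CONCAT(STRING(epoch), string)), 11) AS last_string
    FROM t
    GROUP EACH BY id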

Hope this helps.

N.N.