I have a table in Amazon Athena (a Presto engine) of some user events that among others have userId (VARCHAR) and ts (timestamp, BIGINT) column. The table is quite big, some hundreds of millions of records. I would like to create a query that orders the events by the ts column. I quickly figured out that I can not use a global ORDER BY ts
because this would mean that all ts values should be stored in the memory of a single worker node and it causes an out-of-memory type error.
The point is that actually I do not need to sort these events globally, it would be enough if they were sorted for a single userid. I was trying to use also a compound query where the outer query gets all unique userid and the inner query use a WHERE userid = current_userid
clause but as I have around 50k distinct userid this query runs way too much time. (Same thing for JOIN
ing together the results of the query as described here). I am looking for some sql construct that uses sorts the rows only partially, something like ORDER BY ts OVER (PARTITION BY userid)
(this clause is not valid).