Optimizing Hive GROUP BY when rows are sorted

Question

I have the following (very simple) Hive query:

select user_id, event_id, min(time) as start, max(time) as end,
       count(*) as total, count(interaction == 1) as clicks
from events_all
group by user_id, event_id;

The table has the following structure:

user_id                 event_id                time            interaction 
Ex833Lli36nxTvGTA1Dv    juCUv6EnkVundBHSBzQevw  1430481530295   0
Ex833Lli36nxTvGTA1Dv    juCUv6EnkVundBHSBzQevw  1430481530295   1
n0w4uQhOuXymj5jLaCMQ    G+Oj6J9Q1nI1tuosq2ZM/g  1430512179696   0
n0w4uQhOuXymj5jLaCMQ    G+Oj6J9Q1nI1tuosq2ZM/g  1430512217124   0
n0w4uQhOuXymj5jLaCMQ    mqf38Xd6CAQtuvuKc5NlWQ  1430512179696   1

I know for a fact that rows are sorted first by user_id and then by event_id.

The question is: is there a way to "hint" the Hive engine to optimize the query given that rows are sorted? The purpose of optimization is to avoid keeping all groups in memory since its only necessary to keep one group at a time.

Right now this query running in a 6-node 16 GB Hadoop cluster with roughly 300 GB of data takes about 30 minutes and uses most of the RAM, choking the system. I know that each group will be small, no more than 100 rows per (user_id, event_id) tuple, so I think an optimized execution will probably have a very small memory footprint and also be faster (since there is no need to loopup group keys).

As a side-note, the `count(interaction == 1)` is not working as I expected, counting only rows that have a 1, but instead returning the same as `count(*)`. — Alejandro Piad, Dec 20 '16 at 17:30
Yep. Aggregative functions, COUNT included, ignore (only) NULL values and FALSE is not a NULL — David דודו Markovitz, Dec 20 '16 at 17:38

score 1 · Accepted Answer · answered Dec 20 '16 at 18:40

1

Create a bucketed sorted table. The optimizer will know it sorted from metadata. See example here (official docs): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables

Count only interaction = 1: count(case when interaction=1 then 1 end) as clicks - case will mark all rows with 1 or null and count only 1s.

answered Dec 20 '16 at 18:40

leftjoin

36,950
8
57
116

Thanks @leftjoin. A few things: first, my table is an external table, is there a way to make it work? Second, since it is an external table, the format is fixed (tab separated values, \n terminated rows), and it doesn't have different terminators for groups, etc. If it's not much to ask, could you provide an example using the specific structure I posted? – Alejandro Piad Dec 20 '16 at 21:22
@Alejandro Piad Read also this: http://grokbase.com/t/hive/user/133xgs10cb/bucketing-external-tables – leftjoin Dec 21 '16 at 10:52
I'm sorry but it seems you will have to create bucketed table and insert overwrite it, it will not work if you create external table over your existing text files. and it will take a lot of time to move data. – leftjoin Dec 21 '16 at 11:00
Yes @leftjoin, everything I've read points to that. I'm accepting this as the right answer. – Alejandro Piad Dec 21 '16 at 11:32

Optimizing Hive GROUP BY when rows are sorted

1 Answers1

Linked