Why is count(distinct) slower than group by in Hive?

Question

On Hive, I believe count(distinct) will be more likely than group-by to result in an unbalanced workload to reducers and end up with one sad reducer grinding away. Example query below.

Why?

Example query:

select count(distinct user)
from some_table

Version with group-by (proposed as faster):

select count(*) from
(select user
 from some_table
 group by user) q

Note: slide 26 of this presentation describes the problem.

I don't understand your question. Are you asking why the `group by` version is faster? If yes, then why do you believe it is faster? You read it somewhere or you saw it behave in that way? — Hari Menon, Oct 11 '13 at 06:21

score 21 · Answer 1 · answered Oct 11 '13 at 09:20

select count(distinct user)
from some_table;

This query does the count on the map side. Each mapper emits one value, the count. Then all values have to be aggregated to produce the total count, and that is the job of one single reducer.

select count(*) from
(select user
 from some_table
 group by user) q;

This query has two stages. On stage 1 the GROUP BY aggregates the users on the map side and emits one value for each user. The output has to be aggregated then on the reduce side, but it can use many reducers. On stage 2 the the COUNT is performed, on the map side, and then the final result is aggregated using one single reducer.

So if you have a very large number of map side splits then the first query will have to aggregate a very large number of one value results. The second query can use many reducers at the reduce side of stage 1 and then, at stage 2, will have a smaller task for the lone reducer at the end.

This would normally not be an optimization. You would have to have a significant number of map splits for the query 1 reducer to become a problem. The second query has two stages and that alone would be slower than query 1 (stage 2 cannot start until stage 1 is completely done). So, while I can see some reasoning for the advice you got, I would be skeptical unless proper measurement is done and shows improvement.

in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE.. — Harper Koo, Mar 28 '16 at 10:00
upvote your answer. An example for large table is : if there is very few distinct values , then the first option may run faster than the second since most group by is done in map side — Keith, Apr 22 '17 at 04:00

Why is count(distinct) slower than group by in Hive?

1 Answers1