
I've read a question on SO:

I ran into a Hive query calculating a count distinct without grouping, which runs very slowly. So I was wondering how this functionality is implemented in Hive. Is there a UDAFCountDistinct for this?

And the answer:

To achieve count distinct, Hive relies on the GenericUDAFCount. There is no UDAF specifically implemented for count distinct. The 'distinct by' keys will be part of the partitioning key of the MapReduce shuffle phase; this way they are 'distincted' quite naturally.

As per your case, it runs slowly because there will be only one reducer to process massive detailed data. You can use a group by before counting to get more parallelism:

select count(1) from (select id from tbl group by id) tmp;

However I don't understand a few things:

  1. What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
  2. Why will there be only one reducer in this case?
  3. Why does the weird inner query cause more partitions?
Alon

1 Answer


I'll try to explain.

Part 1:

What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it? The UDAF GenericUDAFCount is capable of both count and count distinct. How does it achieve count distinct?

Let's take the following query as an example:

select category, count(distinct brand) from market group by category;

One MapReduce Job will be launched for this query.

The distinct-by keys are the expressions (columns) inside count(distinct ...); in this case, brand.

The partition-by keys are the fields used to compute a hash code for each record at the map phase. This hash value then decides which partition a record goes to. Usually, the partition-by keys come from the group by part of the query; in this case, it's category.

The actual output key of the mappers is the composition of the partition-by key and the distinct-by key. For the above case, a mapper's output key might look like (drink, Pepsi).
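A minimal Python sketch of this idea, with all names and the partitioning function being illustrative assumptions rather than Hive's actual internals. The output key is the (partition-by, distinct-by) pair, but only the partition-by part feeds the hash:

```python
# Sketch: mapper output keys and reducer partitioning for
#   select category, count(distinct brand) from market group by category
# Illustrative only -- not Hive's real shuffle code.

NUM_REDUCERS = 2

rows = [
    ("drink", "Pepsi"), ("drink", "Coke"), ("drink", "Pepsi"),
    ("snack", "Lays"), ("snack", "Lays"),
]

def map_output(row):
    category, brand = row
    # Output key is (partition-by key, distinct-by key); the value part
    # of the map output does not matter for this aggregation.
    return (category, brand)

def partition(key, num_reducers):
    category, _brand = key
    # Only the partition-by key (category) feeds the hash, so every
    # record of a given category goes to the same reducer, regardless
    # of its brand.
    return hash(category) % num_reducers

partitions = {}
for row in rows:
    key = map_output(row)
    partitions.setdefault(partition(key, NUM_REDUCERS), []).append(key)

# Every (category, *) key for one category lands in a single partition.
```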

This design makes all rows with the same group-by key fall into the same reducer.

The value part of the mappers' output doesn't matter here.

Later, during the shuffle phase, records are sorted according to the sort-by keys, which here are the same as the output key.

Then at the reduce phase, within each individual reducer, all records are sorted first by category and then by brand. This makes it easy to compute the count(distinct) aggregation: each distinct (category, brand) pair is guaranteed to be processed only once, so the aggregation turns into a count(*) per group. The input key of each call to the reduce method is one of these distinct pairs. The reducer keeps track of the composite key; whenever the category part changes, we know a new group has arrived and start counting that group from 1.
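The reduce-side pass described above can be sketched as a single scan over the sorted keys. This is an illustrative reconstruction, not Hive's actual reducer code; because duplicates are adjacent after sorting, comparing each key with the previous one is enough:

```python
# Sketch of the reduce-side pass (illustrative, not Hive's real code).
# Input keys arrive sorted by (category, brand), so duplicate pairs
# are always adjacent.

sorted_keys = [
    ("drink", "Coke"), ("drink", "Coke"), ("drink", "Pepsi"),
    ("snack", "Lays"),
]

def count_distinct_per_group(keys):
    counts = {}
    prev = None
    for key in keys:
        if key != prev:              # a new distinct (category, brand) pair
            category = key[0]        # when category changes, a new group
            counts[category] = counts.get(category, 0) + 1
        prev = key
    return counts

result = count_distinct_per_group(sorted_keys)
# result == {"drink": 2, "snack": 1}
```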

Part 2:

Why will there be only one reducer in this case? When calculating count distinct without group by, like this:

    select count(distinct brand) from market

There will be just one reducer taking all the work. Why? Because the partition-by key doesn't exist; or, put differently, all records have the same hash code, so they all fall into the same reducer.
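Sketched in the same style as before (again an illustrative assumption, not Hive's internals): with no group by, the partition-by key is empty, so the hash is constant and every record targets the same reducer, no matter how many reducers exist:

```python
# Sketch: without group by, the partition-by key is empty, so the
# hash is the same for every row and one reducer gets all the work.
# Illustrative only.

NUM_REDUCERS = 4
brands = ["Pepsi", "Coke", "Pepsi", "Lays", "Fanta"]

def partition(_record, num_reducers):
    # No partition-by key: hash the empty key, which is constant,
    # regardless of the record's contents.
    return hash(()) % num_reducers

targets = {partition(b, NUM_REDUCERS) for b in brands}
# targets contains exactly one partition index: a single reducer
# processes every record.
```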

Part 3:

Why does the weird inner query cause more partitions?

The inner query's partition-by key is the group by key, id. The id values have a good chance of being evenly distributed, so the records are processed by many different reducers. After the inner query, it's safe to conclude that all the ids are distinct from each other, so a simple count(1) is all that's needed.
But note that the outer query still runs with only one reducer. Why doesn't it suffer the same problem? Because count(1) needs no detailed values, map-side aggregation hugely cuts down the amount of data reaching that reducer.
One more thing: this rewrite is not guaranteed to perform better, since it introduces an extra MR stage.
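The two-stage rewrite can be sketched end to end; every name here is illustrative, and Hive's real plan has much more machinery. Stage 1 deduplicates ids in parallel across reducers; stage 2's map-side pre-aggregation means the single final reducer only sums a handful of partial counts:

```python
# Sketch of: select count(1) from (select id from tbl group by id) tmp
# Illustrative only -- not Hive's actual execution plan.

ids = [3, 1, 2, 3, 1, 4, 2, 5]
NUM_REDUCERS = 3

# Stage 1: group by id spreads records over many reducers by hash(id).
# Each reducer deduplicates its own ids, so deduplication is parallel.
stage1_parts = {}
for i in ids:
    stage1_parts.setdefault(hash(i) % NUM_REDUCERS, set()).add(i)
distinct_ids = [i for part in stage1_parts.values() for i in part]

# Stage 2: each mapper pre-aggregates its slice into a partial count
# (map-side aggregation), so the single final reducer only sums a few
# small numbers instead of scanning every detailed row.
partials = [len(part) for part in stage1_parts.values()]
total = sum(partials)
# total is the number of distinct ids: 5
```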

damientseng