I'll try to explain.
Part 1:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
The UDAF `GenericUDAFCount` is capable of both `count` and `count distinct`. How does it work to achieve `count distinct`?
Let's take the following query as an example:
select category, count(distinct brand) from market group by category;
One MapReduce Job will be launched for this query.
`distinct-by` keys are the expressions (columns) within `count(distinct ...)`, in this case `brand`.
`partition-by` keys are the fields used to compute a hash code for a record at the map phase. This hash value then decides which partition a record goes to. Usually, the `partition-by` keys come from the `group by` part of a SQL query. In this case, it's `category`.
The actual output key of the mappers will be the composition of the `partition-by` key and a `distinct-by` key. For the above case, a mapper's output key may look like `(drink, Pepsi)`. This design makes all rows with the same `group-by` key fall into the same reducer.
The value part of the mappers' output doesn't matter here.
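To make the key layout concrete, here is a minimal plain-Java sketch of the map side as described above. It is not Hive's actual code, and the `Row`, `CompositeKey`, and `partitionFor` names are invented for illustration: the mapper emits a composite key of the `partition-by` part plus the `distinct-by` part, while the partitioner hashes only the `partition-by` part.

```java
import java.util.List;

public class MapSideSketch {
    // Hypothetical record type for the "market" table in the example query.
    record Row(String category, String brand) {}

    // Composite map output key: (partition-by key, distinct-by key).
    record CompositeKey(String category, String brand) {}

    // The partitioner hashes only the partition-by (group-by) part,
    // so all rows of the same category land on the same reducer.
    static int partitionFor(CompositeKey key, int numReducers) {
        return (key.category().hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("drink", "Pepsi"),
                new Row("drink", "Coke"),
                new Row("drink", "Pepsi"),
                new Row("snack", "Lays"));
        int numReducers = 2;
        for (Row r : rows) {
            CompositeKey key = new CompositeKey(r.category(), r.brand());
            // The value part does not matter; only the key is significant.
            System.out.printf("emit key=%s -> reducer %d%n",
                    key, partitionFor(key, numReducers));
        }
    }
}
```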
Later, at the shuffle phase, records are sorted according to the `sort-by` keys, which here are the same as the output keys.
Then at the reduce phase, within each individual reducer, all records are sorted first by `category`, then by `brand`. This makes it easy to get the result of the `count(distinct)` aggregation: each distinct `(category, brand)` pair is guaranteed to be processed only once, so the aggregation has been turned into a `count(*)` within each group. The input key of a call to the `reduce` method is one of these distinct pairs. The reducer keeps track of the composite key; whenever the `category` part changes, we know a new group has started, and we begin counting that group from 1.
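Below is a minimal plain-Java sketch of that counting logic, under the simplifying assumption that the composite keys arrive already sorted and deduplicated per `(category, brand)` pair. The names and sample data are made up, and this is not Hive's real reducer code.

```java
import java.util.List;

public class ReduceSideSketch {
    record CompositeKey(String category, String brand) {}

    public static void main(String[] args) {
        // Composite keys as they would arrive at one reducer:
        // sorted by category, then by brand, and each distinct
        // (category, brand) pair appears exactly once.
        List<CompositeKey> sortedKeys = List.of(
                new CompositeKey("drink", "Coke"),
                new CompositeKey("drink", "Pepsi"),
                new CompositeKey("snack", "Doritos"),
                new CompositeKey("snack", "Lays"),
                new CompositeKey("snack", "Pringles"));

        String currentCategory = null;
        long count = 0;
        for (CompositeKey key : sortedKeys) {
            if (!key.category().equals(currentCategory)) {
                // Category changed: emit the finished group, restart from 1.
                if (currentCategory != null) {
                    System.out.println(currentCategory + "\t" + count);
                }
                currentCategory = key.category();
                count = 1;
            } else {
                // Same category, new distinct brand: count(*) over distinct pairs.
                count++;
            }
        }
        if (currentCategory != null) {
            System.out.println(currentCategory + "\t" + count);
        }
    }
}
```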
Part 2:
Why will there be only one reducer in this case?
When calculating `count distinct` without `group by`, like this:
select count(distinct brand) from market
There will be just one reducer taking all the work. Why? Because the `partition-by` key doesn't exist, or, to put it another way, all records have the same hash code, so they all fall into the same reducer.
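A tiny sketch of that hashing argument, again in plain Java rather than anything Hive-specific: with no `partition-by` key, every record effectively carries the same (empty) key, so every record hashes to the same reducer.

```java
public class SingleReducerSketch {
    public static void main(String[] args) {
        int numReducers = 4;
        String[] brands = {"Pepsi", "Coke", "Lays", "Doritos"};
        for (String brand : brands) {
            // No group-by key: the partition-by part is a constant (here, "").
            String partitionByKey = "";
            int partition = (partitionByKey.hashCode() & Integer.MAX_VALUE) % numReducers;
            System.out.printf("brand=%s -> reducer %d%n", brand, partition);
        }
        // Every record lands on the same reducer, so one reducer does all the work.
    }
}
```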
Part 3:
Why will the weird inner query cause more partitions?
The inner query's `partition-by` key is its `group by` key, `id` (the rewrite being discussed is roughly of the shape `select count(1) from (select id from tbl group by id) t`). There's a good chance that `id` values are fairly evenly distributed, so the records are processed by many different reducers. After the inner query runs, it's safe to conclude that all the `id` values are different from each other, so a simple `count(1)` is all that's needed.
But do note that the outer query will still launch only one reducer. Why doesn't it suffer? Because no detailed values are needed for `count(1)`, and map-side aggregation hugely cuts down the amount of data the reducer has to process.
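The following plain-Java sketch (hypothetical data and names, not Hive's implementation) illustrates why that single reducer is not a bottleneck: with map-side aggregation, each mapper emits just one partial count, so the lone reducer only sums a handful of numbers instead of scanning all the rows.

```java
import java.util.List;

public class MapSideAggregationSketch {
    public static void main(String[] args) {
        // Each inner list stands for the rows one mapper sees after the
        // inner query has already produced distinct id values.
        List<List<Long>> mapperInputs = List.of(
                List.of(1L, 2L, 3L),
                List.of(4L, 5L),
                List.of(6L, 7L, 8L, 9L));

        // Map side: each mapper emits just one partial count(1).
        List<Long> partialCounts = mapperInputs.stream()
                .map(rows -> (long) rows.size())
                .toList();

        // Reduce side: the single reducer only sums a few partial counts,
        // not the detailed rows, so one reducer is not a bottleneck here.
        long total = partialCounts.stream().mapToLong(Long::longValue).sum();
        System.out.println("count = " + total);
    }
}
```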
One more thing: this rewriting is not guaranteed to perform better, since it introduces an extra MR stage.