create external table if not exists my_table
(customer_id STRING, ip_id STRING)
location 'ip_b_class';

And then:

hive> set mapred.reduce.tasks=50;
hive> select count(distinct customer_id) from my_table;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1

There's 160GB in there, and with 1 reducer it takes a long time...

[ihadanny@lvshdc2en0011 ~]$ hdu 
Found 8 items
162808042208   hdfs://horton/ip_b_class

...

ihadanny
    This has already been addressed in the question [here](http://stackoverflow.com/questions/8762064/hive-unable-to-manually-set-number-of-reducers?rq=1). The top answer is correct. The query you are running when run as a single MapReduce job can only use one reducer (essentially you are aggregating all 160GB down to a single number output by the one reducer, what could you expect the other reducers to output?). You could imagine reducers counting chunks and then adding their outputs together, but this would require two map reduce jobs. Rewriting the query can make Hive run the two job version. – Daniel Koverman Apr 25 '13 at 15:26
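
For reference, the rewrite the comment alludes to usually looks like this (a sketch reusing the table and column names from the question):

hive> set mapred.reduce.tasks=50;
hive> select count(*) from (select distinct customer_id from my_table) t;

The inner distinct runs as the first MapReduce job and can use all 50 reducers to deduplicate customer_id; the second job's single reducer then only sums partial counts instead of seeing all 160GB of raw rows.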

2 Answers


Logically you cannot have more than one reducer here. Unless all the customer IDs from the individual map tasks come to one place, distinctness cannot be established and a single count cannot be produced. In other words, unless you gather all the customer IDs together in one place, you cannot tell which of them are distinct and then count them.

Rags

The original answer and explanation provided by @Rags is correct. The attached link gives you a good workaround by rewriting your query. If you don't want to rewrite your query, I would suggest providing more memory to the reducer using this option:

set mapreduce.reduce.java.opts=-Xmx8000m

That option sets the maximum memory used by the reducer to 8 GB. If you have more, you can specify a higher value here. Hope this helps.
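
One caveat (this assumes the cluster runs YARN/MRv2; the property name below is the standard Hadoop one and the values are only illustrative): the JVM heap has to fit inside the reducer's container, so raising -Xmx usually goes together with raising the container size:

set mapreduce.reduce.memory.mb=10240;
set mapreduce.reduce.java.opts=-Xmx8000m;

Otherwise YARN may kill the task for exceeding its memory allocation.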

rp1