
I have the following hive query:

select count(distinct id) as total from mytable;

which automatically spawns:
1408 Mappers
1 Reducer

I need to manually set the number of reducers and I have tried the following:

set mapred.reduce.tasks=50;
set hive.exec.reducers.max=50;

but neither of these settings seems to be honored. The query takes forever to run. Is there a way to set the number of reducers manually, or to rewrite the query so that it uses more reducers? Thanks!

– magicalo

4 Answers


Writing a query in Hive like this:

 SELECT COUNT(DISTINCT id) ....

will always result in only one reducer: the distinct values must all be gathered in one place for the final count, so Hive funnels everything through a single reducer. You should:

  1. use this command to set the desired number of reducers:

    set mapred.reduce.tasks=50;

  2. rewrite the query as follows:

SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;

This will result in 2 map+reduce jobs instead of one, but the performance gain will be substantial.
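
For example, both steps combined in one Hive session (the table and column names are taken from the question; 50 is simply the value the asker tried, not a tuned number):

    set mapred.reduce.tasks=50;

    -- the inner DISTINCT is spread across the 50 reducers of the first job;
    -- the outer COUNT(*) then runs as a second, cheap job
    SELECT COUNT(*) AS total
    FROM (
      SELECT DISTINCT id
      FROM mytable
    ) t;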

– wlk
  • Cool. How come the Hive compiler doesn't do this optimization (turning it into 2 MR jobs) by itself automatically? – ihadanny Apr 26 '13 at 21:22
  • There are situations where turning this into 2 MR jobs isn't an optimization. For instance, if id is already close to unique and the table is stored in a columnar file format (like RCFILE), then 1 MR job would certainly be better. Since situations like that aren't outlandish, I imagine that's why no one has built this optimization into Hive. – Daniel Koverman May 16 '13 at 19:59

The number of reducers also depends on the size of the input.

By default, Hive allocates one reducer per 1 GB (1000000000 bytes) of input. You can change that by setting the property hive.exec.reducers.bytes.per.reducer:

  1. either by changing hive-site.xml

    <property>
       <name>hive.exec.reducers.bytes.per.reducer</name>
       <value>1000000</value>
    </property>
    
  2. or using set

    $ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"
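
Hive's estimate is roughly: number of reducers = input size / hive.exec.reducers.bytes.per.reducer, capped by hive.exec.reducers.max. A minimal sketch of the effect (the 10 GB input size is hypothetical): with the default 1 GB per reducer, a 10 GB scan gets about 10 reducers; lowering the property to 100 MB raises that to about 100. Note that this estimate only matters once the query is written so the distinct work can be parallelized, since a bare COUNT(DISTINCT ...) still forces a single reducer:

    -- hypothetical session; 100000000 bytes = 100 MB per reducer
    set hive.exec.reducers.bytes.per.reducer=100000000;
    SELECT COUNT(*) FROM (SELECT DISTINCT id FROM mytable) t;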

– user1314742

The number of mappers depends entirely on the number of input splits, which is driven by the sizes of the input files. A split is nothing but a logical slice of the data. Example: my file size is 150 MB and my HDFS default block size is 128 MB, so the file occupies two blocks and yields two splits; two mappers are assigned to this job.

Important note: suppose I instead specify a split size of 50 MB; then three mappers start for the same file, because the mapper count always follows the number of splits.

Important note: if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with roughly 82,000 maps (10 TB / 128 MB ≈ 81,920 splits), unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.

Note: if no split size is specified, the default HDFS block size is used as the split size.

A reducer has 3 primary phases: shuffle, sort, and reduce.

Commands:

1] Set map tasks: -D mapred.map.tasks=4
2] Set reduce tasks: -D mapred.reduce.tasks=2
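
For a standalone MapReduce job, those flags are passed as generic options on the command line, for example (the jar, driver class, and paths below are placeholders, and this assumes the driver uses ToolRunner so that -D options are actually parsed):

    hadoop jar myjob.jar com.example.MyDriver \
      -D mapred.map.tasks=4 \
      -D mapred.reduce.tasks=2 \
      /input/path /output/path

Note that mapred.map.tasks is only a hint (the split count wins, as described above), while mapred.reduce.tasks is honored as given.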

– Viraj Wadate

You could cap the number of reduce tasks run concurrently on each node in the conf/mapred-site.xml config file. See here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html.

In particular, you need to set this property:

mapred.tasktracker.reduce.tasks.maximum
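
In mapred-site.xml that looks like the snippet below (the value 4 is only an illustrative per-node cap, not a recommendation). Keep in mind this is a cluster-wide, per-TaskTracker ceiling on concurrent reduce tasks, not a per-job setting:

    <property>
       <name>mapred.tasktracker.reduce.tasks.maximum</name>
       <value>4</value>
    </property>
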
– Tudor
  • that is applicable to all jobs. If you want to set it for a specific query, I think it is better to use `set mapred.reduce.tasks` – brain storm Aug 06 '14 at 17:32