
Suppose `A join B on A.a=B.a`, and both are big tables. Hive will process this join through a common join. The execution graph (given by Facebook): (image: common-join execution graph)

But I'm confused by this graph: is there only one reducer?

In my understanding, the map output key is table_name_tag_prefix + join_key, but the partition phase still uses only the join_key to partition records. In the reduce phase, each reducer reads the <join_key, value> pairs that share the same join key, so a reducer need not read all map outputs.
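The mechanism described above can be sketched in a few lines of Python. This is not Hive source code, just a simulation of the idea: the partitioner hashes only the join key, so rows for the same key from both tables land on the same reducer, while the table tag (`0`/`1` here, a hypothetical encoding) only influences sort order within a reducer's input.

```python
NUM_REDUCERS = 4  # hypothetical reducer count for the sketch

def partition(join_key, num_reducers=NUM_REDUCERS):
    # Default Hadoop-style partitioning: hash of the key, modulo #reducers.
    # Note the tag is NOT part of the hash input.
    return hash(join_key) % num_reducers

# Tagged map output: (join_key, table_tag) -> value
map_output = [
    (("k1", 0), "row from A"),
    (("k1", 1), "row from B"),
    (("k2", 0), "another A row"),
]

for (join_key, tag), value in map_output:
    print(f"key={join_key} tag={tag} -> reducer {partition(join_key)}")
```

Both `k1` records go to the same reducer regardless of which table they came from, which is exactly what makes the reduce-side join possible.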

xizi

2 Answers


The number of reducers is determined by hive.exec.reducers.bytes.per.reducer (default 1 GB):
for each GB of input data to the mappers, you get one reducer.
Hive then applies a hash() function to the join columns and takes the result modulo the number of reducers determined above.

So if you load 10 GB of data (both tables together), there should be ~10 reducers.
Now let's say we join by column ID, and assume the following outputs:
hash(101)=101 -> 101%10=1
hash(102)=102 -> 102%10=2
hash(1001)=1001 -> 1001%10=1

So the rows with ID 101 or 1001 will go to reducer #1, and rows with ID 102 will go to reducer #2. You still have 10 reducers, but if the data contains only the IDs above, then 8 reducers will get no input and 2 reducers will get everything.
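The example above can be reproduced with a short Python sketch. Conveniently, Python's `hash()` of a small integer is the integer itself, which matches the assumed `hash(101)=101` etc.; Hive's actual hash function differs, but the modulo logic is the same.

```python
from collections import defaultdict

num_reducers = 10
ids = [101, 102, 1001]  # the example IDs from the answer

assignments = defaultdict(list)
for i in ids:
    assignments[hash(i) % num_reducers].append(i)

print(dict(assignments))  # {1: [101, 1001], 2: [102]}
```

Reducers 0 and 3 through 9 receive nothing, illustrating the skew described above.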

dimamah

In theory, both situations are possible: there may be just one reducer, or more than one. The exact number of reducers used depends on the details of the query.

You can try to set the number of reducers with the following in your script:

set mapred.reduce.tasks=50

Whether this actually leads to a performance improvement depends on the query you are executing. For more detail, see also this answer.
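For completeness, a minimal, hypothetical Hive session showing both ways to influence the reducer count (table names `A` and `B` are the ones from the question; exact property names vary between Hive/Hadoop versions):

```sql
-- Option 1: let Hive derive the count from input size (here 1 GB per reducer)
set hive.exec.reducers.bytes.per.reducer=1000000000;

-- Option 2: force a fixed number of reduce tasks for this session
set mapred.reduce.tasks=50;

SELECT a.*, b.*
FROM A a JOIN B b ON a.a = b.a;
```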

Hope that helps.

Lukas Vermeer