
When I run "hadoop job -status xxx", the output includes counters like the following:

Rack-local map tasks=124
Data-local map tasks=6

What is the difference between Rack-local map tasks and Data-local map tasks?

Sam
  • Thomas's answer is right, but I'd be worried about the number of rack-local tasks vs. data-local ones. You want a lot more data-local tasks than that. On larger clusters I typically see ~95% of tasks be data-local; with you it is the opposite. – Donald Miner Oct 08 '12 at 00:29
  • @DonaldMiner yes, that is not good. However, it largely depends on how many jobs are running on the cluster. Sometimes you need to sacrifice the performance of one job so that another can run faster. – Thomas Jungblut Oct 08 '12 at 07:02
  • @ThomasJungblut that number still doesn't sound right. On larger clusters with 3x replication, even with full slot capacity, I've seen that number a lot higher. – Donald Miner Oct 08 '12 at 12:54
  • @DonaldMiner It doesn't happen on larger clusters in such an excessive way ;) I guess he's running 3-4 servers and a job blocked the slots on the machines where the data lies (maybe just a replication factor of two?). But since this is just speculation, let's not argue about it too much. – Thomas Jungblut Oct 08 '12 at 13:28

2 Answers


In a data-local task, nothing needs to be copied, because the block is physically on the same server as the computation.

The next tier is the rack-local task: here the data must be copied, because there is no local copy of the desired block available. Note that a rack-local copy only travels over the rack's own switch, not across the network core.

There is also the worst case, where the data is available neither locally nor on the same rack, so it must be copied across two switches to the host where the computation runs. I don't know whether there is a counter for that, but basically it must be #all tasks - #data-local tasks - #rack-local tasks.
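
If you want these numbers programmatically rather than from the CLI output, the counters can also be read through the job API. This is a minimal sketch assuming the Hadoop 2.x org.apache.hadoop.mapreduce API, where the lines in the question correspond to JobCounter.DATA_LOCAL_MAPS and JobCounter.RACK_LOCAL_MAPS; the off-rack count is derived, since there is no dedicated counter for it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;
    import org.apache.hadoop.mapreduce.JobID;

    public class MapLocality {
        public static void main(String[] args) throws Exception {
            Cluster cluster = new Cluster(new Configuration());
            // args[0] is the job id, e.g. job_201210080029_0001
            Job job = cluster.getJob(JobID.forName(args[0]));

            long launched = job.getCounters()
                    .findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
            long dataLocal = job.getCounters()
                    .findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal = job.getCounters()
                    .findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();

            // Whatever is neither data-local nor rack-local had to cross
            // at least two switches. Note that TOTAL_LAUNCHED_MAPS also
            // counts speculative and retried attempts, so treat this as
            // an estimate.
            long offRack = launched - dataLocal - rackLocal;

            System.out.printf("data-local=%d rack-local=%d off-rack=%d%n",
                    dataLocal, rackLocal, offRack);
        }
    }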

Thomas Jungblut

I would point out that providing a gigabit (or faster) network between computers within the same rack is much cheaper than providing it across a larger number of computers. The root cause is that Ethernet switches do not scale: you cannot get a single switch with hundreds of ports at a reasonable price. Because of this, Hadoop tries to run a task at least in the same rack as the data if it cannot run it on the node where the data is stored.
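
For this rack-aware scheduling to work at all, Hadoop has to be told which host lives in which rack, either via a topology script or via a Java class plugged in through the net.topology.node.switch.mapping.impl property. The following is a minimal sketch assuming the Hadoop 2.x DNSToSwitchMapping interface; the host names and rack paths are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Hypothetical static mapping: worker01..worker10 sit in /rack1,
    // everything else in /rack2. Real deployments usually derive this
    // from host naming conventions or an inventory database.
    public class StaticRackMapping implements DNSToSwitchMapping {

        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>(names.size());
            for (String host : names) {
                racks.add(host.matches("worker(0[1-9]|10)") ? "/rack1" : "/rack2");
            }
            return racks;
        }

        @Override
        public void reloadCachedMappings() {
            // Static mapping, nothing to reload.
        }

        @Override
        public void reloadCachedMappings(List<String> names) {
            // Static mapping, nothing to reload.
        }
    }

With a two-level topology like this, HDFS places replicas on more than one rack, so the scheduler can usually fall back to a rack-local slot when the data-local nodes are busy.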

David Gruzman