According to the Hadoop documentation, map tasks should be scheduled on a node that stores their input data on HDFS, provided a slot is available there.
Unfortunately, I found this not to be the case when using the Hadoop Streaming library: tasks were launched on nodes entirely different from the ones physically holding the input file (given by the -input flag), even though no other jobs were running on the system.
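For context, a minimal Hadoop Streaming invocation of the kind described here looks roughly like this (the jar path, input/output paths, and mapper/reducer commands are placeholders, not my exact ones):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /data/large-input-file \
        -output /data/streaming-out \
        -mapper /bin/cat \
        -reducer /usr/bin/wc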
I tested this on multiple systems with Hadoop 2.6.0 and 2.7.2. Is there any way to influence this behaviour? The input files are large, and the unnecessary network traffic significantly degrades overall performance.
* UPDATE *
Following the suggestions in the comments, I tested the issue with "normal" Hadoop jobs as well, specifically with the classic WordCount example. The results were the same: out of 10 executions, only once was the map task scheduled on a node where the data was already present. I verified the block locations and the selected execution node both through the web interfaces (NameNode and YARN web UI) and with command-line tools (hdfs fsck ... and the YARN log files). I freshly restarted the cluster and verified that no other interfering jobs were running.
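The exact fsck options I used are not shown above; a typical invocation to list which DataNodes hold the blocks of a file (path is a placeholder) is:

    hdfs fsck /data/large-input-file -files -blocks -locations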
What I also noticed is that the Data-local map tasks counter is not even present in the summary output of the job; I only get Rack-local map tasks. Of course the task is rack-local, as there is only one rack in this testing environment. Is there a configuration option I am missing?
File System Counters
    FILE: Number of bytes read=3058
    FILE: Number of bytes written=217811
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=3007
    HDFS: Number of bytes written=2136
    HDFS: Number of read operations=6
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
Job Counters
    Launched map tasks=1
    Launched reduce tasks=1
    Rack-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=12652
    Total time spent by all reduces in occupied slots (ms)=13368
    Total time spent by all map tasks (ms)=3163
    Total time spent by all reduce tasks (ms)=13368
    Total vcore-seconds taken by all map tasks=3163
    Total vcore-seconds taken by all reduce tasks=13368
    Total megabyte-seconds taken by all map tasks=12955648
    Total megabyte-seconds taken by all reduce tasks=13688832
Map-Reduce Framework
    Map input records=9
    Map output records=426
    Map output bytes=4602
    Map output materialized bytes=3058
    Input split bytes=105
    Combine input records=426
    Combine output records=229
    Reduce input groups=229
    Reduce shuffle bytes=3058
    Reduce input records=229
    Reduce output records=229
    Spilled Records=458
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=75
    CPU time spent (ms)=1480
    Physical memory (bytes) snapshot=440565760
    Virtual memory (bytes) snapshot=4798459904
    Total committed heap usage (bytes)=310902784
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=2902
File Output Format Counters
    Bytes Written=2136
* UPDATE 2 *
I realized that the only reason the Data-local map tasks counter is missing from the summary is that its value is 0, so it is omitted. I reconfigured the cluster through the net.topology.script.file.name parameter so that each node becomes a separate rack (see the bash example in the Hadoop manual; a sketch follows below).
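My actual script is not important; as a sketch, a topology script along the lines of the manual's bash example, assuming the nodes are passed in as IPv4 addresses, could look like this:

    #!/bin/bash
    # Referenced from core-site.xml via net.topology.script.file.name.
    # Maps every node to its own "rack", so a rack-local assignment can
    # only happen on the node that actually holds the data.
    for node in "$@"; do
        # Use the last octet of the node's IP address as the rack name.
        echo -n "/rack-${node##*.} "
    done
    echo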
Now Hadoop proudly reports that the performed task was not even rack-local. It seems the scheduler (I use the default CapacityScheduler) does not care about data or rack locality at all!
Job Counters
    Launched map tasks=1
    Launched reduce tasks=1
    Other local map tasks=1
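One thing I still plan to check, though I have not verified that it changes anything in my setup: the CapacityScheduler has a node-locality-delay setting in capacity-scheduler.xml that controls how many scheduling opportunities it waits for a node-local container before falling back to rack-local (or off-rack) assignment:

    <property>
        <name>yarn.scheduler.capacity.node-locality-delay</name>
        <!-- Number of missed scheduling opportunities tolerated before the
             scheduler gives up on node-locality; the documentation suggests
             setting it roughly to the number of nodes in the cluster. -->
        <value>40</value>
    </property>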