According to the Hadoop documentation, map tasks should be scheduled on a node that stores their input data on HDFS, provided a slot is available there.
Unfortunately, I found this not to be the case when using the Hadoop Streaming library: tasks were launched on nodes entirely different from the ones physically holding the input file (given by the -input flag), even though no other jobs were running on the system.
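For context, a minimal Hadoop Streaming invocation of the kind described here looks roughly like this (the jar path, input/output paths, and mapper/reducer commands are placeholders, not my exact ones):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /data/large-input-file \
        -output /data/streaming-out \
        -mapper /bin/cat \
        -reducer /usr/bin/wc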
I tested this on multiple systems with Hadoop 2.6.0 and 2.7.2. Is there any way to influence this behaviour? The input files are large, and the unnecessary network traffic significantly degrades overall performance.
* UPDATE *
Following the suggestions in the comments, I tested the issue with "normal" Hadoop jobs as well, specifically with the classic WordCount example. The results were the same: out of 10 executions, only once was the map task scheduled on a node where the data was already present. I verified the block locations and the selected execution node both through the web interfaces (NameNode and YARN web UI) and with command-line tools (hdfs fsck ... and the YARN log files). I freshly restarted the cluster and verified that no other interfering jobs were running.
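The exact fsck options I used are not shown above; a typical invocation to list which DataNodes hold the blocks of a file (path is a placeholder) is:

    hdfs fsck /data/large-input-file -files -blocks -locations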
What I also noticed is that the Data-local map tasks counter is not even present in the summary output of the job; I only get Rack-local map tasks. Of course the task is rack-local, as there is only one rack in this testing environment. Is there a configuration option I am missing?
File System Counters
    FILE: Number of bytes read=3058
    FILE: Number of bytes written=217811
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=3007
    HDFS: Number of bytes written=2136
    HDFS: Number of read operations=6
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
Job Counters
    Launched map tasks=1
    Launched reduce tasks=1
    Rack-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=12652
    Total time spent by all reduces in occupied slots (ms)=13368
    Total time spent by all map tasks (ms)=3163
    Total time spent by all reduce tasks (ms)=13368
    Total vcore-seconds taken by all map tasks=3163
    Total vcore-seconds taken by all reduce tasks=13368
    Total megabyte-seconds taken by all map tasks=12955648
    Total megabyte-seconds taken by all reduce tasks=13688832
Map-Reduce Framework
    Map input records=9
    Map output records=426
    Map output bytes=4602
    Map output materialized bytes=3058
    Input split bytes=105
    Combine input records=426
    Combine output records=229
    Reduce input groups=229
    Reduce shuffle bytes=3058
    Reduce input records=229
    Reduce output records=229
    Spilled Records=458
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=75
    CPU time spent (ms)=1480
    Physical memory (bytes) snapshot=440565760
    Virtual memory (bytes) snapshot=4798459904
    Total committed heap usage (bytes)=310902784
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=2902
File Output Format Counters
    Bytes Written=2136
* UPDATE 2 *
I realized that the only reason the Data-local map tasks counter is missing from the summary is that its value is 0, so it is omitted. I reconfigured the cluster through the net.topology.script.file.name parameter so that each node becomes a separate rack (see the bash example in the Hadoop manual; a sketch follows below).
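My actual script is not important; as a sketch, a topology script along the lines of the manual's bash example, assuming the nodes are passed in as IPv4 addresses, could look like this:

    #!/bin/bash
    # Referenced from core-site.xml via net.topology.script.file.name.
    # Maps every node to its own "rack", so a rack-local assignment can
    # only happen on the node that actually holds the data.
    for node in "$@"; do
        # Use the last octet of the node's IP address as the rack name.
        echo -n "/rack-${node##*.} "
    done
    echo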
Now Hadoop proudly reports that the performed task was not even rack-local. It seems the scheduler (I use the default CapacityScheduler) does not care about data or rack locality at all!
Job Counters
    Launched map tasks=1
    Launched reduce tasks=1
    Other local map tasks=1
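One thing I still plan to check, though I have not verified that it changes anything in my setup: the CapacityScheduler has a node-locality-delay setting in capacity-scheduler.xml that controls how many scheduling opportunities it waits for a node-local container before falling back to rack-local (or off-rack) assignment:

    <property>
        <name>yarn.scheduler.capacity.node-locality-delay</name>
        <!-- Number of missed scheduling opportunities tolerated before the
             scheduler gives up on node-locality; the documentation suggests
             setting it roughly to the number of nodes in the cluster. -->
        <value>40</value>
    </property>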