
I'm experimenting with Hadoop and created a very simple map and reduce job. The input is a 30-line text file and the output is only 3 lines (it's an excerpt of a log file; the map extracts a page name and its execution time, while the reduce calculates the min, max and avg execution times).
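For reference, this is roughly what the job looks like (a sketch, not my exact code; the class names and the field positions in the log line are illustrative):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the job: the mapper emits (page name, execution time) per log line,
// the reducer computes min, max and avg per page.
public class PageTimes {

    // Map: parse one log line and emit (page name, execution time in ms).
    public static class PageTimeMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            // ASSUMPTION: page name in field 2, execution time (ms) in field 3;
            // lines that don't match are simply skipped.
            if (fields.length > 3) {
                context.write(new Text(fields[2]),
                        new LongWritable(Long.parseLong(fields[3])));
            }
        }
    }

    // Reduce: min, max and average execution time per page.
    public static class MinMaxAvgReducer
            extends Reducer<Text, LongWritable, Text, Text> {
        @Override
        protected void reduce(Text page, Iterable<LongWritable> times, Context context)
                throws IOException, InterruptedException {
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0, count = 0;
            for (LongWritable t : times) {
                long v = t.get();
                min = Math.min(min, v);
                max = Math.max(max, v);
                sum += v;
                count++;
            }
            context.write(page, new Text("min=" + min + " max=" + max
                    + " avg=" + (sum / (double) count)));
        }
    }
}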

This simple job takes some 36 seconds to execute on Hadoop in pseudo-distributed mode (fs.default.name=hdfs://localhost, dfs.replication=1, mapred.job.tracker=localhost:8021). This is on a 2.93 GHz Nehalem with 8 GB of memory and an X25-E SSD, running Ubuntu 10.04.
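The driver is nothing special either; roughly like this (again a sketch; paths come from the command line and nothing beyond mapper, reducer and types is configured):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Plain job driver: set mapper, reducer, key/value types and paths, then submit.
public class PageTimesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up the pseudo-distributed *-site.xml
        Job job = new Job(conf, "page times");
        job.setJarByClass(PageTimesDriver.class);
        job.setMapperClass(PageTimes.PageTimeMapper.class);
        job.setReducerClass(PageTimes.MinMaxAvgReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}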

I added debug output to the mapper and reducer for each call, which made clear that all calls to the mapper, and likewise all calls to the reducer, happen within the same second. In other words, the overhead is before and after Hadoop actually calls my code.
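The debug output is nothing more than a timestamped line per invocation, via a tiny helper along these lines (the helper class is mine and purely illustrative; any timestamped logging call would do):

import java.text.SimpleDateFormat;
import java.util.Date;

// Tiny helper called from map() and reduce() to print a wall-clock timestamp
// per invocation, so the task logs show exactly when Hadoop hands control to my code.
public final class CallTrace {
    private static final SimpleDateFormat FMT = new SimpleDateFormat("HH:mm:ss,SSS");

    public static synchronized void log(String where, Object key) {
        System.err.println(FMT.format(new Date()) + " " + where + " called, key=" + key);
    }
}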

This is the final output:

18:16:59 INFO input.FileInputFormat: Total input paths to process : 1
18:16:59 INFO mapred.JobClient: Running job: job_201108251550_0023
18:17:00 INFO mapred.JobClient:  map 0% reduce 0%
18:17:15 INFO mapred.JobClient:  map 100% reduce 0%
18:17:30 INFO mapred.JobClient:  map 100% reduce 100%
18:17:35 INFO mapred.JobClient: Job complete: job_201108251550_0023
18:17:35 INFO mapred.JobClient: Counters: 25
18:17:35 INFO mapred.JobClient:   Job Counters 
18:17:35 INFO mapred.JobClient:     Launched reduce tasks=1
18:17:35 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11305
18:17:35 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
18:17:35 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
18:17:35 INFO mapred.JobClient:     Launched map tasks=1
18:17:35 INFO mapred.JobClient:     Data-local map tasks=1
18:17:35 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=12762
18:17:35 INFO mapred.JobClient:   File Output Format Counters 
18:17:35 INFO mapred.JobClient:     Bytes Written=140
18:17:35 INFO mapred.JobClient:   FileSystemCounters
18:17:35 INFO mapred.JobClient:     FILE_BYTES_READ=142
18:17:35 INFO mapred.JobClient:     HDFS_BYTES_READ=7979
18:17:35 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=41899
18:17:35 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=140
18:17:35 INFO mapred.JobClient:   File Input Format Counters 
18:17:35 INFO mapred.JobClient:     Bytes Read=7877
18:17:35 INFO mapred.JobClient:   Map-Reduce Framework
18:17:35 INFO mapred.JobClient:     Reduce input groups=3
18:17:35 INFO mapred.JobClient:     Map output materialized bytes=142
18:17:35 INFO mapred.JobClient:     Combine output records=0
18:17:35 INFO mapred.JobClient:     Map input records=30
18:17:35 INFO mapred.JobClient:     Reduce shuffle bytes=0
18:17:35 INFO mapred.JobClient:     Reduce output records=3
18:17:35 INFO mapred.JobClient:     Spilled Records=12
18:17:35 INFO mapred.JobClient:     Map output bytes=124
18:17:35 INFO mapred.JobClient:     Combine input records=0
18:17:35 INFO mapred.JobClient:     Map output records=6
18:17:35 INFO mapred.JobClient:     SPLIT_RAW_BYTES=102
18:17:35 INFO mapred.JobClient:     Reduce input records=6

As can be seen, it takes 16 seconds until the map is done, another 15 seconds until the reduce is done, and then another 5 seconds until the job is considered complete.

Here's the syslog of just the reduce task:

18:17:14,894 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
18:17:15,007 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
18:17:15,088 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=130514944, MaxSingleShuffleLimit=32628736
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for merging on-disk files
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for merging in memory files
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread waiting: Thread for merging on-disk files
18:17:15,094 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Need another 1 map output(s) where 0 is already in progress
18:17:15,094 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for polling Map Completion Events
18:17:15,095 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
18:17:20,095 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
18:17:21,103 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 0 files left.
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 1 files left.
18:17:21,110 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
18:17:21,110 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 138 bytes
18:17:21,113 INFO org.apache.hadoop.mapred.ReduceTask: Merged 1 segments, 138 bytes to disk to satisfy reduce memory limit
18:17:21,114 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 142 bytes from disk
18:17:21,114 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
18:17:21,114 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
18:17:21,117 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 138 bytes
18:17:21,163 INFO org.apache.hadoop.mapred.Task: Task:attempt_201108251550_0023_r_000000_0 is done. And is in the process of commiting
18:17:24,170 INFO org.apache.hadoop.mapred.Task: Task attempt_201108251550_0023_r_000000_0 is allowed to commit now
18:17:24,180 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201108251550_0023_r_000000_0' to output.txt
18:17:27,088 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201108251550_0023_r_000000_0' done.
18:17:27,091 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
18:17:27,110 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
18:17:27,110 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName akira for UID 1000 from the native implementation

My actual reducer code is called at 18:17:21.

What gives? Is this normal?

akira
  • Hadoop is better suited for jobs larger than 30 lines. It scales up, but it does not scale down. – Thilo Aug 26 '11 at 10:45
  • Apparently it doesn't, but since it's just a job submitted to a server that is already running on localhost, reading data that is already there, etc., what on earth is it doing that takes over 30 seconds? – akira Aug 26 '11 at 11:35
  • (P.S. In comparison, in that time I can cold boot my entire system into a graphical desktop with several apps open, which means dozens and dozens of processes are started, services initialized, etc. Why would a simple job submission have such a huge amount of overhead?) – akira Aug 26 '11 at 11:37
  • A similar question has been raised before: http://stackoverflow.com/questions/6953077/hadoop-wordcount-example-problem-do-i-need-to-do-some-performance-tuning – Praveen Sripati Aug 26 '11 at 12:29
  • Indeed. It talks about JVM startup, though (which is ~100 ms on my system for a bare hello-world app; even Eclipse, which loads thousands upon thousands of classes, fully starts in less than 2 seconds). It also talks about Hadoop needing to start, but I had already started it beforehand using start-dfs.sh and start-mapred.sh. – akira Aug 26 '11 at 12:38

4 Answers


When your file is 300 million lines, those 30-odd seconds of overhead will be negligible.

pap

This is all normal. But why do you care?

I hope you're prototyping something for a much larger dataset; if not, Hadoop is definitely not your solution.

Hadoop was made to run on hundreds of nodes, not one, and with the expectation of all kinds of failure modes. Systems of this size need very loose coupling to be resilient; this includes things such as tasktrackers polling for jobs. None of these things are optimised for fast startup/shutdown of task attempts.

I'd recommend drawing yourself a picture, from the moment you submit your job to the moment it is done, of which daemons are involved and which machines they might be running on.

Scalability != Performance

mat kelcey
  • > `None of these things are optimised for fast startup/shutdown of task attempts.` You're right, but still one can't help but wonder what on earth Hadoop is doing. Even taking all the things you mention into account, 36 seconds for a job submission is an insane amount of time when all the action takes place on a single local node. I've worked with Globus grids involving hundreds of nodes spread over two locations, with job progress trackers, heartbeats and all, on MUCH slower hardware than the OP mentions, and the overhead didn't even come close to 36 seconds. – Arjan Tijms Aug 26 '11 at 19:53

The way I look at Hadoop is that its primary design goal was to scale horizontally, which means it was made to handle huge volumes efficiently. If you have a very small problem (it fits in the RAM of one system and doesn't take very long), a normal application will always outperform Hadoop. In such cases, the machinery needed for scaling is pure overhead.

Niels Basjes

Hadoop only really scales up from 5 nodes onwards (namenode + secondary namenode + 3 datanodes), which gives RAID5-like safety through 3 replicas. On a single node you can test your code, nothing more. Once you have tested and move to a "real" cluster, you have to know how large your dataset will be (for block sizing); the RAM of the namenode is also an important parameter for a first test, since each block consumes RAM (more space means more RAM).
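As a very rough illustration of the "each block consumes RAM" point (the ~150 bytes of namenode heap per namespace object is only a commonly quoted rule of thumb, and the dataset and block sizes below are made-up example values):

// Back-of-the-envelope namenode heap estimate (illustration only).
// ASSUMPTIONS: ~150 bytes of heap per block object is a rough rule of thumb;
// per-file and per-directory objects are ignored for simplicity.
public class NamenodeHeapSketch {
    public static void main(String[] args) {
        long datasetBytes   = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input data
        long blockBytes     = 64L * 1024 * 1024;                // 64 MB HDFS blocks
        long bytesPerObject = 150;                              // rough per-block cost

        long blocks = datasetBytes / blockBytes;
        long heapBytes = blocks * bytesPerObject;

        System.out.printf("%d blocks -> roughly %.1f MB of namenode heap%n",
                blocks, heapBytes / (1024.0 * 1024.0));
    }
}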

alo.alt