I'm experimenting with Hadoop and created a very simple map-and-reduce job. The input is a 30-line text file, and the output is only 3 lines (the input is an excerpt of a log file; the map extracts a page name and its execution time, while the reduce calculates the min, max, and average execution times).
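The job is essentially equivalent to this sketch (PageTimes, the field names, and the assumed "<page> <millis>" log layout are illustrative placeholders, not my actual code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageTimes {

    // Emits (page, executionTimeMillis) for each parsable log line.
    public static class PageTimeMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: page name and time are the first two fields.
            String[] fields = line.toString().trim().split("\\s+");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]),
                              new LongWritable(Long.parseLong(fields[1])));
            }
        }
    }

    // Calculates min, max and average execution time per page.
    public static class PageTimeReducer
            extends Reducer<Text, LongWritable, Text, Text> {
        @Override
        protected void reduce(Text page, Iterable<LongWritable> times,
                              Context context)
                throws IOException, InterruptedException {
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0, count = 0;
            for (LongWritable t : times) {
                min = Math.min(min, t.get());
                max = Math.max(max, t.get());
                sum += t.get();
                count++;
            }
            context.write(page, new Text(
                    "min=" + min + " max=" + max + " avg=" + (sum / count)));
        }
    }
}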
This simple job takes about 36 seconds to execute on Hadoop in pseudo-distributed mode (fs.default.name=hdfs://localhost, dfs.replication=1, mapred.job.tracker=localhost:8021). This is on a 2.93 GHz Nehalem with 8 GB of memory and an X25-E SSD, running Ubuntu 10.04.
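In other words, the standard pseudo-distributed setup; the relevant site files look roughly like this:

core-site.xml:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
  </property>

hdfs-site.xml:

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

mapred-site.xml:

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>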
I added debug output to the mapper and the reducer for each call, which made it clear that every call to the mapper, and likewise to the reducer, happens within the same second. In other words, the overhead comes before and after Hadoop actually calls my code.
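The debug output itself is trivial, just a timestamped line per invocation, something like this (a sketch; the exact message is unimportant):

// Inside map() and reduce(); ends up in the task's stderr log.
System.err.println(System.currentTimeMillis() + " map() called with: " + line);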
This is the final output:
18:16:59 INFO input.FileInputFormat: Total input paths to process : 1
18:16:59 INFO mapred.JobClient: Running job: job_201108251550_0023
18:17:00 INFO mapred.JobClient: map 0% reduce 0%
18:17:15 INFO mapred.JobClient: map 100% reduce 0%
18:17:30 INFO mapred.JobClient: map 100% reduce 100%
18:17:35 INFO mapred.JobClient: Job complete: job_201108251550_0023
18:17:35 INFO mapred.JobClient: Counters: 25
18:17:35 INFO mapred.JobClient: Job Counters
18:17:35 INFO mapred.JobClient: Launched reduce tasks=1
18:17:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11305
18:17:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
18:17:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
18:17:35 INFO mapred.JobClient: Launched map tasks=1
18:17:35 INFO mapred.JobClient: Data-local map tasks=1
18:17:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=12762
18:17:35 INFO mapred.JobClient: File Output Format Counters
18:17:35 INFO mapred.JobClient: Bytes Written=140
18:17:35 INFO mapred.JobClient: FileSystemCounters
18:17:35 INFO mapred.JobClient: FILE_BYTES_READ=142
18:17:35 INFO mapred.JobClient: HDFS_BYTES_READ=7979
18:17:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=41899
18:17:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=140
18:17:35 INFO mapred.JobClient: File Input Format Counters
18:17:35 INFO mapred.JobClient: Bytes Read=7877
18:17:35 INFO mapred.JobClient: Map-Reduce Framework
18:17:35 INFO mapred.JobClient: Reduce input groups=3
18:17:35 INFO mapred.JobClient: Map output materialized bytes=142
18:17:35 INFO mapred.JobClient: Combine output records=0
18:17:35 INFO mapred.JobClient: Map input records=30
18:17:35 INFO mapred.JobClient: Reduce shuffle bytes=0
18:17:35 INFO mapred.JobClient: Reduce output records=3
18:17:35 INFO mapred.JobClient: Spilled Records=12
18:17:35 INFO mapred.JobClient: Map output bytes=124
18:17:35 INFO mapred.JobClient: Combine input records=0
18:17:35 INFO mapred.JobClient: Map output records=6
18:17:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=102
18:17:35 INFO mapred.JobClient: Reduce input records=6
As can be seen, it takes 16 seconds until the map is done, another 15 seconds until the reduce is done, and then another 5 seconds until the job is considered complete (16 + 15 + 5 = 36 seconds, the total runtime).
Here's the syslog of just the reduce task:
18:17:14,894 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
18:17:15,007 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
18:17:15,088 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=130514944, MaxSingleShuffleLimit=32628736
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for merging on-disk files
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for merging in memory files
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread waiting: Thread for merging on-disk files
18:17:15,094 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Need another 1 map output(s) where 0 is already in progress
18:17:15,094 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for polling Map Completion Events
18:17:15,095 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
18:17:20,095 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
18:17:21,103 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 0 files left.
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 1 files left.
18:17:21,110 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
18:17:21,110 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 138 bytes
18:17:21,113 INFO org.apache.hadoop.mapred.ReduceTask: Merged 1 segments, 138 bytes to disk to satisfy reduce memory limit
18:17:21,114 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 142 bytes from disk
18:17:21,114 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
18:17:21,114 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
18:17:21,117 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 138 bytes
18:17:21,163 INFO org.apache.hadoop.mapred.Task: Task:attempt_201108251550_0023_r_000000_0 is done. And is in the process of commiting
18:17:24,170 INFO org.apache.hadoop.mapred.Task: Task attempt_201108251550_0023_r_000000_0 is allowed to commit now
18:17:24,180 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201108251550_0023_r_000000_0' to output.txt
18:17:27,088 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201108251550_0023_r_000000_0' done.
18:17:27,091 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
18:17:27,110 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
18:17:27,110 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName akira for UID 1000 from the native implementation
As the log shows, my actual reducer code isn't called until 18:17:21.
What gives? Is this normal?