The gist of my problem is: how does one decrease the number of map sweeps a job needs? The number of map tasks for a job is data_size/HDFS_BLOCK_SIZE, and the number of sweeps it takes to complete depends on how many map slots we have. Assuming I am running nothing else and just this one job, I find that the per-node CPU utilization is low (implying I could actually run more map tasks per node). I played with the mapred.tasktracker.map.tasks.maximum parameter (for example, each of my nodes has 32 processors and I set it as high as 30), but I could never increase the number of map slots, and the overall CPU utilization stays around 60%. Are there any other parameters to play with? The data size I have is large enough (32 GB on an 8-node cluster, each node with 32 CPUs), and the job does take two map sweeps (the first sweep does maps 1-130 and the second sweep completes the rest).
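To make the arithmetic explicit, here is how I am counting sweeps (back-of-the-envelope only, using the approximate figures above; the real schedule also depends on locality, stragglers and speculative execution):

```java
// Back-of-the-envelope arithmetic only -- approximate figures from the question.
public class SweepEstimate {
    public static void main(String[] args) {
        long inputBytes   = 32L * 1024 * 1024 * 1024; // ~32 GB of input
        long blockSize    = 128L * 1024 * 1024;       // HDFS block size = split size
        int  nodes        = 8;                        // cluster nodes
        int  slotsPerNode = 30;                       // mapred.tasktracker.map.tasks.maximum

        long mapTasks       = (inputBytes + blockSize - 1) / blockSize; // ~256 here; logs show ~249
        int  concurrentMaps = nodes * slotsPerNode;                     // 240 slots cluster-wide
        long sweeps         = (mapTasks + concurrentMaps - 1) / concurrentMaps; // I expect ~2

        System.out.println(mapTasks + " maps / " + concurrentMaps
                + " slots -> " + sweeps + " sweep(s) expected");
    }
}
```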
-
Are you trying to increase or decrease the number of maps in your program? Did you check this? http://stackoverflow.com/questions/10448204/how-to-increase-the-mappers-and-reducers-in-hadoop-according-to-number-of-instan/11513527#11513527 – Animesh Raj Jha Aug 04 '12 at 06:02
-
No. I am not changing the number of maps and I don't change the split/block size. Essentially, I want to keep the same number of maps (so each map's split size = HDFS block size = 128 MB in my case), but I want to decrease the number of map sweeps. The issue is that I don't know how to get more map slots to open up. – user428900 Aug 04 '12 at 06:07
-
What are map sweeps? As far as I understand, you want to increase the number of maps per node. Each map works on a split, so you must have more splits to get more maps. – Animesh Raj Jha Aug 04 '12 at 18:09
-
You cannot have more maps than the total number of splits of your input file. Also note that when you decrease the split size you are not changing your data block size; you are just telling the framework how many pieces to break your data into, so that each map gets one or more chunks, i.e. one or more splits. Also, you have already configured the maximum number of maps to 30 per node, so you should be able to execute 30*8 = 240 maps in the cluster, provided you have a sufficient number of splits. – Animesh Raj Jha Aug 04 '12 at 18:26
-
By sweeps, I mean the number of map waves that need to happen before all map tasks are complete. For my specific case, I don't see 240 maps running at the same time, even though I do have enough splits: the data is ~32 GB = ~249 blocks of 128 MB (my HDFS block size). You would expect a first wave of 240 maps across my cluster, but I only see about 120 maps (you can see this in the log). Then the second wave starts and finishes the job, and I wonder why the first wave doesn't use more CPUs. – user428900 Aug 04 '12 at 19:19
-
249 blocks does not mean you have 249 splits; your split size might be bigger than the block size. Set a smaller split size and try it: conf.set("mapred.max.split.size", "1020"); Job job = new Job(conf, "My job name"); – Animesh Raj Jha Aug 04 '12 at 20:47
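(For readers who do want to shrink the split size, a slightly fuller sketch of that suggestion is below. The class name, the 64 MB value, and the new-API helper are illustrative, not from the thread; in practice you would pick a sensible split size, not 1020 bytes.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Old-style property name, as in the comment above (value is in bytes):
        conf.set("mapred.max.split.size", String.valueOf(64L * 1024 * 1024));
        Job job = new Job(conf, "My job name");
        // The new-API helper sets the same limit:
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // ... then set the mapper/reducer, input/output paths, and submit the job.
    }
}
```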
-
The split size is the chunk of data given to each map; one split might contain two or more blocks. – Animesh Raj Jha Aug 04 '12 at 21:23
-
There are 249 blocks and 249 splits. How do I know? You can figure out from the log how much data each map task handled (it has offset+length identifiers). – user428900 Aug 04 '12 at 23:18
2 Answers
In case anyone hasn't told you yet: MapReduce is mainly IO bound. It has to read a lot of data from disk, write it back, read it and write it again; in between the reads and writes it executes your map and reduce logic.
From what I have heard, the way to lift CPU usage is to make the cluster no longer IO bound:
- RAID-0 or RAID-10 your hard disks and get the fastest disks out there. In the consumer market there are the Western Digital VelociRaptors with 10k RPM.
- SSDs don't contribute too much, since Hadoop is mostly optimized for sequential reads.
- Give the cluster as much network bandwidth as possible.
- Provide lots of RAM for disk caching.
Even then, you will still see less than 100% CPU utilization, but it is much better and the performance will skyrocket.
However, CPU utilization is not a good metric for a Hadoop cluster, as you might conclude from the points above. Hadoop is mainly about the reliable storage of data, giving you neat features to crunch it. It is not meant to give you super-computer performance; if you need that, get an MPI cluster and a Ph.D. to code your algorithms ;)

-
In my case, I know for sure the code is CPU bound (I have a 1 Gb network, lots of memory and fast SSDs). BTW, I am not sure why you are saying SSDs don't contribute much: SSDs offer much better bandwidth and there is no seek latency (a huge factor during random I/O). However, I agree with you in general that CPU utilization may not be a good metric. I am just trying to understand why I am not able to get my CPU utilization higher with my CPU-bound code. – user428900 Aug 04 '12 at 15:30
-
Hadoop has no random IO. See http://hadoopblog.blogspot.de/2012/05/hadoop-and-solid-state-drives.html for more explanation. Also I can't say much about the CPU boundaries when I haven't seen the algorithm. – Thomas Jungblut Aug 04 '12 at 15:33
-
The link only talks about whether we would get the full benefit of SSDs. I don't think you can claim Hadoop has no random IO; think of a production environment with lots of people submitting jobs, where the data access could be random. Anyway, my issue is merely to understand why I am not able to get my CPU utilization up (the code is wordcount, and I am sure some optimizations such as a combiner are possible). – user428900 Aug 06 '12 at 16:24
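(For reference, the combiner optimization mentioned in this comment is wired in with job.setCombinerClass. Below is a sketch based on the stock Hadoop WordCount example, not the asker's actual code; class names follow that example.)

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner mentioned above: pre-aggregates counts on the map side,
        // reducing the amount of data shuffled to the reducers.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```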
Sorry for the thrash, but something must have gone wrong with my installation. I happened to reinstall Hadoop and it now works as expected. I guess some parameter must have been conflicting.
