
I have an intuition that increasing/decreasing the number of nodes interactively on a running job can speed up map-heavy jobs, but won't help with reduce-heavy jobs, where most of the work is done by the reduce phase.

There's an FAQ about this, but it doesn't really explain it very well:

http://aws.amazon.com/elasticmapreduce/faqs/#cluster-18

This question was answered by Christopher Smith, who gave me permission to post here.


As always... "it depends". One thing you can pretty much always count on: adding nodes later on is not going to help you as much as having the nodes from the get-go.

When you create a Hadoop job, it gets split up into tasks. These tasks are effectively "atoms of work". Hadoop lets you tweak the number of map and reduce tasks during job creation, but once the job is created, it is static. Tasks are assigned to "slots". Traditionally, each node is configured to have a certain number of slots for map tasks and a certain number of slots for reduce tasks, but you can tweak that. Some newer versions of Hadoop don't require you to designate the slots as being for map or reduce tasks. Anyway, the JobTracker periodically assigns tasks to slots. Because this is done dynamically, new nodes coming online can speed up the processing of a job by providing more slots to execute the tasks.
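
If it helps to see those knobs concretely, here's a minimal sketch using Hadoop's Java job API. The property names are the classic MRv1 ones (they differ in newer Hadoop versions) and every value is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Per-node slot counts are really cluster-side (tasktracker) settings
        // in classic MRv1; named here only to show where the slots come from.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);

        // The number of map tasks is ultimately derived from the input splits;
        // this value is only a hint.
        conf.setInt("mapred.map.tasks", 100);

        Job job = Job.getInstance(conf, "example-job");
        // The number of reduce tasks is fixed when the job is created.
        job.setNumReduceTasks(20);
    }
}
```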

This sets the stage for understanding the reality of adding new nodes. There's obviously an Amdahl's law issue where having more slots than pending tasks accomplishes little (if you have speculative execution enabled, it does help somewhat, as Hadoop will schedule the same task to run on many different nodes, so that a slow node's tasks can be completed by faster nodes if there are spare resources). So, if you didn't define your job with many map or reduce tasks, adding more nodes isn't going to help much. Of course, each task imposes some overhead, so you don't want to go crazy high either. That's why I suggest, as a guideline, that a task should be "something which takes ~2-5 minutes to execute".
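
As a sketch of the two knobs mentioned above (speculative execution and task sizing), again assuming the classic MRv1 property names and a purely illustrative split size:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TaskSizingSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Speculative execution: re-run straggler tasks on spare slots so a
        // slow node can't hold the whole job hostage.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

        Job job = Job.getInstance(conf, "task-sizing-example");

        // One way to steer map tasks toward the ~2-5 minute range: raise the
        // minimum input split size so each task chews through more data.
        // 512 MB is purely illustrative.
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
    }
}
```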

Of course, when you add nodes dynamically, they have one other disadvantage: they don't have any local data. Obviously, if you are at the start of an EMR pipeline, none of the nodes have data in them, so it doesn't matter, but if you have an EMR pipeline made of many jobs, with earlier jobs persisting their results to HDFS, you get a huge performance boost because the JobTracker will favour shaping and assigning tasks so nodes have that lovely locality of data (this is a core trick of the whole MapReduce design to maximize performance). On the reducer side, data is coming from other map tasks, so dynamically added nodes are really at no disadvantage compared to other nodes.

So, in principle, dynamically adding new nodes is actually less likely to help with I/O-bound map tasks that are reading from HDFS.

Except...

Hadoop has a variety of cheats under the covers to optimize performance. One is that it starts transmitting map output data to the reducers before the map task completes/the reducer starts. This obviously is a critical optimization for jobs where the mappers generate a lot of data. You can tweak when Hadoop starts to kick off the transfers. Anyway, this means that a newly spun up node might be at a disadvantage, because the existing nodes might already have such a huge data advantage. Obviously, the more output that the mappers have transmitted, the larger the disadvantage.
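
The point at which reducers start pulling map output is controlled by a "slowstart" fraction. A hedged sketch, using the classic MRv1 property name and an illustrative value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Fraction of map tasks that must finish before reducers start pulling
        // map output. The classic default is 0.05; raising it delays the
        // transfers, lowering it kicks them off earlier.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);

        Job job = Job.getInstance(conf, "slowstart-example");
    }
}
```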

That's how it all really works. In practice though, a lot of Hadoop jobs have mappers processing tons of data in a CPU-intensive fashion, but outputting comparatively little data to the reducers (or they might send a lot of data to the reducers, but the reducers are still very simple, so not CPU-bound at all). Often jobs will have few (sometimes even 0) reducer tasks, so extra nodes don't help much there: if you already have a reduce slot available for every outstanding reduce task, new nodes can't help. New nodes also disproportionately help out with CPU-bound work, for obvious reasons, and because that tends to be map tasks more than reduce tasks, that's where people typically see the win. If your mappers are I/O-bound and pulling data from the network, adding new nodes obviously increases the aggregate bandwidth of the cluster, so it helps there, but if your map tasks are I/O-bound reading from HDFS, the best thing is to have more initial nodes, with data already spread over HDFS. It's not unusual to see reducers get I/O-bound because of poorly structured jobs, in which case adding more nodes can help a lot, because it splits up the bandwidth again.
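
For what it's worth, a "zero reducer" job really is just that; a minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-example");

        // Zero reduce tasks: map output goes straight to the output format,
        // skipping the shuffle and reduce phase entirely.
        job.setNumReduceTasks(0);
    }
}
```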

There's a caveat there too, of course: with a really small cluster, reducers get to read a lot of their data from the mappers running on the local node, and adding more nodes shifts more of the data to being pulled over the much slower network. You can also have cases where reducers spend most of their time just multiplexing data from all the mappers sending them data (although that is tunable as well).
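
One of the relevant tunables there is how many map outputs each reducer fetches in parallel; a sketch assuming the classic MRv1 property name and an illustrative value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleFetchSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Number of map outputs each reducer fetches in parallel (classic
        // default is 5). Raising it can help when reducers mostly sit waiting
        // on data from a large number of mappers; 20 is illustrative.
        conf.setInt("mapred.reduce.parallel.copies", 20);

        Job job = Job.getInstance(conf, "shuffle-fetch-example");
    }
}
```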

If you are asking questions like this, I'd highly recommend profiling your job using something like Amazon's offering of KarmaSphere. It will give you a better picture of where your bottlenecks are and what your best strategies are for improving performance.
