Map Job Performance on cluster

Question

Suppose I have 15 blocks of data and two clusters. The first cluster has 5 nodes and a replication factor of 1, while the second has a replication factor of 3. If I run my map job on each, should I expect any difference in the performance or execution time of the map job?

In other words, how does replication affect the performance of the mapper on a cluster?

ChuckCottrill · Accepted Answer · 2013-10-15T19:50:49.533

When the JobTracker assigns tasks to TaskTrackers, each task is assigned to a particular node based on locality of the data (preference is the same node first, then the same network switch/frame). A lower replication factor limits the JobTracker's ability to assign a node that is local to the data (the JobTracker will still assign tasks to nodes, but without the benefits of locality). The effect is to restrict the number of TaskTracker nodes that are local to the data (either data on the task node, or data on the same switch/frame), thus affecting performance for work on your task (reducing parallelization).

Your smaller cluster likely has a single switch, so all data is local to the network/frame. The only bottleneck you might experience would be data transfer from one TaskTracker to another, as the JobTracker is likely to assign tasks to all available TaskTrackers.

But with a larger Hadoop cluster, a replication factor of 1 would limit the number of TaskTracker nodes that are local to the data and thus able to operate on it efficiently.
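As a rough illustration of that limit: assuming each replica of a block sits on a different node, and ignoring rack layout, the share of nodes that could run a task node-locally for one given block is min(r, n) / n. The function name and the 100-node cluster below are hypothetical, chosen only to show the arithmetic.

```python
def node_local_fraction(replication: int, nodes: int) -> float:
    """Fraction of nodes that hold a replica of one given block.

    Assumes every replica is stored on a distinct node (illustrative model,
    not Hadoop's placement code).
    """
    if replication < 1 or nodes < 1:
        raise ValueError("replication and nodes must be positive")
    return min(replication, nodes) / nodes


# 5 nodes matches the question; 100 nodes is a made-up "larger cluster".
for nodes in (5, 100):
    for replication in (1, 3):
        share = node_local_fraction(replication, nodes)
        print(f"nodes={nodes} replication={replication} node-local share={share:.2f}")
```

This only counts where a node-local assignment is possible; it says nothing about how much faster such a task actually runs.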

Several papers support data locality: http://web.eecs.umich.edu/~michjc/papers/tandon_hpdic_minimizeRemoteAccess.pdf; the paper you cited, http://assured-cloud-computing.illinois.edu/sites/default/files/PID1974767.pdf; and this one, http://www.eng.auburn.edu/~xqin/pubs/hcw10.pdf (which tested a 5-node cluster, same as the OP).

This paper quotes significant benefits to data locality, http://grids.ucs.indiana.edu/ptliupages/publications/InvestigationDataLocalityInMapReduce_CCGrid12_Submitted.pdf, and observes that an increase in replication factor gives better locality.

Note that this paper claims little difference between network throughput and local disk access (8%), http://www.cs.berkeley.edu/~ganesha/disk-irrelevant_hotos2011.pdf, but reports orders of magnitude difference in performance between local memory access and either disk or network access. Furthermore, the paper reports that a large fraction of jobs (64%) find their data cached in memory "in large part due to the heavy-tailed nature of the workload", as most jobs "access only a small fraction of the blocks".

Indeed, the problem with replication factor 1 is the same on small and big clusters, but if you have a small cluster you might have space issues and less network communication. Perhaps you are also only interested in experimenting with Hadoop a bit, so the tradeoff might be worthwhile. If you have a larger cluster with sufficient resources, you sacrifice both performance due to additional network communication and reliability since you will lose blocks to storage failure. — Alex A., Oct 15 '13 at 04:51
Parallelism is not reduced. You can run on the same number of nodes in parallel, regardless of whether you can run with data locality or not. — cabad, Oct 15 '13 at 14:29
Yes, data locality increases significantly with more replicas. However, it does not affect the performance much (e.g., job completion times). The question was not about improving locality, but about improving performance. — cabad, Oct 15 '13 at 17:44

cabad · Answer 2 · 2013-10-15T17:45:45.700

EDIT: This part of my answer is obsolete now that the other answer was edited: "The other answer is not entirely correct." This was meant to address the incorrect implication that fewer replicas = less parallelism. The rest of my answer (below) still applies.

Any node can execute your tasks, regardless of whether the data is located on that node or not. Hadoop will try to achieve data locality (preference order is: node-local, then rack-local, then any node), but if it can't, it will choose any node that has available compute capacity to run your task.
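The preference order above can be sketched as follows. This is a simplified model, not Hadoop's scheduler code: the `Node` class, `locality_level`, and `pick_node` are hypothetical names, and the real scheduler matches tasks to nodes as they report free slots rather than ranking all free nodes at once.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    name: str
    rack: str


def locality_level(candidate: Node, replica_nodes: Sequence[Node]) -> int:
    """Return 0 for node-local, 1 for rack-local, 2 for any other node."""
    if any(candidate.name == r.name for r in replica_nodes):
        return 0
    if any(candidate.rack == r.rack for r in replica_nodes):
        return 1
    return 2


def pick_node(free_nodes: Sequence[Node], replica_nodes: Sequence[Node]) -> Optional[Node]:
    """Pick the free node closest to the data; any free node is still acceptable."""
    if not free_nodes:
        return None  # no compute capacity available right now
    return min(free_nodes, key=lambda n: locality_level(n, replica_nodes))


a, b, c = Node("a", "rack1"), Node("b", "rack1"), Node("c", "rack2")
replicas = [a]  # replication factor 1: the block lives only on node a
print(pick_node([c, b], replicas).name)  # a is busy, so the rack-local node b is chosen
print(pick_node([c], replicas).name)     # only an off-rack node is free; the task still runs
```

The point of the sketch is the fallback: a task is never blocked for lack of a local replica, it just reads its input over the network.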

Performance-wise, in a typical multi-rack installation, rack-local performs almost as well as node-local, since the bottleneck occurs when transmitting data across racks. However, with high-end networking equipment (i.e., full-bisection bandwidth), it wouldn't matter whether your computation is rack-local or not. For more details on this, read this paper.

How much performance improvement can you expect from having more replicas (and thus achieving higher data locality)? Not much; 5-20% maximum improvement. But this is an upper limit, reached when you implement additional popularity-based replication as in this and this projects. NOTE: I did not just make up those performance improvement numbers; they come from the papers I linked.

Since vanilla Hadoop does not have these mechanisms in place, I would expect your performance to improve by at most 1-5%. This is just a ballpark guess, but you can easily run some tests yourself. The reason is that more replicas could improve the performance of some of your map tasks (the ones that are now able to run with a data-local copy of the block), but would not improve your shuffle and reduce phases. Furthermore, if even one mapper takes longer than the rest, that one determines the length of your whole map phase; so for many jobs, it is likely that increasing locality will not improve their running times at all. Finally, I/O-bound jobs can be map-input I/O bound, shuffle I/O bound (map-output heavy), or reduce-output I/O bound. Only the first type (map-input I/O bound) would benefit from locality. More details on MapReduce workload characterization in this paper.
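The straggler argument can be shown with a toy calculation. It assumes all map tasks run in a single wave (one slot per task), so the map phase lasts as long as its slowest task. The task durations are invented for illustration and are not measurements.

```python
from statistics import mean


def map_phase_duration(task_seconds):
    """Duration of a single-wave map phase: the slowest task sets the length."""
    return max(task_seconds)


# Hypothetical numbers: 15 map tasks, one per block.
all_remote = [60.0] * 15             # every task reads its block over the network
mostly_local = [50.0] * 14 + [60.0]  # 14 tasks got a local replica, one did not

print(mean(all_remote), map_phase_duration(all_remote))
print(mean(mostly_local), map_phase_duration(mostly_local))
```

Under these assumptions the average task gets faster, but the phase length is unchanged because one non-local (or otherwise slow) task still finishes last.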

If you are further interested in this, you can also read this paper, in which they improve the running times of mappers by having the input data of ALL the mappers in memory.

The question was what effect replication would have on the mapper, and this is a big question, because data locality has been found important (see citations) to performance. A better understanding of the OP's workload is needed. — ChuckCottrill, Oct 15 '13 at 17:51
Ok, I get that. But let me explain better: A single mapper can see its performance improved by data locality. However, the map phase (I guess this is what he means by map job) will not be affected by this, since the map phase duration is determined by stragglers (single tasks that take longer for whatever reason). That is why it doesn't really matter. I am not making this up. I've tested 100% locality (easy to do, just use replication factor = number of nodes in cluster) and observed the results. Improvement is minimal at the job level; the reason for this is: stragglers + shuffle phase. — cabad, Oct 15 '13 at 18:02
