17

YARN differs from the original MapReduce architecture in its infrastructure layer in the following way:

In YARN, the JobTracker is split into two different daemons: the ResourceManager and the NodeManager (which is node-specific). The ResourceManager manages only the allocation of resources to the different jobs, and its scheduler takes care of scheduling jobs without worrying about monitoring or status updates. Resources such as memory, CPU time, network bandwidth, etc. are bundled into one unit called a resource container. Different ApplicationMasters running on different nodes talk to a number of these resource containers and accordingly update the NodeManager with the monitoring/status details.
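To make the container idea concrete, here is a toy sketch of it (the class and field names are purely illustrative, not the actual Hadoop YARN API): resources are bundled into one allocatable unit, and a node can host any mix of containers that fits its capacity.

```python
# Toy model of YARN's resource-container concept (illustrative only;
# not the real org.apache.hadoop.yarn API).
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Container:
    """A bundle of resources allocated as a single unit."""
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Tracks one node's capacity and the containers running on it."""
    memory_mb: int
    vcores: int
    running: list = field(default_factory=list)

    def can_fit(self, c: Container) -> bool:
        used_mem = sum(r.memory_mb for r in self.running)
        used_cpu = sum(r.vcores for r in self.running)
        return (used_mem + c.memory_mb <= self.memory_mb
                and used_cpu + c.vcores <= self.vcores)

    def launch(self, c: Container) -> bool:
        if self.can_fit(c):
            self.running.append(c)
            return True
        return False


node = NodeManager(memory_mb=8192, vcores=8)
assert node.launch(Container(memory_mb=4096, vcores=4))      # a big task
assert node.launch(Container(memory_mb=1024, vcores=1))      # a small task
assert not node.launch(Container(memory_mb=4096, vcores=4))  # no room left
```

The point of the model is that nothing in the container says "map" or "reduce"; it is just a quantity of resources.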

I want to know how using this kind of approach increases performance from the MapReduce perspective. Also, if there is any definitive content on the motivation behind YARN and its benefits over the existing implementation of MapReduce, please point me to it.

twid
Abhishek Jain

5 Answers

20

Here are some of the articles (1, 2, 3) about YARN. These talk about the benefits of using YARN.

YARN is more general than MR, and it should be possible to run other computing models like BSP besides MR. Prior to YARN, separate clusters were required for MR, BSP, and others. Now they can coexist in a single cluster, which leads to higher usage of the cluster. Here are some of the applications ported to YARN.

From a MapReduce perspective, legacy MR has separate slots for map and reduce tasks, but in YARN there is no fixed purpose for a container. The same container can be used for a map task, a reduce task, a Hama BSP task, or something else. This leads to better utilization.
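A toy scheduling calculation (hypothetical numbers, just to illustrate the utilization argument) shows why fungible containers finish a map-heavy workload in fewer waves than fixed map/reduce slots:

```python
import math


def waves_fixed_slots(map_tasks, map_slots, reduce_slots):
    # Legacy MR: only map slots can run map tasks; the reduce slots
    # sit idle during a map-only phase and contribute nothing.
    return math.ceil(map_tasks / map_slots)


def waves_containers(map_tasks, total_containers):
    # YARN: any container can run any task type, so the whole
    # node capacity is available to the map phase.
    return math.ceil(map_tasks / total_containers)


# A node with capacity for 4 concurrent tasks, configured as
# 2 map + 2 reduce slots, needs 4 waves for 8 map tasks;
# with 4 general-purpose containers it needs only 2.
assert waves_fixed_slots(8, map_slots=2, reduce_slots=2) == 4
assert waves_containers(8, total_containers=4) == 2
```

The same hardware finishes in half the waves simply because no capacity is reserved for a task type that is not currently running.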

Also, it makes it possible to run different versions of Hadoop in the same cluster, which is not possible with legacy MR, and which makes things easier from a maintenance point of view.

Here are some of the additional links for YARN. Also, Hadoop: The Definitive Guide, 3rd Edition has an entire section dedicated to YARN.

FYI, it had been a bit controversial to develop YARN instead of using some of the frameworks that had been doing something similar and had been running successfully for ages with their bugs ironed out.

Praveen Sripati
7

I do not think that YARN will speed up the existing MR framework. Looking at the architecture, we can see that the system is now more modular - but modularity usually contradicts higher performance.
It can be claimed that YARN has nothing to do with MapReduce; MapReduce has just become one of the YARN applications. You can see it as moving from an embedded program to an embedded OS with a program inside it.
At the same time, YARN opens the door for different MR implementations with different frameworks. For example, if we assume that our dataset is smaller than the cluster memory, we can get much better performance. I think http://www.spark-project.org/ is one such example.
To summarize: YARN does not improve the existing MR, but it will enable other MR implementations to be better in all aspects.

Abhishek Jain
David Gruzman
  • I agree that Yarn has the added benefit of running other paradigms or a different version of Map Reduce in the same cluster, but it would be wrong to say that it does not bring any performance benefits. Yarn increases the utilisation of cluster resources: there are no predefined map or reduce slots, and each job is free to ask for as many resources as it needs, comprising CPU time as well as memory. If there is greater utilisation of the resources, then the performance of each individual job naturally increases. – Abhishek Jain Oct 21 '12 at 11:41
  • I agree that Yarn gives much more flexibility and leads to better utilization. At the same time, I do not expect Yarn to improve the performance of a single job in most cases. – David Gruzman Oct 21 '12 at 12:41
  • I still cannot understand that completely. Can you throw some more light on it? I am looking at it from the following perspective: I am trying to run a lot of mappers on the cluster and the reduce slots are sitting idle (talking about the older map reduce paradigm). In that case, the mappers cannot utilize these idle resources and run; instead they need to wait until some resources are freed by the other mapper jobs. Hence, there is a latency involved in this case which goes away in the case of Yarn. – Abhishek Jain Oct 21 '12 at 18:04
  • I see it as follows: reduce slots take only RAM, while CPU, disk, and network are still utilized by mappers, so the gain is not that big. Hadoop is usually a CPU/IO-bound, not a RAM-bound, system. Having no reducers running during the map stage prevents data from finished mappers from already being sent to reducers, thus increasing latency. So in some cases "late reducers" will decrease latency, in others increase it. I also see that in many cases shuffling and the reducers are actually the bottleneck of the job. – David Gruzman Oct 22 '12 at 06:16
  • I agree on the last point. Correct me if I am wrong: TaskTrackers have pre-configured map/reduce slots. Imagine a situation where I have multiple independent map reduce jobs, i.e. map-reduce job1 has mappers 1,2,3,4 & reducers 1,2; map-reduce job2 has mappers 4,5,6,7 & reducers 3,4. Now, imagine that these 2 jobs do not consume each other's output and are not related to each other except that they are running on the same Hadoop cluster. In this case, my mappers/reducers may have to wait for the slots to become free and available, and thus there is added latency, which is avoided in the case of Yarn. – Abhishek Jain Oct 22 '12 at 14:30
  • I agree with your example. For the multiple jobs there must be cases when Yarn will improve overall cluster utilization. – David Gruzman Oct 22 '12 at 15:42
  • @AbhishekJain, I understood the problem with MapReduce 1 and its predefined number of map slots and reduce slots. How does Yarn address the problem with containers? Could you explain with the same example? – Jon Andrews Apr 16 '17 at 04:14
3

All the above answers cover a lot of information; I am simplifying it as follows:

MapReduce vs. YARN:

1. MapReduce is platform plus application in Hadoop 1.0, and only one of the applications in Hadoop 2.0. YARN is the platform in Hadoop 2.0 and does not exist in Hadoop 1.0.

2. MapReduce is a single-purpose system: it can run MapReduce jobs only. YARN is a multi-purpose (general-purpose) system: it can run MapReduce, Spark, Tez, Flink, BSP, MPP, MPI, Giraph, etc.

3. MapReduce has a JobTracker scalability problem, since one daemon handles both resource management and job management. In YARN, resource management and application management are separated and handled by the RM plus NMs and by paradigm-specific AMs, respectively.

4. MapReduce has a poor resource management model, i.e., fixed map/reduce slots. YARN has flexible resource management, i.e., containers.

5. MapReduce is not highly available. YARN offers high availability and reliability.

6. MapReduce scales out to about 5,000 nodes. YARN scales out to 10,000-plus nodes.

7. MapReduce: Job -> tasks. YARN: Application -> DAG of jobs -> tasks.

8. Classical MapReduce = MapReduce API + MapReduce framework + MapReduce system. YARN MapReduce = MapReduce API + MapReduce framework + YARN system. So MR programs written for Hadoop 1.0 also run on YARN without changing a single line of code, i.e., backward compatibility.
Naga
3

Let's look at the Hadoop 1.0 drawbacks that have been addressed by Hadoop 2.0 with the addition of YARN.

  1. Issue of scalability: the JobTracker runs on a single machine even though you have thousands of nodes in the Hadoop cluster. Its responsibilities include resource management, job and task scheduling, and monitoring. Since all these processes run on a single node, this model is not scalable.
  2. Issue of availability: the JobTracker is a single point of failure.
  3. Resource utilization: due to the predefined number of map and reduce task slots, resources are not utilized properly. When all the map slots are busy, the reduce slots sit idle and can't be used to process map tasks.
  4. Tight integration with the MapReduce framework: Hadoop 1.x can run MapReduce jobs only; support for jobs other than MapReduce jobs does not exist.

Now the single JobTracker bottleneck has been removed with the YARN architecture in Hadoop 2.x.

Ravindra babu
1

It looks like this link might be what you're looking for: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/.

My understanding is that YARN is supposed to be more generic. You can create your own YARN applications that negotiate directly with the ResourceManager for resources (1), and MapReduce is just one of several ApplicationMasters that already exist (2).
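As a rough sketch (the class names are hypothetical, not the real AMRMClient API), a YARN application boils down to a loop in which an ApplicationMaster asks the ResourceManager for containers and runs its own tasks in whatever it is granted:

```python
# Toy ApplicationMaster/ResourceManager negotiation (illustrative only;
# not the real org.apache.hadoop.yarn.client.api classes).
class ResourceManager:
    def __init__(self, free_containers):
        self.free = free_containers

    def allocate(self, requested):
        # Grant as many containers as are currently free.
        granted = min(requested, self.free)
        self.free -= granted
        return granted


class ApplicationMaster:
    def __init__(self, rm, tasks):
        self.rm = rm
        self.pending = tasks
        self.completed = 0

    def run(self):
        # Negotiate in rounds until all tasks are done.
        while self.pending:
            granted = self.rm.allocate(self.pending)
            if granted == 0:
                break  # a real AM would wait and re-request
            # "Run" one task per granted container, then release them.
            self.pending -= granted
            self.completed += granted
            self.rm.free += granted
        return self.completed


rm = ResourceManager(free_containers=3)
am = ApplicationMaster(rm, tasks=10)
assert am.run() == 10  # 10 tasks completed over several allocation rounds
```

In this picture the MapReduce framework is just one possible ApplicationMaster; a BSP or Spark framework would run the same negotiation loop with its own task logic.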

Ben McCracken