Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of the second-generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications, including next-generation MapReduce (MR2).

In the Big Data business, running fewer, larger clusters is cheaper than running many small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility, and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability, and performance. Over the last 5 years there have been spot fixes; lately, however, these have come at an ever-growing cost, as evidenced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood: as far back as late 2007, we documented the proposed fix in MapReduce's JIRA: MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, minor or major, such as bug fixes, performance improvements, and features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes customers' expensive cycles as they validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN Architecture

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classic sense of MapReduce jobs, or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. It also offers no guarantees on restarting failed tasks, whether the failures are due to the application or to hardware.
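As an illustration of hierarchical queues with guaranteed capacities, a `capacity-scheduler.xml` fragment might look like the sketch below (the queue names `prod` and `dev` and the percentages are hypothetical examples, not defaults):

```xml
<!-- capacity-scheduler.xml: two hypothetical queues under root -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <!-- prod is guaranteed 70% of cluster resources -->
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <!-- dev gets the remaining 30% -->
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
```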

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource request types that represent the resources required for containers. The resource requests include memory, CPU, disk, network, etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among various queues, applications, etc. Scheduler plug-ins can be based, for example, on the current CapacityScheduler and FairScheduler.
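To make the "pure scheduler" idea concrete, here is a toy sketch (not YARN code; all names and numbers are illustrative) that only matches container requests against per-node capacity, with no status tracking and no restarts of failed containers:

```python
# Toy model of a "pure" scheduler: it only matches resource requests
# (memory in MB, vcores) to nodes with spare capacity. It does not track
# application status or restart failed containers. Illustrative only.

nodes = {"node1": {"mem": 8192, "vcores": 8},
         "node2": {"mem": 4096, "vcores": 4}}

def schedule(request, nodes):
    """Return the first node that can satisfy the request, or None."""
    for name, cap in nodes.items():
        if cap["mem"] >= request["mem"] and cap["vcores"] >= request["vcores"]:
            cap["mem"] -= request["mem"]       # reserve the resources
            cap["vcores"] -= request["vcores"]
            return name
    return None  # the request waits until capacity frees up

print(schedule({"mem": 6144, "vcores": 4}, nodes))  # node1
print(schedule({"mem": 6144, "vcores": 4}, nodes))  # None: no node has 6 GB free now
```

A real scheduler adds queues, preemption, and locality preferences on top of this matching step, but the division of labor is the same: everything else (monitoring, retries) belongs to the ApplicationMaster.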

The NodeManager is the per-machine framework agent responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring their progress, and handling task failures.

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.

3897 questions
43 votes · 4 answers

How to limit the number of retries on Spark job failure?

We are running a Spark job via spark-submit, and I can see that the job will be re-submitted in the case of failure. How can I stop it from having attempt #2 in case of yarn container failure or whatever the exception be? This happened due to lack…
asked by jk-kim
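For questions like this one, the setting usually cited is `spark.yarn.maxAppAttempts`, whose effective value is capped by YARN's `yarn.resourcemanager.am.max-attempts`. A `spark-defaults.conf` sketch:

```properties
# spark-defaults.conf: do not let YARN retry a failed application.
# The effective value is capped by yarn.resourcemanager.am.max-attempts
# in yarn-site.xml.
spark.yarn.maxAppAttempts  1
```

The same property can be passed per job as `spark-submit --conf spark.yarn.maxAppAttempts=1`.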
43 votes · 4 answers

Spark on yarn concept understanding

I am trying to understand how spark runs on YARN cluster/client. I have the following question in my mind. Is it necessary that spark is installed on all the nodes in yarn cluster? I think it should because worker nodes in cluster execute a task…
asked by Sporty
41 votes · 1 answer

Difference between `yarn.scheduler.maximum-allocation-mb` and `yarn.nodemanager.resource.memory-mb`?

What is difference between yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb? I see both of these in yarn-site.xml and I see the explanations here. yarn.scheduler.maximum-allocation-mb is given the following definition:…
asked by makansij
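The short version, as commonly summarized: `yarn.nodemanager.resource.memory-mb` is the total memory a NodeManager offers to all containers on that node combined, while `yarn.scheduler.maximum-allocation-mb` is the largest single container the scheduler will ever grant. A `yarn-site.xml` sketch (values are illustrative):

```xml
<!-- yarn-site.xml (illustrative values) -->
<property>
  <!-- total memory this node offers to all containers combined -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
<property>
  <!-- upper bound for any single container request -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```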
39 votes · 8 answers

Why does Hadoop report "Unhealthy Node local-dirs and log-dirs are bad"?

I am trying to setup a single-node Hadoop 2.6.0 cluster on my PC. On visiting http://localhost:8088/cluster, I find that my node is listed as an "unhealthy node". In the health report, it provides the error: 1/1 local-dirs are bad:…
asked by Ra41P
37 votes · 1 answer

Do exit codes and exit statuses mean anything in spark?

I see exit codes and exit statuses all the time when running spark on yarn: Here are a few: CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM ...failed 2 times due to AM Container for application_1431523563856_0001_000002 exited with …
asked by makansij
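One convention worth knowing when reading such logs: a process killed by signal N is reported by the shell with exit code 128 + N, so SIGTERM (signal 15) shows up as exit code 143. A quick Python demonstration (POSIX only):

```python
import signal
import subprocess

# Start a long-running child, kill it with SIGTERM, and inspect its status.
p = subprocess.Popen(["sleep", "30"])
p.send_signal(signal.SIGTERM)
p.wait()

# Popen.returncode is the negative signal number for a signal-killed child;
# shells (and many YARN logs) report this as 128 + signal = 143.
print(p.returncode)        # -15
print(128 - p.returncode)  # 143
```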
37 votes · 4 answers

How to know what is the reason for ClosedChannelExceptions with spark-shell in YARN client mode?

I have been trying to run spark-shell in YARN client mode, but I am getting a lot of ClosedChannelException errors. I am using spark 2.0.0 build for Hadoop 2.6. Here are the exceptions : $ spark-2.0.0-bin-hadoop2.6/bin/spark-shell --master yarn…
asked by aks
35 votes · 10 answers

Hadoop: Connecting to ResourceManager failed

After installing hadoop 2.2 and trying to launch pipes example ive got the folowing error (the same error shows up after trying to launch hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount someFile.txt /out): /usr/local/hadoop$ hadoop pipes…
asked by user3102852
34 votes · 5 answers

Permission Denied error while running start-dfs.sh

I am getting this error while performing start-dfs.sh Starting namenodes on [localhost] pdsh@Gaurav: localhost: rcmd: socket: Permission denied Starting datanodes pdsh@Gaurav: localhost: rcmd: socket: Permission denied Starting secondary namenodes…
asked by Gaurav A Dubey
34 votes · 5 answers

How to log using log4j to local file system inside a Spark application that runs on YARN?

I'm building an Apache Spark Streaming application and cannot make it log to a file on the local filesystem when running it on YARN. How can achieve this? I've set log4.properties file so that it can successfully write to a log file in /tmp…
asked by Emre Sevinç
34 votes · 5 answers

How can I access S3/S3n from a local Hadoop 2.6 installation?

I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now - 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster. I have added the…
asked by doublebyte
32 votes · 2 answers

Apache Hadoop Yarn - Underutilization of cores

No matter how much I tinker with the settings in yarn-site.xml i.e using all of the below…
asked by Abbas Gadhia
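For vcore underutilization with the CapacityScheduler, the change most often suggested is switching the resource calculator so that CPU is considered alongside memory when sizing containers; a `capacity-scheduler.xml` sketch:

```xml
<!-- capacity-scheduler.xml: consider CPU as well as memory when scheduling -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```

With the default `DefaultResourceCalculator`, only memory is used for allocation decisions, which is why vcore counts can look flat regardless of the requested cores.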
30 votes · 4 answers

How to restart yarn on AWS EMR

I am using Hadoop 2.6.0 (emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart yarn to bring the changes into effect. Is there a command using which I can do this?
asked by nish
30 votes · 2 answers

Understand Spark: Cluster Manager, Master and Driver nodes

Having read this question, I would like to ask additional questions: The Cluster Manager is a long-running service, on which node it is running? Is it possible that the Master and the Driver nodes will be the same machine? I presume that there…
asked by Rami
28 votes · 3 answers

Why does vcore always equal the number of nodes in Spark on YARN?

I have a Hadoop cluster with 5 nodes, each of which has 12 cores with 32GB memory. I use YARN as MapReduce framework, so I have the following settings with…
asked by Rui
27 votes · 1 answer

How does Spark running on YARN account for Python memory usage?

After reading through the documentation I do not understand how does Spark running on YARN account for Python memory consumption. Does it count towards spark.executor.memory, spark.executor.memoryOverhead or where? In particular I have a PySpark…
asked by domkck
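A commonly given answer: PySpark's Python worker processes run outside the executor JVM heap, so their usage is expected to fit in the off-heap overhead (`spark.executor.memoryOverhead`); Spark 2.4+ added a dedicated `spark.executor.pyspark.memory` cap. A `spark-defaults.conf` sketch (values are illustrative):

```properties
# spark-defaults.conf (illustrative values)
# JVM heap for the executor:
spark.executor.memory            4g
# Off-heap headroom; Python workers are expected to fit here (pre-2.4):
spark.executor.memoryOverhead    1g
# Spark 2.4+: explicit per-executor cap for Python worker memory:
spark.executor.pyspark.memory    1g
```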