Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of the second-generation Apache Hadoop infrastructure. DO NOT USE THIS TAG for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications, including next-generation MapReduce (MRv2).

In the Big Data business, running fewer, larger clusters is cheaper than running many small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last 5 years there have been spot fixes; however, lately these have come at an ever-growing cost, as evidenced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood - even as far back as late 2007, when we documented the proposed fix on MapReduce's JIRA: MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any minor or major change, such as bug fixes, performance improvements and new features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive cycles as customers validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN Architecture

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classical sense of MapReduce jobs or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
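
To make the division of labor concrete, here is a minimal, illustrative sketch in Java against the YARN client API: it asks the ResourceManager for a new application and submits a container spec in which the RM will launch the ApplicationMaster. The application name, queue and placeholder shell command are assumptions for the example, not part of any particular framework.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MinimalYarnSubmit {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("minimal-yarn-app"); // illustrative name

            // Container in which the RM will launch our ApplicationMaster.
            // A real AM is a framework library (e.g. the MapReduce AM); here a
            // shell command stands in as a placeholder.
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),                 // local resources (jars, files)
                    Collections.emptyMap(),                 // environment
                    Collections.singletonList("sleep 30"),  // AM command (placeholder)
                    null, null, null);
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore for the AM
            ctx.setQueue("default");                       // assumes a queue named "default"

            ApplicationId appId = yarnClient.submitApplication(ctx);
            System.out.println("Submitted " + appId);
            yarnClient.stop();
        }
    }

A real framework would additionally ship its ApplicationMaster jar as a LocalResource and point the launch command at it.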

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees on restarting tasks that fail, whether due to application failure or hardware failures.
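
For illustration, a short sketch that walks the queue hierarchy via the YarnClient API and prints each queue's guaranteed share; the top-level queue name "root" follows the CapacityScheduler convention and is an assumption about the cluster's configuration.

    import org.apache.hadoop.yarn.api.records.QueueInfo;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class PrintQueueTree {
        // Recursively print each queue with its guaranteed and current share.
        static void print(QueueInfo q, String indent) {
            System.out.printf("%s%s guaranteed=%.0f%% current=%.0f%%%n",
                    indent, q.getQueueName(),
                    q.getCapacity() * 100, q.getCurrentCapacity() * 100);
            for (QueueInfo child : q.getChildQueues()) {
                print(child, indent + "  ");
            }
        }

        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();
            print(yarnClient.getQueueInfo("root"), ""); // "root" is the conventional top queue
            yarnClient.stop();
        }
    }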

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource-request types that represent the resources required for containers. The resource requests include memory, CPU, disk, network, etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, etc. Scheduler plug-ins can be based, for example, on the current CapacityScheduler and FairScheduler.
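
At the API level, such a request is expressed as a Resource capability plus optional placement and priority hints, not a slot type. A minimal sketch using the AMRMClient request record (the 2048 MB / 2 vcore values are arbitrary):

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class ResourceRequestExample {
        public static void main(String[] args) {
            // A request names an amount of capability, not a fixed map/reduce
            // slot: here, a container of 2048 MB of memory and 2 virtual cores.
            Resource capability = Resource.newInstance(2048, 2);

            // nodes and racks may be null, meaning "anywhere in the cluster";
            // the priority orders requests within one application.
            ContainerRequest request = new ContainerRequest(
                    capability, /* nodes */ null, /* racks */ null,
                    Priority.newInstance(1));
            System.out.println(request);
        }
    }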

The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring their progress, and handling task failures.
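
A condensed sketch of that lifecycle using the AMRMClient and NMClient libraries; registration details, failure handling and completion tracking are pared down to the bare loop, and the launched shell command is a placeholder:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SketchApplicationMaster {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
            rm.init(conf);
            rm.start();
            rm.registerApplicationMaster("", 0, ""); // host/port/tracking URL unused here

            NMClient nm = NMClient.createNMClient();
            nm.init(conf);
            nm.start();

            // Negotiate one container from the Scheduler.
            rm.addContainerRequest(new ContainerRequest(
                    Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

            int launched = 0;
            while (launched < 1) {
                AllocateResponse response = rm.allocate(0.0f); // heartbeat + pick up grants
                for (Container container : response.getAllocatedContainers()) {
                    // Ask the container's NodeManager to launch our task
                    // (placeholder command).
                    nm.startContainer(container, ContainerLaunchContext.newInstance(
                            Collections.emptyMap(), Collections.emptyMap(),
                            Collections.singletonList("sleep 10"), null, null, null));
                    launched++;
                }
                Thread.sleep(1000);
            }

            rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", null);
        }
    }

Production ApplicationMasters typically drive the same loop through the asynchronous AMRMClientAsync variant with callbacks rather than polling.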

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.
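
For instance, a driver written against the stable org.apache.hadoop.mapreduce API, like the identity pass-through job sketched below, should run on MRv2 without source changes once recompiled against the hadoop-2.x artifacts:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PassThroughJob {
        public static void main(String[] args) throws Exception {
            // The same Job code that ran on the hadoop-1.x JobTracker; on MRv2
            // it is executed by a per-job MapReduce ApplicationMaster instead.
            Job job = Job.getInstance(new Configuration(), "pass-through");
            job.setJarByClass(PassThroughJob.class);
            // The default Mapper and Reducer are identity functions, so this
            // job simply copies its input records to the output.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }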

3897 questions
21 votes • 4 answers

Spark resources not fully allocated on Amazon EMR

I'm trying to maximize cluster usage for a simple task. The cluster is 1+2 x m3.xlarge, running Spark 1.3.1, Hadoop 2.4, Amazon AMI 3.7. The task reads all lines of a text file and parses them as CSV. When I spark-submit a task in yarn-cluster mode, I…
Michel Lemay • 2,054
20 votes • 1 answer

Why does Spark job fail with "Exit code: 52"

I have had a Spark job fail with a trace like this one: ./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-Container id:…
Virgil • 3,022
20 votes • 2 answers

YARN not preempting resources based on fair shares when running a Spark job

I have a problem with re-balancing Apache Spark job resources on YARN fair-scheduled queues. For the tests I've configured Hadoop 2.6 (tried 2.7 also) to run in pseudo-distributed mode with local HDFS on macOS. For job submission I used "Pre-build…
20 votes • 2 answers

How to add configuration file to classpath of all Spark executors in Spark 1.2.0?

I'm using Typesafe Config, https://github.com/typesafehub/config, to parameterize a Spark job running in yarn-cluster mode with a configuration file. The default behavior of Typesafe Config is to search the classpath for resources with names…
MawrCoffeePls • 703
20 votes • 4 answers

How to extract application ID from the PySpark context

A previous question recommends sc.applicationId, but it is not present in PySpark, only in Scala. So, how do I figure out the application ID (for YARN) of my PySpark process?
sds • 58,617
20 votes • 2 answers

Where does Hadoop store the logs of YARN applications?

I ran Hortonworks' basic YARN application example. The application fails and I want to read the logs to figure out why. But I can't find any files at the expected location (/HADOOP_INSTALL_FOLDER/logs) where the logs of my mapreduce…
padmalcom • 1,156
19 votes • 2 answers

EMR Spark - TransportClient: Failed to send RPC

I'm getting this error. I tried increasing memory on the cluster instances and in the executor and driver parameters, without success. 17/05/07 23:17:07 ERROR TransportClient: Failed to send RPC 6465703946954088562 to…
Luis Sobrecueva • 680
19 votes • 2 answers

Spark on YARN resource manager: Relation between YARN Containers and Spark Executors

I'm new to Spark on YARN and don't understand the relation between the YARN Containers and the Spark Executors. I tried out the following configuration, based on the results of the yarn-utils.py script, that can be used to find optimal cluster…
19 votes • 1 answer

Aggregate Resource Allocation for a job in YARN

I am new to Hadoop. When I run a job, I see the aggregate resource allocation for that job as 251248654 MB-seconds, 24462 vcore-seconds. However, when I look at the details of the cluster, it shows there are 888 Vcores-total and 15.90 TB…
blackfury • 675
19 votes • 1 answer

How to keep YARN's log files?

Suddenly, my YARN cluster has stopped working: everything I submit fails with "Exit code 1". I want to track down the problem, but as soon as an application fails, YARN deletes the log files. What is the configuration setting I have to adjust for…
rabejens • 7,594
19 votes • 5 answers

Issue running Spark job on YARN cluster

I want to run my Spark job in Hadoop YARN cluster mode, and I am using the following command: spark-submit --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 …
Sachin Singh • 739
19 votes • 11 answers

Yarn MapReduce Job Issue - AM Container launch error in Hadoop 2.3.0

I have set up a 2-node cluster of Hadoop 2.3.0. It's working fine and I can successfully run the distributedshell-2.2.0.jar example. But when I try to run any MapReduce job I get an error. I have set up MapRed.xml and other configs for running a MapReduce job…
TonyMull • 271
18 votes • 1 answer

Upload zip file using --archives option of spark-submit on yarn

I have a directory with some model files, and my application has to access these model files in the local file system for certain reasons. Of course I know that the --files option of spark-submit can upload files to the working directory of each executor and…
Mo Tao • 1,225
18 votes • 5 answers

How to specify which java version to use in spark-submit command?

I want to run a Spark Streaming application on a YARN cluster on a remote server. The default Java version is 1.7, but I want to use 1.8 for my application, which is also present on the server but is not the default. Is there a way to specify through…
Priyanka • 261
18 votes • 2 answers

How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?

I have the following code, which fires hiveContext.sql() most of the time. My task is to create a few tables and insert values into them after processing all Hive table partitions. So I first fire show partitions and, using its output in a…
Umesh K • 13,436