Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of the second-generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications, including next-generation MapReduce (MR2).

In the Big Data business, running fewer, larger clusters is cheaper than running many small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last five years there have been spot fixes, but lately these have come at an ever-growing cost, as evidenced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood - documented as far back as late 2007 in the proposed fix on MapReduce's JIRA: MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, minor or major, such as bug fixes, performance improvements and new features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive cycles as each customer validates the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN Architecture

The fundamental idea of the re-architecture is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classical sense of MapReduce jobs or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
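For illustration, the client side of this flow can be written against the YarnClient API from hadoop-yarn-client. The sketch below asks the ResourceManager for a new application and submits an ApplicationMaster container; the queue name, resource sizes and the com.example.MyApplicationMaster class are illustrative assumptions, not part of the framework:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ApplicationId appId = ctx.getApplicationId();

        // Describe the container that will run the ApplicationMaster.
        // A real client would also ship jars (local resources) and environment.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java -Xmx512m com.example.MyApplicationMaster"));

        ctx.setApplicationName("demo-app");               // illustrative name
        ctx.setQueue("default");                          // illustrative queue
        ctx.setResource(Resource.newInstance(1024, 1));   // 1 GB, 1 vcore for the AM
        ctx.setAMContainerSpec(amContainer);

        yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
    }
}
```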

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. It also offers no guarantees about restarting failed tasks, whether the failure is due to the application or to hardware.
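For illustration, those queues and their configured capacities can be inspected programmatically with YarnClient (a minimal sketch, assuming a running cluster reachable via the local yarn-site.xml):

```java
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListQueues {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // getAllQueues() flattens the queue hierarchy into a single list
        for (QueueInfo q : yarnClient.getAllQueues()) {
            System.out.printf("queue=%s capacity=%.2f current=%.2f%n",
                    q.getQueueName(), q.getCapacity(), q.getCurrentCapacity());
        }
        yarnClient.stop();
    }
}
```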

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource request types that represent the resources required for its containers. The resource requests include memory, CPU, disk, network, etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, etc. Scheduler plug-ins can be based, for example, on the existing CapacityScheduler and FairScheduler.
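A minimal sketch of what such a multi-dimensional request looks like in code, using the AMRMClient library (the 2048 MB / 2 vcore sizes are arbitrary examples):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResourceRequestExample {
    public static ContainerRequest buildRequest() {
        // Instead of a fixed "map slot" or "reduce slot", a request names the
        // actual resources it needs: here 2048 MB of memory and 2 virtual cores.
        Resource capability = Resource.newInstance(2048, 2);
        Priority priority = Priority.newInstance(0);
        // null node/rack lists mean "place anywhere in the cluster"
        return new ContainerRequest(capability, null, null, priority);
    }
}
```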

The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.
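The per-node usage that the NodeManagers report back can also be observed from the client side; a small sketch that prints each node's total capability and current usage via YarnClient:

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // One report per RUNNING NodeManager: total capability and what is in use
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.printf("%s capability=%s used=%s containers=%d%n",
                    node.getNodeId(), node.getCapability(), node.getUsed(),
                    node.getNumContainers());
        }
        yarnClient.stop();
    }
}
```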

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring their progress, and handling task failures.
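A condensed sketch of that negotiation using the AMRMClient and NMClient libraries (the launch command, resource sizes and single-container loop are illustrative assumptions; error handling and local resources are omitted):

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SimpleApplicationMaster {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Heartbeat channel to the ResourceManager's scheduler
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // Channel to the NodeManagers for launching containers
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(conf);
        nmClient.start();

        // Ask for one 1 GB / 1 vcore container anywhere in the cluster
        rmClient.addContainerRequest(new ContainerRequest(
                Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        int launched = 0;
        while (launched < 1) {
            // allocate() is the heartbeat; the argument is the reported progress
            AllocateResponse response = rmClient.allocate(0.1f);
            for (Container container : response.getAllocatedContainers()) {
                ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
                // Illustrative task command; a real AM would also set up jars and env
                ctx.setCommands(Collections.singletonList("sleep 30"));
                nmClient.startContainer(container, ctx);
                launched++;
            }
            Thread.sleep(1000);
        }

        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    }
}
```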

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.
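For example, a classic WordCount driver written against the org.apache.hadoop.mapreduce API needs no source changes to run on MRv2; a minimal sketch using the stock TokenCounterMapper and IntSumReducer library classes (input and output paths are placeholders passed as arguments):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Stock library mapper/reducer: tokenize lines, sum the counts per token
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```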

3897 questions
15
votes
2 answers

Hadoop YARN: Get a list of available queues

Is there a way to get a list of all available YARN queues from the command line, without resorting to parsing the capacity-scheduler.xml file? I'm using Hadoop version 2.7.2
foglerit
  • 7,792
  • 8
  • 44
  • 64
15
votes
1 answer

How to submit a spark job on a remote master node in yarn client mode?

I need to submit spark apps/jobs onto a remote spark cluster. I have currently spark on my machine and the IP address of the master node as yarn-client. Btw my machine is not in the cluster. I submit my job with this command ./spark-submit --class…
Mnemosyne
  • 1,162
  • 4
  • 13
  • 45
15
votes
1 answer

Get a yarn configuration from commandline

In EMR, is there a way to get a specific value of the configuration given the configuration key using the yarn command? For example I would like to do something like this yarn get-config yarn.scheduler.maximum-allocation-mb
fo_x86
  • 2,583
  • 1
  • 30
  • 41
15
votes
7 answers

/bin/bash: /bin/java: No such file or directory error in Yarn apps in MacOS

I was trying to run a simple wordcount MapReduce program using Java 1.7 SDK and Hadoop 2.7.1 on Mac OS X El Capitan 10.11, and I am getting the following error message in my container log "stderr" /bin/bash: /bin/java: No such file or…
Gangadhar Kadam
  • 536
  • 1
  • 4
  • 15
15
votes
3 answers

Spark : multiple spark-submit in parallel

I have a generic question about Apache Spark : We have some spark streaming scripts that consume Kafka messages. Problem : they are failing randomly without a specific error... Some script does nothing while they are working when I run them…
Taoma_k
  • 303
  • 2
  • 3
  • 9
15
votes
3 answers

Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)

Help with implementation best practice is needed. The operating environment is as follows: log data files arrive irregularly. The size of a log data file is from 3.9KB to 8.5MB; the average is about 1MB. The number of records of a data file is…
zeodtr
  • 10,645
  • 14
  • 43
  • 60
14
votes
2 answers

ECONNREFUSED during 'next build'. Works fine with 'next dev'

I have a very simple NextJS 9.3.5 project. For now, it has a single pages/users and a single pages/api/users that retrieves all users from a local MongoDB table It builds fine locally using 'next dev' But, it fails on 'next build' with ECONNREFUSED…
user2821200
  • 153
  • 1
  • 1
  • 7
14
votes
5 answers

Why does my yarn application not have logs even with logging enabled?

I have enabled logs in the xml file: yarn-site.xml, and I restarted yarn by doing: sudo service hadoop-yarn-resourcemanager restart sudo service hadoop-yarn-nodemanager restart I ran my application, and then I see the applicationID in yarn…
makansij
  • 9,303
  • 37
  • 105
  • 183
14
votes
6 answers

"Bad substitution" when submitting spark job to yarn-cluster

I am doing a smoke test against a yarn cluster using yarn-cluster as the master with the SparkPi example program. Here is the command line: $SPARK_HOME/bin/spark-submit --master yarn-cluster --executor-memory 8G --executor-cores 240 --class…
WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560
14
votes
2 answers

How to deal with tasks running too long (comparing to others in job) in yarn-client?

We use a Spark cluster in yarn-client mode to run several business calculations, but sometimes a task runs for too long: we don't set a timeout, but I think the default timeout for a Spark task is not as long as this (1.7h). Can anyone give me an idea of how to work…
tnk_peka
  • 1,525
  • 2
  • 15
  • 25
14
votes
2 answers

How to get ID of a map task in Spark?

Is there a way to get the ID of a map task in Spark? For example, if each map task calls a user-defined function, can I get the ID of that map task from within that user-defined function?
MetallicPriest
  • 29,191
  • 52
  • 200
  • 356
14
votes
1 answer

Spark executor logs on YARN

I'm launching a distributed Spark application in YARN client mode, on a Cloudera cluster. After some time I see some errors on Cloudera Manager. Some executors get disconnected and this happens systematically. I would like to debug the issue but the…
Nicola Ferraro
  • 4,051
  • 5
  • 28
  • 60
13
votes
2 answers

How to kill an application from the ResourceManager Web UI

Is there a way of killing an application from the RM web UI instead of running yarn application -kill?
dimamah
  • 2,883
  • 18
  • 31
13
votes
2 answers

What is Memory reserved on Yarn

I managed to launch a Spark application on YARN. However, memory usage is kind of weird, as you can see below: https://i.stack.imgur.com/f89UP.jpg What does memory reserved mean? How can I manage to efficiently use all the memory available? Thanks…
Ludovic S
  • 185
  • 1
  • 2
  • 9
13
votes
5 answers

What is "Hadoop" - the definition of Hadoop?

It is kind of obvious and we will all agree that we can call HDFS + YARN + MapReduce as Hadoop. But what happens with different other combinations and other products in the Hadoop ecosystem? Is, for example, HDFS + YARN + Spark still Hadoop? Is…
neuromouse
  • 921
  • 1
  • 12
  • 32