Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of the second-generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications, including next-generation MapReduce (MRv2).

In the big data business, running fewer, larger clusters is cheaper than running many small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last five years there have been spot fixes, but lately these have come at an ever-growing cost, as evidenced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood; even as far back as late 2007, we documented the proposed fix on MapReduce's JIRA: MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, minor or major, such as bug fixes, performance improvements and new features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive customer cycles as they validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN Architecture

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classic MapReduce sense or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
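
To make this split concrete, here is a minimal, hypothetical client-side sketch using the public YarnClient API (hadoop-yarn-client): the client only tells the ResourceManager how to launch the application's ApplicationMaster container; everything after that is negotiated by the ApplicationMaster itself. The class name com.example.MyApplicationMaster, the application name and the resource sizes are illustrative placeholders, not part of YARN.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitToResourceManager {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("demo-app");
            appContext.setQueue("default");

            // The ResourceManager only needs to know how to start the ApplicationMaster;
            // com.example.MyApplicationMaster is a hypothetical framework-specific AM class.
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),   // local resources (jars, files)
                    Collections.emptyMap(),   // environment variables
                    Collections.singletonList("$JAVA_HOME/bin/java com.example.MyApplicationMaster"),
                    null, null, null);
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the AM

            ApplicationId appId = yarnClient.submitApplication(appContext);
            System.out.println("Submitted application " + appId);
        }
    }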

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. It also offers no guarantees about restarting failed tasks, whether the failure is caused by the application or by the hardware.

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource-request types that represent the resources required for its containers. The resource requests include memory, CPU, disk, network, etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, etc. Scheduler plug-ins can be based, for example, on the current CapacityScheduler and FairScheduler.
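
As a small illustration of this multi-dimensional resource model (in contrast to fixed map/reduce slots), an ApplicationMaster built on the public AMRMClient library expresses its needs as a capability of memory plus virtual cores; the 4 GB / 4 vcore figures below are arbitrary example values:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class ResourceRequestSketch {
        public static void main(String[] args) {
            // A container request is multi-dimensional: memory (MB) and virtual cores,
            // rather than a fixed-size map or reduce slot.
            Resource capability = Resource.newInstance(4096, 4);
            ContainerRequest request = new ContainerRequest(
                    capability,
                    null,                      // no node-locality preference
                    null,                      // no rack-locality preference
                    Priority.newInstance(1));
            System.out.println("Would ask the scheduler for: " + request.getCapability());
        }
    }

(In stock Hadoop 2.x the Resource record carries memory and virtual cores; disk and network belong to the longer-term resource model described above.)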

The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring their progress, and handling task failures.
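
A stripped-down, hypothetical ApplicationMaster built on the AMRMClient and NMClient libraries might look roughly like the sketch below: it registers with the ResourceManager, negotiates one container, hands the NodeManager a launch context for a trivial command, and waits for the container to finish. A real ApplicationMaster would also inspect the completed-container statuses and re-request containers for failed tasks.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.ContainerStatus;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MinimalApplicationMaster {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            // Client for talking to the ResourceManager (resource negotiation).
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(conf);
            rmClient.start();
            rmClient.registerApplicationMaster("", 0, "");   // host, RPC port, tracking URL

            // Client for talking to NodeManagers (container launch).
            NMClient nmClient = NMClient.createNMClient();
            nmClient.init(conf);
            nmClient.start();

            // Ask the scheduler for one 1 GB / 1 vcore container.
            rmClient.addContainerRequest(new ContainerRequest(
                    Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

            int completed = 0;
            while (completed < 1) {
                // Heartbeat to the ResourceManager; allocations and completions come back here.
                AllocateResponse response = rmClient.allocate(0.1f);

                for (Container container : response.getAllocatedContainers()) {
                    ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                            Collections.emptyMap(), Collections.emptyMap(),
                            Collections.singletonList("sleep 5"),   // the "task" to run
                            null, null, null);
                    nmClient.startContainer(container, ctx);
                }
                for (ContainerStatus status : response.getCompletedContainersStatuses()) {
                    completed++;   // a real AM would check status.getExitStatus() and retry failures
                }
                Thread.sleep(1000);
            }

            rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
        }
    }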

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.
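
For instance, a standard word-count driver written against the org.apache.hadoop.mapreduce Job API, sketched below, runs the same way on MRv1 and on MRv2/YARN; which framework executes it is decided by the cluster's mapreduce.framework.name setting, not by the job code. This is a generic illustration, not a verbatim excerpt from the Hadoop distribution.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Emits (word, 1) for every token in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // The same driver runs on MRv1 or YARN; the cluster configuration decides.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }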

3897 questions
18
votes
4 answers

Apache Spark YARN mode startup takes too long (10+ secs)

I’m running a Spark application in YARN-client or YARN-cluster mode, but it seems to take too long to start up. It takes 10+ seconds to initialize the Spark context. Is this normal? Or can it be optimized? The environment is as follows: Hadoop:…
zeodtr
  • 10,645
  • 14
  • 43
  • 60
18
votes
1 answer

yarn is not honouring yarn.nodemanager.resource.cpu-vcores

I am using Hadoop-2.4.0 and my system configs are 24 cores, 96 GB RAM. I am using following…
banjara
  • 3,800
  • 3
  • 38
  • 61
18
votes
1 answer

Slurm: What is the difference for code executing under salloc vs srun

I'm using a cluster managed by slurm to run some yarn/hadoop benchmarks. To do this I am starting the hadoop servers on nodes allocated by slurm and then running the benchmarks on them. I realize that this is not the intended way to run a production…
Daniel Goodman
  • 273
  • 1
  • 2
  • 7
17
votes
1 answer

Why increase spark.yarn.executor.memoryOverhead?

I am trying to join two large spark dataframes and keep running into this error: Container killed by YARN for exceeding memory limits. 24 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. This seems like a…
Fortunato
  • 567
  • 6
  • 18
17
votes
2 answers

Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark

I am new to apache-spark. I have tested some applications in Spark standalone mode, but I want to run applications in YARN mode. I am running apache-spark 2.1.0 on Windows. Here is my code: c:\spark>spark-submit2 --master yarn --deploy-mode client…
Kalyan
  • 1,880
  • 11
  • 35
  • 62
17
votes
3 answers

Spark Launcher waiting for job completion infinitely

I am trying to submit a JAR with Spark job into the YARN cluster from Java code. I am using SparkLauncher to submit SparkPi example: Process spark = new SparkLauncher() …
TomaszGuzialek
  • 861
  • 1
  • 8
  • 15
17
votes
3 answers

Spark off heap memory leak on Yarn with Kafka direct stream

I am running spark streaming 1.4.0 on Yarn (Apache distribution 2.6.0) with java 1.8.0_45 and also Kafka direct stream. I am also using spark with scala 2.11 support. The issue I am seeing is that both driver and executor containers are gradually…
17
votes
1 answer

Apache Helix vs YARN

What is the difference between Apache Helix and Hadoop YARN (MRv2)? Does anyone have experience with both technologies? Can someone explain to me the advantages/disadvantages of Helix over YARN and why the LinkedIn guys developed their own cluster…
Tobi
  • 219
  • 2
  • 6
17
votes
5 answers

What additional benefit does Yarn bring to the existing map reduce?

Yarn differs in its infrastructure layer from the original map reduce architecture in the following way: In YARN, the job tracker is split into two different daemons called Resource Manager and Node Manager (node specific). The resource manager…
Abhishek Jain
  • 4,478
  • 8
  • 34
  • 51
16
votes
5 answers

Spark: get number of cluster cores programmatically

I run my Spark application in a YARN cluster. In my code I use the number of available cores of the queue for creating partitions on my dataset: Dataset ds = ... ds.coalesce(config.getNumberOfCores()); My question: how can I get the number of available cores of the queue…
Rougher
  • 834
  • 5
  • 19
  • 46
16
votes
1 answer

How to change yarn scheduler configuration on aws EMR?

Unlike HortonWorks or Cloudera, AWS EMR does not seem to give any GUI to change xml configurations of various hadoop ecosystem frameworks. Logging into my EMR namenode and doing a quick find \ -iname yarn-site.xml I was able to find it to be…
Kumar Vaibhav
  • 2,632
  • 8
  • 32
  • 54
16
votes
4 answers

What is the correct way to start/stop spark streaming jobs in yarn?

I have been experimenting and googling for many hours, with no luck. I have a spark streaming app that runs fine in a local spark cluster. Now I need to deploy it on cloudera 5.4.4. I need to be able to start it, have it run in the background…
Kevin Pauli
  • 8,577
  • 15
  • 49
  • 70
16
votes
6 answers

YARN Resourcemanager not connecting to nodemanager

Thanks in advance for any help. I am running the following versions: Hadoop 2.2, Zookeeper 3.4.5, HBase 0.96, Hive 0.12. When I go to http://:50070 I am able to correctly see that 2 nodes are running. The problem is when I go to http://:8088 it shows 0…
Aman Chawla
  • 704
  • 2
  • 8
  • 25
15
votes
2 answers

Spark Driver memory and Application Master memory

Am I understanding the documentation for client mode correctly? client mode is opposed to cluster mode where the driver runs within the application master? In client mode the driver and application master are separate processes and therefore…
user782220
  • 10,677
  • 21
  • 72
  • 135
15
votes
1 answer

--files option in pyspark not working

I tried sc.addFile option (working without any issues) and --files option from the command line (failed). Run 1 : spark_distro.py from pyspark import SparkContext, SparkConf from pyspark import SparkFiles def import_my_special_package(x): from…
goks
  • 1,196
  • 3
  • 18
  • 37