Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of second generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications including next generation MapReduce (MR2).

In the Big Data business, running fewer, larger clusters is cheaper than running many small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last 5 years there have been spot fixes, but lately these have come at an ever-growing cost, as evidenced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood: as far back as late 2007, we documented the proposed fix on MapReduce's JIRA, MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any minor or major change, such as bug fixes, performance improvements and features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive cycles as customers validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN Architecture

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classic sense of MapReduce jobs, or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
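Assuming access to a running cluster, the division of labor described above can be observed directly with the standard `yarn` CLI (the application id below is a hypothetical placeholder):

```shell
# Ask the ResourceManager for its global view of running applications
yarn application -list -appStates RUNNING

# List the NodeManagers registered with the ResourceManager
yarn node -list -all

# Fetch aggregated container logs for one finished application,
# including the logs of its ApplicationMaster container
yarn logs -applicationId application_1510000000000_0001
```

Note that per-application state (task progress, failures) lives with the ApplicationMaster, which is why the logs command is the way to inspect it after the fact.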

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. It also offers no guarantees about restarting failed tasks, whether the failure is due to the application or to hardware.

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource request types that represent the resources required for its containers. The resource requests include memory, CPU, disk, network etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. Scheduler plug-ins can be based, e.g., on the current CapacityScheduler and FairScheduler.
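As an illustration, hierarchical queues with guaranteed capacities might be configured for the CapacityScheduler plug-in roughly as follows in `capacity-scheduler.xml` (the queue names `prod` and `dev` are hypothetical):

```xml
<configuration>
  <!-- Two top-level queues under root -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <!-- Guarantee 70% of cluster resources to prod... -->
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <!-- ...and 30% to dev -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

Capacities at each level of the hierarchy must sum to 100; applications are then submitted to a leaf queue by name.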

The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring progress, and handling task failures.

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.
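For instance, one of the classic bundled MapReduce example jobs runs on a YARN cluster with the familiar command (the jar path varies by distribution and version, so treat it as illustrative):

```shell
# Submit the bundled pi estimator; on a YARN cluster this runs as an
# MRv2 application with its own ApplicationMaster, no code changes needed
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100
```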

3897 questions
13
votes
13 answers

Any command to get active namenode for nameservice in hadoop?

The command hdfs haadmin -getServiceState machine-98 works only if you know the machine name. Is there any command like hdfs haadmin -getServiceState that can tell you the IP/hostname of the active namenode?
Dragonborn
  • 1,755
  • 1
  • 16
  • 37
13
votes
1 answer

Spark Indefinite Waiting with "Asked to send map output locations for shuffle"

My jobs often hang with this kind of message: 14/09/01 00:32:18 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@*:37619 Would be great if someone could explain what Spark is doing when it spits out…
thegeek
  • 232
  • 2
  • 8
13
votes
1 answer

Differences between hadoop jar and yarn -jar

What's the difference between running a jar file with the commands "hadoop jar" and "yarn -jar"? I've used the "hadoop jar" command on my Mac successfully, but I want to be sure that the execution is correct and parallel on my four cores. Thanks!
mrcf
  • 149
  • 2
  • 9
12
votes
7 answers

Yarn add raise error Missing list of packages to add to your project

After reinstall of my Kubuntu 18 I tried to run my @vue/cli 4.0.5 / vuex 3 app and got error : error Missing list of packages to add to your project serge@AtHome:/mnt/_work_sdb8/wwwroot/lar/VApps/vtasks$ node…
mstdmstd
  • 2,195
  • 17
  • 63
  • 140
12
votes
1 answer

NoClassDefFoundError org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollectorManager

Getting this error when I executed the start-all.cmd command. Also I am unable to access http://localhost:8088 but I am able to access http://localhost:9870. The error code below is from the Resource Manager command prompt: FATAL…
Hitesh Somani
  • 620
  • 4
  • 11
  • 16
12
votes
2 answers

Setting YARN queue in PySpark

When creating a Spark context in PySpark, I typically use the following code: conf = (SparkConf().setMaster("yarn-client").setAppName(appname) .set("spark.executor.memory", "10g") .set("spark.executor.instances", "7") …
Tim
  • 2,756
  • 1
  • 15
  • 31
12
votes
1 answer

How to get the progress bar (with stages and tasks) with yarn-cluster master?

When running a Spark Shell query using something like this: spark-shell yarn --name myQuery -i ./my-query.scala Inside my query is a simple Spark SQL query where I read parquet files and run simple queries and write out parquet files. When running…
swdev
  • 2,941
  • 2
  • 25
  • 37
12
votes
1 answer

How to choose the queue for Spark job using spark-submit?

Is there a way to provide parameters or settings to choose the queue in which I'd like my spark-submit job to run?
asarapure
  • 605
  • 1
  • 6
  • 18
12
votes
2 answers

Amazon EMR - how to set a timeout for a step

Is there a way to set a timeout for a step in Amazon AWS EMR? I'm running a batch Apache Spark job on EMR and I would like the job to stop with a timeout if it doesn't end within 3 hours. I cannot find a way to set a timeout not in Spark, nor in…
Erica
  • 1,608
  • 2
  • 21
  • 32
12
votes
1 answer

How to solve yarn container sizing issue on spark?

I want to launch some pyspark jobs on YARN. I have 2 nodes, with 10 GB each. I am able to open up the pyspark shell like so: pyspark Now when I have a very simple example that I try to launch: import random NUM_SAMPLES=1000 def inside(p): x,…
makansij
  • 9,303
  • 37
  • 105
  • 183
12
votes
3 answers

yarn uninitialized constant Socket::SOL_TCP

I'm trying to use yarn here and got into a problem that might be related to ruby. On executing any yarn command, I get the error .../.rvm/gems/ruby-2.3.0/gems/yarn-0.1.1/lib/yarn/server.rb:14:in ': uninitialized constant…
Guilherme
  • 503
  • 4
  • 15
12
votes
3 answers

spark on yarn, Container exited with a non-zero exit code 143

I am using HDP 2.5, running spark-submit as yarn cluster mode. I have tried to generate data using dataframe cross join. i.e val generatedData = df1.join(df2).join(df3).join(df4) generatedData.saveAsTable(...).... df1 storage level is…
David H
  • 1,346
  • 3
  • 16
  • 29
12
votes
3 answers

Livy Server on Amazon EMR hangs on Connecting to ResourceManager

I'm trying to deploy a Livy Server on Amazon EMR. First I built the Livy master branch mvn clean package -Pscala-2.11 -Pspark-2.0 Then, I uploaded it to the EMR cluster master. I set the following…
matheusr
  • 567
  • 9
  • 29
12
votes
4 answers

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark). I can submit jobs and they run successfully, however they never seem to run on more than one machine (the local…
aaa90210
  • 11,295
  • 13
  • 51
  • 88
12
votes
1 answer

Running Spark on YARN in yarn-cluster mode: Where does the console output go?

I followed this page and ran the SparkPi example application on YARN in yarn-cluster mode. http://spark.apache.org/docs/latest/running-on-yarn.html I don't see the output of the program at the end (which is the result of the computation in this…
Muffintop
  • 543
  • 4
  • 11