Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of the second-generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big-data applications, including next-generation MapReduce (MR2).

In the Big Data business, running fewer, larger clusters is cheaper than running many small ones. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last five years there have been spot fixes, but lately these have come at an ever-growing cost, as evidenced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood, even as far back as late 2007, when we documented the proposed fix on MapReduce's JIRA: MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, minor or major, such as bug fixes, performance improvements and features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive cycles as customers validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN architecture

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classic MapReduce sense or a DAG of such jobs. The ResourceManager and the per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
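To make the client-side view of this split concrete, here is a minimal sketch of submitting an application through the YarnClient API; the queue name, container size and launch command are illustrative assumptions, not part of the original text.

    import org.apache.hadoop.yarn.api.records.*;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    import java.util.Collections;

    public class MinimalYarnSubmit {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("minimal-yarn-app");

            // Describe the container that will host the ApplicationMaster.
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                    "/bin/echo hello-from-am"));   // illustrative command, an assumption
            ctx.setAMContainerSpec(amContainer);

            // Resources for the AM container itself.
            ctx.setResource(Resource.newInstance(512 /* MB */, 1 /* vcores */));
            ctx.setQueue("default");               // assumed queue name

            // The RM only schedules; the NodeManager on some machine launches the AM.
            yarnClient.submitApplication(ctx);
        }
    }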

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. Nor does it offer guarantees about restarting failed tasks, whether the failure is due to the application or to hardware.

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource-request types that represent the resources required for its containers. The resource requests include memory, CPU, disk, network, etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, etc. Scheduler plug-ins can be based, for example, on the current CapacityScheduler and FairScheduler.
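As a sketch of what such a resource request looks like at the API level (the priority, sizes and container count are arbitrary assumptions):

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceRequest;

    public class ResourceRequestSketch {
        public static void main(String[] args) {
            // A container "capability" replaces the old fixed map/reduce slot:
            // the application states exactly how much memory and CPU it needs.
            Resource capability = Resource.newInstance(2048 /* MB */, 2 /* vcores */);

            // Ask for 10 such containers anywhere in the cluster ("*" = any host).
            ResourceRequest request = ResourceRequest.newInstance(
                    Priority.newInstance(1),  // assumed priority
                    ResourceRequest.ANY,      // no locality constraint
                    capability,
                    10);                      // assumed container count

            System.out.println("Requesting: " + request);
        }
    }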

The NodeManager is the per-machine framework agent responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring their progress, and handling task failures.
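That negotiation goes through the AMRMClient API; a minimal sketch of the allocate loop, where the container size and the empty host/tracking-URL values are placeholder assumptions:

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MinimalAppMaster {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new YarnConfiguration());
            rmClient.start();

            // Register with the ResourceManager (host/port/URL are placeholders).
            rmClient.registerApplicationMaster("", 0, "");

            // Ask the Scheduler for one 1 GB / 1 vcore container, any host.
            Resource capability = Resource.newInstance(1024, 1);
            rmClient.addContainerRequest(
                    new ContainerRequest(capability, null, null, Priority.newInstance(0)));

            // Heartbeat until the container is granted; a real AM would then
            // hand the allocation to an NMClient to launch and monitor the task.
            while (true) {
                AllocateResponse response = rmClient.allocate(0.0f);
                for (Container c : response.getAllocatedContainers()) {
                    System.out.println("Granted container: " + c.getId());
                }
                if (!response.getAllocatedContainers().isEmpty()) break;
                Thread.sleep(1000);
            }

            rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        }
    }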

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.
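As an illustration of that compatibility claim, a classic driver written against the stable org.apache.hadoop.mapreduce API needs no source changes to run on MRv2; the class and path names below are assumptions for the sketch, and the mapper/reducer are assumed to exist elsewhere:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Mapper/Reducer classes are assumed to exist elsewhere in the project:
            // job.setMapperClass(TokenizerMapper.class);
            // job.setReducerClass(IntSumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // On MRv2 the framework submits this through the ResourceManager;
            // the job code itself is unchanged from hadoop-1.x.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }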

3897 questions
1 vote · 1 answer

Long scheduler delay in Spark UI

I am running PySpark jobs on a 2.3.0 cluster on YARN. I see that all the stages have a very long scheduler delay. BUT, it is just the max time; the 75th percentile is 28 ms. All the other time metrics are very low (GC time, task deserialization, …
user1450410
1 vote · 1 answer

Hadoop3: worker node error connecting to ResourceManager

I have a 3-node Hadoop cluster (DigitalOcean droplets): hadoop-master is configured as both namenode and datanode; hadoop-worker1 and hadoop-worker2 are configured as datanodes. Whenever I run a MapReduce streaming job and a worker node gets…
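One configuration knob often involved in worker-to-RM connection problems (offered here as a hedged aside, not a confirmed diagnosis of this question) is yarn.resourcemanager.hostname, which defaults to 0.0.0.0; a minimal sketch of checking and setting it through the Java Configuration API, where the hostname is an assumption taken from the question's node names:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class CheckRmAddress {
        public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();

            // Defaults to "0.0.0.0", which workers cannot use to reach a remote RM.
            System.out.println("RM hostname: "
                    + conf.get(YarnConfiguration.RM_HOSTNAME, "0.0.0.0"));

            // In yarn-site.xml on every node this would be:
            //   yarn.resourcemanager.hostname = hadoop-master   (assumed host name)
            conf.set(YarnConfiguration.RM_HOSTNAME, "hadoop-master");
        }
    }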
1 vote · 1 answer

Is it required to install Spark on all the nodes of the cluster?

I am new to Spark and learning the architecture. I understood that Spark supports three cluster managers: YARN, Standalone and Mesos. In YARN cluster mode, the Spark driver resides in the Resource Manager and the executors in YARN's containers of Node…
Niketa
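For reference alongside this question, the cluster-manager choice is, from the application's point of view, just configuration; a minimal Java sketch, where the app name is an assumption and, in practice, these settings are usually passed to spark-submit rather than hard-coded:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class YarnClusterModeSketch {
        public static void main(String[] args) {
            // "yarn" selects the YARN cluster manager; the deploy mode decides
            // whether the driver runs in the client JVM or inside a YARN container.
            SparkConf conf = new SparkConf()
                    .setAppName("yarn-mode-sketch")          // assumed name
                    .setMaster("yarn")
                    .set("spark.submit.deployMode", "cluster");

            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                System.out.println("Running against: " + sc.sc().master());
            }
        }
    }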
1 vote · 1 answer

Simple Spark streaming app allocates all memory in the cluster - GCP Dataproc

A simple Spark streaming app without any heavy in-memory computation is consuming 17 GB of memory as soon as the STATE changes to RUNNING. Cluster setup: 1x master (2 vCPU, 13.0 GB memory), 2x workers (2 vCPU, 13.0 GB memory), YARN resource…
1 vote · 1 answer

Query node-label topology from Yarn via REST API [MapR 6.1/Hadoop-2.7]

There is a Java and CLI interface to query the YARN RM for node-to-nodelabel (and inverse) mappings. Is there a way to do this via the REST API as well? An initial RM-API search revealed only node-label-based job submissions as an option. Sadly that is…
Rick Moritz
1 vote · 0 answers

What is the best way to judge the performance of components of a data pipeline?

I am working on optimizing a data pipeline that leverages Apache Spark, HDFS and YARN as the cluster manager. The Spark cluster consists of a limited number of internal machines that are shared across a variety of groups. Thus, building certain…
1 vote · 1 answer

Can a specific process be started inside a Map Task in a Hadoop Cluster?

I use a Hadoop & YARN cluster with one node. All Hadoop and YARN daemons are started on this node. I also start a fetch step with Apache Nutch 1.15 distributed crawl, with the inject and generate steps successfully finished. I am trying to run Firefox…
Iulian Barbu
1 vote · 0 answers

Spark submit on Yarn failing with error "Permission mismatch for caller"

I am trying to submit my Spark job to YARN but it keeps failing with the message: [2019-05-13 14:13:18.281]Application application_1557517779491_0093 initialization failed (exitCode=20) with output: main : command provided 0 main : run as user is…
Y0gesh Gupta
1 vote · 2 answers

Usage of StreamingFileSink is throwing NoClassDefFoundError

I know this could be my problem, but I have been trying to grapple with it for a while. I am trying to run Flink in an AWS EMR cluster. My setup is: time-series events from Kinesis -> Flink job -> save to S3. DataStream kinesis = …
Nischit
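For context on the sink the title refers to, a minimal, hedged sketch of wiring up a row-format StreamingFileSink, where the S3 path and the inline source stand in for the question's Kinesis stream:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    public class StreamingFileSinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for the Kinesis source in the question.
            DataStream<String> events = env.fromElements("event-1", "event-2");

            // Row-format sink writing each record as a line under the bucket path.
            StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path("s3://example-bucket/out"),  // assumed path
                                  new SimpleStringEncoder<String>("UTF-8"))
                    .build();

            events.addSink(sink);
            env.execute("streaming-file-sink-sketch");
        }
    }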
1 vote · 1 answer

Passing multiple typesafe config files to a yarn cluster mode application

I'm struggling a bit trying to use multiple (via include) Typesafe config files in my Spark application that I am submitting to a YARN queue in cluster mode. I basically have two config files and the file layouts are provided…
NicolasCage
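As background for this question, the Typesafe Config library merges files either through includes or through explicit fallbacks; a hedged sketch, with the file and key names being assumptions:

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;

    import java.io.File;

    public class MultiConfigSketch {
        public static void main(String[] args) {
            // application.conf may contain:  include "common.conf"
            // Alternatively, merge explicitly: keys in the first config win.
            Config appConf = ConfigFactory.parseFile(new File("application.conf")); // assumed name
            Config commonConf = ConfigFactory.parseFile(new File("common.conf"));   // assumed name
            Config merged = appConf.withFallback(commonConf).resolve();

            System.out.println(merged.getString("some.key")); // assumed key

            // In YARN cluster mode both files must be shipped with the app
            // (e.g. via spark-submit --files) so the driver container can read them.
        }
    }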
1 vote · 1 answer

On a Spark cluster, is there a parameter that controls the minimum run time of the Spark job?

My Spark program will first determine whether the input data path exists and, if it does not, exit safely. But after exiting, YARN will retry the job once. So, I guess one parameter controls the minimum run time of the job. On a Spark cluster, is there a…
shaokai.li
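A related knob worth noting next to this question: the retry the asker describes is governed by YARN application attempts, not by run time; a hedged sketch of capping the attempts, where the app name and value are assumptions:

    import org.apache.spark.SparkConf;

    public class AttemptConfigSketch {
        public static void main(String[] args) {
            // YARN re-runs a failed application up to
            // yarn.resourcemanager.am.max-attempts times; Spark lets an app
            // request a lower cap for itself.
            SparkConf conf = new SparkConf()
                    .setAppName("attempt-config-sketch")     // assumed name
                    .set("spark.yarn.maxAppAttempts", "1");  // no retry on failure

            System.out.println(conf.get("spark.yarn.maxAppAttempts"));
        }
    }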
1 vote · 0 answers

Increase yarn.scheduler.maximum-allocation-mb value in yarn-site.xml

The yarn.scheduler.maximum-allocation-mb value is set to 143360 MB in yarn-site.xml. I got the below error while running a PySpark job in Oozie. I want to increase its value in yarn-site.xml but I don't have permissions to increase its value. Is there a way I…
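For orientation, this property is a hard ResourceManager-side cap on any single container request; a hedged sketch of reading the effective value through the Java API:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MaxAllocationSketch {
        public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();

            // Any container request above this cap is rejected by the RM,
            // regardless of what the application asks for.
            int maxMb = conf.getInt(
                    YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_MB,
                    YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_MB);
            System.out.println("yarn.scheduler.maximum-allocation-mb = " + maxMb);
        }
    }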
1 vote · 0 answers

What happens to a Spark application requesting more memory than the cluster has?

If there is a Spark cluster with worker nodes of, say, x GB memory, and there are 5 such worker nodes, what would happen to an application if: 1. the driver memory requested in the application is > x GB; 2. driver memory + executor memory * number of…
Sayantan Ghosh
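The arithmetic behind the second case is worth spelling out; a hedged sketch with invented figures, noting that YARN adds a per-container memory overhead on top of each requested heap:

    public class ClusterMemorySketch {
        public static void main(String[] args) {
            // Invented example figures, not from the question.
            int workerGb = 16;        // x GB per worker
            int workers = 5;
            int driverGb = 4;
            int executorGb = 6;
            int executors = 12;
            double overhead = 0.10;   // spark.executor.memoryOverhead, ~10% of heap by default

            double totalAskGb = driverGb + executors * executorGb * (1 + overhead);
            int clusterGb = workerGb * workers;

            // If the ask exceeds the cluster, YARN does not fail the app outright:
            // containers it cannot place wait in the queue as pending (though any
            // single request above the per-container maximum is rejected).
            System.out.printf("ask=%.1f GB, cluster=%d GB, fits=%b%n",
                    totalAskGb, clusterGb, totalAskGb <= clusterGb);
        }
    }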
1 vote · 1 answer

Count operation resulting in more rack_local pyspark

I am trying to understand the locality level on a Spark cluster and its relationship with the RDD's number of partitions, along with the action performed on it. Specifically, I have a dataframe where the number of partitions is 9647. Then, I performed…
bohr
1 vote · 1 answer

Why can't I change "spark.driver.memory" value in AWS Elastic Map Reduce?

I want to tune my Spark cluster on AWS EMR and I couldn't change the default value of spark.driver.memory, which leads every Spark application to crash as my dataset is big. I tried editing the spark-defaults.conf file manually on the master…
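One detail worth keeping next to this question: spark.driver.memory sizes the driver JVM, so it must be fixed before that JVM starts (at submit time or in spark-defaults.conf); setting it inside an already-running application has no effect. A hedged sketch of where the value can legitimately be set, with the app name and size being assumptions:

    import org.apache.spark.SparkConf;

    public class DriverMemorySketch {
        public static void main(String[] args) {
            // Effective only if this conf is applied before the driver JVM launches,
            // i.e. via spark-submit --conf or spark-defaults.conf; mutating it from
            // inside a running driver is too late.
            SparkConf conf = new SparkConf()
                    .setAppName("driver-memory-sketch")   // assumed name
                    .set("spark.driver.memory", "8g");    // assumed size

            System.out.println(conf.get("spark.driver.memory"));
        }
    }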