Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of second generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications including next generation MapReduce (MR2).

In the Big Data business running fewer larger clusters is cheaper than running more small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last five years there have been spot fixes, but lately these have come at an ever-growing cost, as evidenced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood - even as far back as late 2007, when we documented the proposed fix on MapReduce's JIRA: MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, minor or major, such as bug fixes, performance improvements and new features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive cycles as customers validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN Architecture

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application’s scheduling and coordination. An application is either a single job in the classical sense of MapReduce jobs or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. It also offers no guarantees about restarting failed tasks, whether they fail due to application errors or hardware failures.
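To make the hierarchical-queue model concrete, here is a minimal sketch of how nested queues could partition a cluster's capacity by guaranteed percentage. The queue names, percentages and "resource unit" total are invented for illustration; the real schedulers are configured through XML properties, not code like this.

```python
# Hypothetical sketch of hierarchical queue capacity partitioning.
# Queue names and percentages below are made up for illustration.

def effective_capacity(tree, total, path=""):
    """Recursively compute each queue's absolute share of cluster resources.

    `tree` maps a queue name to either a percentage (leaf queue) or a
    (percentage, children) pair (parent queue). Each child's percentage
    is relative to its parent's share, as in hierarchical queues.
    """
    shares = {}
    for name, node in tree.items():
        qpath = f"{path}/{name}" if path else name
        if isinstance(node, tuple):
            pct, children = node
            shares[qpath] = total * pct / 100
            shares.update(effective_capacity(children, shares[qpath], qpath))
        else:
            shares[qpath] = total * node / 100
    return shares

# 1000 "resource units" split between two top-level queues;
# the "prod" queue is further subdivided.
cluster = {
    "prod": (70, {"etl": 60, "adhoc": 40}),
    "dev": 30,
}
shares = effective_capacity(cluster, 1000)
# prod -> 700, prod/etl -> 420, prod/adhoc -> 280, dev -> 300
```

The point of the sketch is that a leaf queue's guarantee is a fraction of its parent's share, not of the whole cluster, which is what lets organizations carve up capacity hierarchically.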

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource request types that represent the resources required for its containers. The resource requests include memory, CPU, disk, network etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. Scheduler plug-ins can be based, for example, on the current CapacityScheduler and FairScheduler.
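The contrast with fixed-type slots can be sketched as follows: instead of a node holding N interchangeable "map slots", a container request names how much of each resource dimension it needs, and a grant shrinks the node's free capacity by exactly that amount. The class and field names here are invented for illustration and are not the actual YARN API.

```python
# Hypothetical sketch of multi-dimensional resource requests replacing
# fixed-type slots. Only memory and vcores are modeled for brevity.
from dataclasses import dataclass

@dataclass
class Resource:
    memory_mb: int
    vcores: int

    def fits_in(self, other):
        # A request fits only if every dimension fits.
        return (self.memory_mb <= other.memory_mb
                and self.vcores <= other.vcores)

@dataclass
class Node:
    available: Resource

def allocate(node, request):
    """Grant a container if the node has room; shrink its free capacity."""
    if request.fits_in(node.available):
        node.available = Resource(
            node.available.memory_mb - request.memory_mb,
            node.available.vcores - request.vcores,
        )
        return True
    return False

node = Node(Resource(memory_mb=8192, vcores=4))
granted = allocate(node, Resource(memory_mb=2048, vcores=1))
# granted is True; node.available is now 6144 MB / 3 vcores
```

Because requests are sized per dimension, a memory-light, CPU-heavy container and a memory-heavy, CPU-light one can pack onto the same node, which is exactly the utilization win over fixed slots that the paragraph above describes.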

The NodeManager is the per-machine framework agent that is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.
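The reporting half of that job amounts to aggregating per-container usage into a node-level report the scheduler can act on. A minimal sketch, with invented field names (real NodeManager heartbeats carry much richer state):

```python
# Hypothetical sketch of a NodeManager-style usage report: sum what the
# running containers consume and derive what remains on the node.

def node_report(containers, capacity):
    """Aggregate per-container usage into used/available totals."""
    used = {k: sum(c[k] for c in containers) for k in capacity}
    available = {k: capacity[k] - used[k] for k in capacity}
    return {"used": used, "available": available}

capacity = {"memory_mb": 16384, "vcores": 8}
containers = [
    {"memory_mb": 2048, "vcores": 1},   # e.g. an ApplicationMaster container
    {"memory_mb": 4096, "vcores": 2},   # e.g. a task container
]
report = node_report(containers, capacity)
# used: 6144 MB / 3 vcores; available: 10240 MB / 5 vcores
```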

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring their progress, and handling task failures.
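The ApplicationMaster's lifecycle can be sketched as a loop: request a container, run a task in it, and re-request on failure. Everything below is an invented simulation of that control flow, not the real AMRMClient protocol, which is heartbeat-based and asynchronous.

```python
# Hypothetical sketch of an ApplicationMaster's allocate-launch-retry loop.
import itertools

def run_application(tasks, ask):
    """Drive tasks to completion, re-asking for a container when one fails."""
    pending = list(tasks)
    completed = []
    while pending:
        task = pending.pop(0)
        container = ask()          # negotiate a container from the scheduler
        if task(container):        # launch the task and observe its outcome
            completed.append(task)
        else:
            pending.append(task)   # task failure: request another container
    return completed

container_ids = itertools.count(1)
attempts = {"flaky": 0}

def stable_task(container):
    return True

def flaky_task(container):
    attempts["flaky"] += 1
    return attempts["flaky"] > 1   # fails once, then succeeds on retry

done = run_application([stable_task, flaky_task],
                       lambda: next(container_ids))
# both tasks complete; the flaky one needed a second container
```

Note how failure handling lives entirely in the application-side loop: the scheduler only hands out containers, which is precisely the division of labor the re-architecture describes.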

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.

3897 questions
1
vote
1 answer

yarn logs - stdout and stderr became huge files - how to avoid that

Dear friends and colleagues, we have an Ambari cluster with Hadoop version 2.6.4. The cluster includes 52 datanode machines, and the following issue happened on 9 of the datanode machines, so I will explain the problem: We noticed a critical problem…
Judy
  • 1,595
  • 6
  • 19
  • 41
1
vote
1 answer

how to run spark on yarn-client

I am trying to run pyspark on yarn-client; I am not sure what the reason might be and I can't interpret the logs correctly. import sys from pyspark.sql import SparkSession from pyspark import SparkContext, SparkConf conf =…
Exorcismus
  • 2,243
  • 1
  • 35
  • 68
1
vote
1 answer

how to connect sparkcontext to CDH 6 on yarn

I'm trying to run a simple MLlib function (FPGrowth) from Java from a remote computer on CDH 6 community version. By default I tried to connect like this: `SparkConf conf = new SparkConf().setAppName("FPGrowth").setMaster("spark://some…
m scorpion
  • 21
  • 4
1
vote
1 answer

How to copy EMR streaming job logs to S3 and clean logs on EMR core node's disk

Good day, I am running a Flink (v1.7.1) streaming job on AWS EMR 5.20, and I would like to have all task_managers and job_manager's logs of my job in S3. Logback is used as recommended by the Flink team. As it is a long-running job, I want the logs…
Averell
  • 793
  • 2
  • 10
  • 21
1
vote
2 answers

What is the Amazon Snowflake execution engine

When you are on Hadoop, you can have YARN manage the Hadoop jobs, resources, etc ... What is the equivalent form for Amazon's Snowflake? Hadoop (HDFS) is to YARN as Snowflake is to __________
Micah Pearce
  • 1,805
  • 3
  • 28
  • 61
1
vote
0 answers

Is the CPU limit formula still valid if there are many threads running under YARN cgroups?

From the URL https://developer.ibm.com/hadoop/2017/06/30/deep-dive-yarn-cgroups/ we see the following CPU limit formulas for YARN cgroups. My question is: are those formulas still valid if there are many threads (suppose 500 threads, each doing the…
YuFeng Shen
  • 1,475
  • 1
  • 17
  • 41
1
vote
2 answers

How to tail yarn logs?

I am submitting a Spark job using the command below. I want to tail the YARN log using the application ID, similar to the tail command in a Linux box. export SPARK_MAJOR_VERSION=2 nohup spark-submit --class "com.test.TestApplication" --name TestApp…
Vasanth Subramanian
  • 1,040
  • 1
  • 13
  • 32
1
vote
1 answer

Running MapReduce word count on Hadoop gives Exception message: The system cannot find the path specified

This is my first Stack Overflow question ever. I've set up my Hadoop (2.9.2) single-node cluster in pseudo-distributed mode. When I try to run hadoop jar C:/MapReduceClient.jar wordcount /input_dir /output_dir, I get the following log with…
1
vote
0 answers

YARN workers running out of disk space

We are facing a No space on device error with Spark jobs running on our YARN cluster. This has a few bad results. First, the Spark jobs take longer or fail. Second, since the disk fills up, the nodes are disabled by the YARN NodeManager and are…
summerbulb
  • 5,709
  • 8
  • 37
  • 83
1
vote
0 answers

What is the usage of instantaneous fair share in FairScheduler?

In the Fair Scheduler, we treat the instantaneous fair share as a very important value, but I didn't see it used for sorting jobs anywhere except in preemption or maxAMShare?
littlehyde
  • 11
  • 1
1
vote
1 answer

Kill Spark Job or terminate EMR Cluster if job takes longer than expected

I have a spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but really the cluster is stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running…
conradlee
  • 12,985
  • 17
  • 57
  • 93
1
vote
1 answer

problem with Google cloud dataproc clusters create --properties tag

I was trying to enable yarn.log-aggregation-enable upon creating a Dataproc cluster using a gcloud command like the one below, according to https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties gcloud beta dataproc clusters…
1
vote
0 answers

Need to capture hive query from application id

I want to find the Hive query from an application ID. I know that in the Tez view I can see the query, but I want to know if there is any API I can use to find the query from an application ID, from the command line or through a curl command?
Nick
  • 11
  • 2
1
vote
1 answer

Running submitted job sequentially in Google Cloud Dataproc

I created a Google Dataproc cluster with 2 workers, using n1-standard-4 VMs for the master and workers. I want to submit jobs to a given cluster and have all jobs run sequentially (as on AWS EMR), i.e., if the first job is in the running state then the upcoming…
Neo-coder
  • 7,715
  • 4
  • 33
  • 52
1
vote
0 answers

Spark 2.3.1 custom non-Ambari installation on HDP performance issue

I have a custom non-Ambari installation of Spark 2.3.1 on HDP 2.6.2 running on a cluster. I have made all the necessary configuration changes as per the Spark and non-Ambari installation guides. Now when I submit the Spark job in YARN cluster mode, I see huge…