Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of the second-generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications, including next-generation MapReduce (MR2).

In the Big Data business, running fewer, larger clusters is cheaper than running many small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last five years there have been spot fixes; lately, however, these have come at an ever-growing cost, as evinced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood: as far back as late 2007, we documented the proposed fix on MapReduce's JIRA, MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, minor or major, such as bug fixes, performance improvements and features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive customer cycles as they validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN Architecture

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classical sense of MapReduce jobs or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. It also offers no guarantees about restarting failed tasks, whether they fail due to application errors or hardware failures.

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource request types that represent the resources required for its containers. The resource requests include memory, CPU, disk, network etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. Scheduler plug-ins can be based, for example, on the current CapacityScheduler and FairScheduler.
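The choice of scheduler plug-in is made through cluster configuration. As a minimal sketch (the property name comes from stock Apache Hadoop; the value shown assumes the CapacityScheduler is wanted), the relevant yarn-site.xml entry looks roughly like:

```xml
<!-- yarn-site.xml: select the scheduler policy plug-in.
     Substitute the FairScheduler class to use fair scheduling instead. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```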

The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.
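The resources a NodeManager offers for containers are likewise configurable. A hedged sketch, using standard Hadoop property names with purely illustrative values:

```xml
<!-- yarn-site.xml: resources this NodeManager advertises to the ResourceManager.
     The values below are examples, not recommendations. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>6</value>
</property>
```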

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring their progress, and handling task failures.

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.

3897 questions
1
vote
2 answers

get ip of emr master node from yarn cli

In order to get a list of the IP addresses of EMR slave nodes, one must run the following code: yarn node -list 2>/dev/null | sed -n "s/^\(ip[^:]*\):.*/\1/p" yarn node -list happens to print off the IP of the master node to stderr: 19/04/02…
Walrus the Cat
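As a small illustration of what the sed expression in that pipeline extracts (the sample hostnames below are made up, and yarn itself is not invoked here):

```shell
# Canned sample of `yarn node -list` output; the sed expression keeps only
# lines beginning with "ip" and strips everything from the first colon onward.
printf 'ip-10-0-1-21.ec2.internal:8041 RUNNING\nip-10-0-1-22.ec2.internal:8041 RUNNING\n' \
  | sed -n "s/^\(ip[^:]*\):.*/\1/p"
# prints:
# ip-10-0-1-21.ec2.internal
# ip-10-0-1-22.ec2.internal
```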
1
vote
0 answers

Spark executor lost when increasing the number of executor instances

My Hadoop cluster currently has 4 nodes and 45 cores running pyspark 2.4 through YARN. When I run spark-submit with one executor everything works fine, but if I change the number of executor-instances to 3 or 4 the executor is killed by the driver…
Mahmoud Odeh
1
vote
0 answers

Hadoop Compression ERROR: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

When I'm running Apache Kylin on Hadoop, I met the following error related to Hadoop MapReduce: 2019-03-20 08:06:00,193 ERROR [main] org.apache.kylin.engine.mr.KylinMapper: java.lang.UnsatisfiedLinkError:…
Jason27
1
vote
1 answer

Query taking time despite adding session settings

Following is the ETL generated query Query - SELECT infaHiveSysTimestamp('SS') as a0, 7991 as a1, single_use_subq30725.a1 as a2, SUBSTR(SUBSTR(single_use_subq30725.a2, 0, 5), 0, 5) as a3, CAST(1 AS SMALLINT) as a4, single_use_subq30725.a3 as a5,…
Kumar
1
vote
1 answer

Hadoop CLI command to get Total Memory Used, etc like shown in Hadoop Web UI on 8088

Is there a CLI command I can use to get the Metrics show in this picture as they appear in the Hadoop Web UI on 8088?
timbram
1
vote
2 answers

How to find yarn application statistics from command line in human readable format

I have an application with some id like application_2019xxxxxxxxxxxxx I'm able to find its statistics with the command yarn application -status application_2019xxxxxxxxxxxxx which gives output in key-value format. The issue here is some of the fields…
amol_shaligram
1
vote
1 answer

Springboot spark yarn

I am new to Spark, and I am trying to submit my spring spark application to yarn cluster. The spark config is initialized in the spring, but it is not getting the yarn detail while submitting, and it always points to local. I know am missing out…
Darklord
1
vote
1 answer

hive Query hits the same view multiple times, any optimal way to approach this query

We are supporting an application which runs huge hive queries triggered via an ETL tool. The query after the mapping runs on hive. The query is very big but its structure looks like this. INSERT INTO Table2 Select t1.f0,…
Kumar
1
vote
2 answers

Approach to reduce the execution time of a Hive query

We run this below query daily and this query runs for 3 hours or so, owing to the sheer volume of data in the transaction table. Is there any way we can tune this query or reduce the execution time? CREATE TEMPORARY TABLE t1 AS SELECT…
akash sharma
1
vote
2 answers

Spark job in Dataproc dynamic vs static allocation

I have a Dataproc cluster: master - 6cores| 32g worker{0-7} - 6cores| 32g Maximum allocation: memory:24576, vCores:6 Have two spark-streaming jobs to submit, one after another In the first place, I tried to submit with default configurations…
1
vote
1 answer

Stuck in App time line server installation in ambari 2.6.2

Getting an error when installing the App Timeline Server. Please find the below error. stderr: Traceback (most recent call last): File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 89,…
1
vote
0 answers

How to get the maximum word count in Hadoop?

I have managed to get my Word Count program under wraps and now I want to be able to get the maximum occurrence. My output for my WordCount looks like this: File1:Word1: x File1:Word2: x Where File represents a File, Word represents the searched…
Namorange
1
vote
2 answers

Apache flink - Timeout after submitting job on hadoop / yarn cluster

I am trying to upgrade our job from flink 1.4.2 to 1.7.1 but I keep running into timeouts after submitting the job. The flink job runs on our hadoop cluster (version 2.7) with Yarn. I've seen the following behavior: Using the same flink-conf.yaml…
Richard Deurwaarder
1
vote
0 answers

Error using pyspark: ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig

I'm trying to get a pyspark script running remotely on AWS EMR, following the instructions provided by AWS. However, when I try to submit my script, I am getting the following exception: Traceback (most recent call last): File…
aco
1
vote
1 answer

Can't kill YARN apps using ResourceManager UI after HDP 3.1.0.0-78 upgrade

I recently upgraded HDP from 2.6.5 to 3.1.0, which runs YARN 3.1.0, and I can no longer kill applications from the YARN ResourceManager UI, using either the old (:8088/cluster/apps) or new (:8088/ui2/index.html#/yarn-apps/apps) version. I can still…
ammills01