Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of the second-generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications, including next-generation MapReduce (MR2).

In the Big Data business, running fewer, larger clusters is cheaper than running many small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last five years there have been spot fixes; lately, however, these have come at an ever-growing cost, as evinced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood - even as far back as late 2007, when we documented the proposed fix on MapReduce's JIRA: MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, minor or major, such as bug fixes, performance improvements and new features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive customer cycles as they validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

Figure: YARN Architecture

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classic MapReduce sense or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. It also offers no guarantees about restarting failed tasks, whether they fail due to application errors or hardware failures.
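As an illustrative sketch of such a queue hierarchy (the queue names and percentages below are made up, not from the source), guaranteed capacities can be declared in the CapacityScheduler's capacity-scheduler.xml:

```xml
<configuration>
  <!-- Two top-level queues under root; names are illustrative -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <!-- Guaranteed percentages of cluster resources; siblings must sum to 100 -->
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

Each queue can in turn declare child queues with their own capacities, giving the hierarchical structure described above.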

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource request types representing the resources required for its containers. Resource requests include memory, CPU, disk, network, etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, etc. Scheduler plug-ins can be based, for example, on the current CapacityScheduler and FairScheduler.
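As a minimal sketch, the scheduler plug-in is selected via a single property in yarn-site.xml (the property and class names below follow standard Hadoop 2.x configuration):

```xml
<!-- yarn-site.xml: choose the scheduler plug-in used by the ResourceManager -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <!-- or, for fair scheduling:
       org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
</property>
```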

The NodeManager is the per-machine framework agent responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.
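The resources a NodeManager advertises to the Scheduler are declared in its yarn-site.xml; a hedged example (the values below are illustrative, not recommendations):

```xml
<!-- yarn-site.xml: resources this node offers for containers (example values) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
```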

The per-application ApplicationMaster is responsible for negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status and progress, and handling task failures.

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.

3897 questions
1 vote • 0 answers

4-profiles calculation of a big graph with Apache Giraph

For my master's thesis in computer science I succeeded in implementing the 4-profiles calculation (https://arxiv.org/abs/1510.02215) using giraph-1.3.0-snapshot (compiled with the -Phadoop_yarn profile) and hadoop-2.8.4. I configured a cluster on Amazon EC2…
1 vote • 0 answers

Apache slider LLAP container fails to start

I am trying to launch an LLAP container and I see the following error in the container log Log Type: slider-agent.out Log Upload Time: Fri Nov 30 02:01:10 -0800 2018 Log Length: 783 Traceback (most recent call last): File…
Vijay Muvva
1 vote • 0 answers

Spark in Yarn Cluster Mode - Yarn client reports FAILED even when job completes successfully

I am experimenting with running Spark in yarn cluster mode (v2.3.0). We have traditionally been running in yarn client mode, but some jobs are submitted from .NET web services, so we have to keep a host process running in the background when using…
Stuart
1 vote • 1 answer

spark-submit --files hdfs://file get cached in /tmp on driver

I'm running a spark-submit like this: spark-submit --deploy-mode client --master yarn --conf spark.files.overwrite=true --conf spark.local.dir='/my/other/tmp/with/more/space' --conf…
maffe
1 vote • 1 answer

Hadoop - failed to specify server's Kerberos principal name

Error - Failed to specify server's Kerberos principal name. I am trying to set up a Hadoop cluster using Kerberos. I managed to get the cluster working with Spark and YARN before starting the Kerberos configuration. Currently my master and three nodes…
user10347849
1 vote • 1 answer

Query on Yarn and Spark

I need to use Spark to export data from Hive (partitioned) to Teradata (non-partitioned). Cluster spec: 120 worker nodes, each having 16-core processors and 128 GB RAM. The table size is around 130 GB, and when I am creating a dataframe out of it, it produces…
Rony
1 vote • 1 answer

Hive: Mapreduce File missing

I can enter Hive-cli and create new tables. However, when I try to insert data to the table, it says: java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/user/yarn/mapreduce/mr-framework/3.0.0-cdh6.0.1-mr-framework.tar.gz but…
user2894829
1 vote • 1 answer

Yarn local-dirs - per node setup

I've had a series of devops issues from time to time on our production cluster. Every now and then, / partition gets overwhelmed on couple of nodes. Long story short, it turns out that these nodes had 1 instead of 2 data drives. This would not be an…
hummingBird
1 vote • 1 answer

Need help in understanding PySpark execution on YARN as master

I already have some picture of the YARN architecture as well as the Spark architecture. But when I try to understand them together (that's what happens when a Spark job runs on YARN as master) on a Hadoop cluster, I get into some confusion. So first I…
akhil pathirippilly
1 vote • 1 answer

How to query the Azure HDInsight Hadoop cluster YARN timeline server

How can I query the timeline server in an Azure HDInsight Hadoop cluster to get job metrics? Connecting to the Azure cluster: curl -u admin -sS -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME" Connecting to the timeline server:…
1 vote • 0 answers

Spark on Yarn - AM container preemption vs non AM container preemption

When running Spark on YARN, I understand that jobs may exceed their resource quota during quiet times but may be preempted when other users require their quota. This seems fair; however, I occasionally see that an AM quota has been preempted, causing my…
Terry Dactyl
1 vote • 2 answers

Where do YARN application logs get stored in EMR before being sent to S3

I have a requirement to write YARN application logs from EMR to a source other than S3. Can you please help me find where application logs get saved on the EMR master instance?
Manoj4068
1 vote • 0 answers

Hortonworks HDP 3: Error starting ResourceManager

I have installed a new cluster HDP 3 using ambari 2.7. The problem is that resource manager service is not starting. I get the following error: Traceback (most recent call last): File…
1 vote • 0 answers

How to configure spark and log4j in order to log into local file system in yarn cluster mode

I want to collect the log messages created by a Spark application into a file on the local file system. I was able to achieve this when running the application in client mode; however, I ran into difficulties in cluster mode. My configuration…
sanyi14ka
1 vote • 2 answers

Running a system command in Hadoop using spark_apply from sparklyr

I want to run a Java tool on data stored in a Hadoop cluster. I am trying to do it using the spark_apply function from sparklyr, but I am a bit confused by the syntax. Before running the spark code, I've set up a conda environment following the…
dalloliogm