I already have some picture of the YARN architecture as well as the Spark architecture. But when I try to understand them together (that is, what happens when a Spark job runs with YARN as the master) on a Hadoop cluster, I run into some confusion. So first I will describe my understanding with the example below, and then I will come to my questions.
Say I have a file "orderitems" stored on HDFS with some replication factor. Now I am processing the data by reading this file into a Spark RDD (say, for calculating order revenue). I have written the code and configured the spark-submit as given below:
spark-submit \
--master yarn \
--conf spark.ui.port=21888 \
--num-executors 2 \
--executor-memory 512M \
src/main/python/order_revenue.py
Let's assume that I have created the RDD with 5 partitions and that I have executed the job in yarn-client mode.
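For context, here is a rough sketch of what my order_revenue.py looks like; the HDFS path and the column positions are just placeholders, and the second argument to textFile is where the 5 partitions come from:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("order_revenue")
sc = SparkContext(conf=conf)

# read the "orderitems" file from HDFS; 5 is the minimum number of partitions
order_items = sc.textFile("/user/me/orderitems", 5)

# placeholder layout: comma separated, order id in field 1, subtotal in field 4
revenue_per_order = (order_items
                     .map(lambda line: line.split(","))
                     .map(lambda f: (f[1], float(f[4])))
                     .reduceByKey(lambda a, b: a + b))

print(revenue_per_order.take(10))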
Now, as per my understanding, once I submit the Spark job on YARN:
- The request goes to the Application Manager, which is a component of the Resource Manager.
- The Application Manager finds a Node Manager and asks it to launch a container.
- This is the first container of the application, and we call it the Application Master.
- The Application Master takes over the responsibility of executing and monitoring the job.
Since I have submitted in client mode (the default deploy mode when --deploy-mode is not given), the driver program will run on my edge node/gateway node. I have set num-executors to 2 and executor memory to 512 MB.
I have also set the number of partitions for the RDD to 5, which means it will create 5 partitions of the data read and distribute them over 5 nodes.
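As a small sanity check (just a sketch reusing the order_items RDD from above), I believe the partition count can be printed inside the script:

# textFile's second argument is a minimum, so I expect this to print 5 or more
print(order_items.getNumPartitions())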
Now here are my few points of confusion over this:
- I have read in the user guide that the partitions of an RDD will be distributed to different nodes. Are these nodes the same as the 'data nodes' of the HDFS cluster? I mean, here there are 5 partitions; does this mean the data is on 5 data nodes?
- I have set num-executors to 2, so these 5 partitions of data will utilize 2 executors (CPUs). My next question is: from where will these 2 executors be picked? I mean, the 5 partitions are on 5 nodes, right? So will these 2 executors also be on some of those nodes?
- The scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc. Also, a container is a Linux control group (cgroup), a Linux kernel feature that allows users to allocate CPU, memory, disk I/O, and bandwidth to a user process. So my final question is: are the containers actually provided by the "scheduler"?
I am confused here. I have gone through the architecture docs, the release documentation, and some videos, and ended up more mixed up.
Hoping for some helping hands here.