I already have some picture of the YARN architecture as well as the Spark architecture. But when I try to understand them together (that is, what happens when a Spark job runs with YARN as the master) on a Hadoop cluster, I run into some confusion. So first I will describe my understanding with the example below, and then I will come to my questions.
Say I have a file "orderitems" stored on HDFS with some replication factor. Now I am processing the data by reading this file into a Spark RDD (say, for calculating order revenue). I have written the code and configured the spark-submit as given below:
spark-submit \
--master yarn \
--conf spark.ui.port=21888 \
--num-executors 2 \
--executor-memory 512M \
src/main/python/order_revenue.py
Let's assume that I have created the RDD with 5 partitions and that I have executed the job in yarn-client mode.
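For context, here is a rough sketch of what my order_revenue.py looks like; the HDFS path and the column positions are just placeholders, and the second argument to textFile is where the 5 partitions come from:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("order_revenue")
sc = SparkContext(conf=conf)

# read the "orderitems" file from HDFS; 5 is the minimum number of partitions
order_items = sc.textFile("/user/me/orderitems", 5)

# placeholder layout: comma separated, order id in field 1, subtotal in field 4
revenue_per_order = (order_items
                     .map(lambda line: line.split(","))
                     .map(lambda f: (f[1], float(f[4])))
                     .reduceByKey(lambda a, b: a + b))

print(revenue_per_order.take(10))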
Now, as per my understanding, once I submit the Spark job on YARN:
- The request goes to the Application Manager, which is a component of the Resource Manager.
- The Application Manager finds a Node Manager and asks it to launch a container.
- This is the first container of the application, and we call it the Application Master.
- The Application Master takes over the responsibility of executing and monitoring the job.
Since I have submitted in client mode (the default deploy mode when --deploy-mode is not given), the driver program will run on my edge node/gateway node. I have set num-executors to 2 and executor memory to 512 MB.
I have also set the number of partitions for the RDD to 5, which means it will create 5 partitions of the data read and distribute them over 5 nodes.
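As a small sanity check (just a sketch reusing the order_items RDD from above), I believe the partition count can be printed inside the script:

# textFile's second argument is a minimum, so I expect this to print 5 or more
print(order_items.getNumPartitions())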
Now here are my few points of confusion over this:
- I have read in the user guide that the partitions of an RDD will be distributed to different nodes. Are these nodes the same as the 'data nodes' of the HDFS cluster? I mean, here there are 5 partitions; does this mean the data is on 5 data nodes?
- I have set num-executors to 2, so these 5 partitions of data will utilize 2 executors (CPUs). My next question is: from where will these 2 executors be picked? I mean, the 5 partitions are on 5 nodes, right? So will these 2 executors also be on some of those nodes?
- The scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc. Also, a container is a Linux control group (cgroup), a Linux kernel feature that allows users to allocate CPU, memory, disk I/O, and bandwidth to a user process. So my final question is: are the containers actually provided by the "scheduler"?
I am confused here. I have gone through the architecture docs, the release documentation, and some videos, and ended up more mixed up.
Hoping for some helping hands here.