I already have a cluster of 3 machines (ubuntu1, ubuntu2, ubuntu3; VirtualBox VMs) running Hadoop 1.0.0. I installed Spark on each of these machines. ubuntu1 is my master node and the other nodes are working as slaves. My questions are: what exactly is a Spark driver? Should we set an IP and port for the Spark driver via spark.driver.host?
And where will it be executed and located (master or slave)?

7 Answers
The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark master. In the case of a local cluster, as in your case, the master_url = spark://<host>:<port>
Its location is independent of the master/slaves. You could co-locate it with the master or run it from another node. The only requirement is that it must be on a network addressable from the Spark workers.
This is what the configuration of your driver looks like:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("master_url") // this is where the master is specified
  .setAppName("SparkExamplesMinimal")
  .set("spark.local.ip", "xx.xx.xx.xx") // helps when multiple network interfaces are present. The driver must be in the same network as the master and slaves
  .set("spark.driver.host", "xx.xx.xx.xx") // same as above. This duality might disappear in a future version
val sc = new SparkContext(conf)
// etc...
To explain the different roles a bit more (a small sketch follows this list):
- The driver prepares the context and declares the operations on the data using RDD transformations and actions.
- The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
- The workers are where the tasks are actually executed. They must have the resources and network connectivity required to execute the operations requested on the RDDs.
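As a small sketch of these roles (hypothetical input path, assuming sc is the SparkContext created by the driver):
val lines  = sc.textFile("hdfs:///some/input")  // declared on the driver; nothing executes yet
val counts = lines.flatMap(_.split("\\s+"))     // transformations only build up the RDD graph
  .map(word => (word, 1))
  .reduceByKey(_ + _)
val top10  = counts.take(10)                    // action: tasks run on the workers and the ten results return to the driver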

-
Thank you for the response - if we don't assign any IP for the Spark driver, its default is localhost, am I right? – user3789843 Jul 09 '14 at 04:16
-
I didn't set up a driver program anywhere; I am getting the result, but I got the following in my stderr: 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] -> [akka.tcp://spark@master:54477] disassociated! Shutting down. – user3789843 Jul 09 '14 at 04:50
-
I got this too: Spark Executor Command: "java" "-cp" "::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@master:54477/user/CoarseGrainedScheduler" "0" "slave2" "1" "akka.tcp://sparkWorker@slave2:41483/user/Worker" "app-20140704174955-0002" ======================================== – user3789843 Jul 09 '14 at 04:54
-
Could you please clarify one thing for me @maasg - if I have slave S1, S2, S3 say, and submit a job from S1 (i.e. run `java -cp my.jar`), then `toArray` will cause the data to come back to S1. So I now have 2 JVMs running on S1 which have to share memory. Is there any way to tell spark to share the memory - i.e. keep large objects in the S1 worker JVM memory?? Should I ask this as a separate question? – samthebest Jul 09 '14 at 10:34
-
@user3789843 re: IP address of the Spark driver: No. The default IP is the first non-local address configured on the host. The driver starts a file-sharing server where the slaves pick up the jar of the job. If the driver address is `localhost`, it will not be reachable from the slaves. – maasg Jul 09 '14 at 12:31
-
@samthebest Hi Sam, that's not a Spark restriction, but a jvm design principle. Back in 1995, one of the design cornerstones of the jvm was to make Java a safe language and prevent any direct memory access. If you need to share memory, look into off-heap solutions like [Apache Tachyon](http://tachyon-project.org/index.html) – maasg Jul 09 '14 at 12:43
-
Actually I am running these three nodes in VirtualBox and I have the following in my hosts file: localhost 127.0.0.1, virtualbox 127.0.1.1, master 192.168.0.1, slave1 192.168.0.2, slave2 192.168.0.3. So the default would pick the master's IP as the Spark driver? I also submit the application on my master node with spark-submit. – user3789843 Jul 09 '14 at 12:55
-
@user3789843 where are you executing your Spark driver (program)? How are you running your 'Spark program'? Could you add that to the question? – maasg Jul 10 '14 at 09:53
-
I am running it on my master node like: /bin/spark-submit --master spark://207.184.161.138:7077 examples/src/main/python/pi.py 1000 – user3789843 Jul 10 '14 at 10:24
-
@maasg I think when we are running an application in a node(master), in fact we are executing the driver in that node(master) – user3789843 Jul 10 '14 at 10:33
-
@user3789843 yes. That's correct. The driver will 'live' in that node for the time of its execution. Are you having issues with that? That's not clear from your question. – maasg Jul 10 '14 at 11:22
-
@maasg Actually, I got this in stderr ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] -> [akka.tcp://spark@master:54477] disassociated! Shutting down. and nothing in stdout I am not sure this can be a problem or not – user3789843 Jul 10 '14 at 13:16
-
@user3789843 can't say for sure w/o looking at the job, but looks like it's probably that your driver terminated before the execution was finished. If you are getting the results of your job, probably a minor issue. If not, create a new question with the job code and the complete stack trace. – maasg Jul 10 '14 at 16:54
-
I think the "disassociated" error occurs if your driver terminates without cleanly shutting down the SparkContext. The executor notices that the driver is no longer reachable, prints out that error, and shuts itself down. – reggert Sep 05 '14 at 22:49
-
This could be a little bit late, but I have a question. I used Win7 for the 2 machines; one was started as master and used as driver (I think) too, and the other was a slave. My question is: should all the nodes (slaves/master/driver) have the same setup (installation locations and resource paths)? I mean, I tried to submit a Scala app to the master node, and I got an error message on the slave that `c:\java\java1.6\bin\java` is not found; that path is where Java is installed on the master/driver, so should it be the same installation location on the other nodes? – Yazan Aug 31 '15 at 12:01
-
why would the worker need to connect to the driver? The driver has a connection to the master which in turn has a connection to the worker. – Frederick Nord Sep 11 '15 at 12:50
-
@FrederickNord the driver serves the resources (JARs) needed for the executors to run tasks from the job. Hence the executors need to download those resources from the driver. The master is there to schedule execution. – maasg Sep 11 '15 at 16:31
Your question is related to deploying Spark on YARN; see http://spark.apache.org/docs/latest/running-on-yarn.html ("Running Spark on YARN").
Assume you start from a spark-submit --master yarn command:
- The command will ask the YARN ResourceManager (RM) to start an ApplicationMaster (AM) process on one of your cluster machines (those that have a YARN NodeManager installed on them).
- Once the AM has started, it will call your driver program's main method. So the driver is actually where you define your SparkContext, your RDDs, and your jobs. The driver contains the entry main method which starts the Spark computation.
- The SparkContext will prepare an RPC endpoint for the executors to talk back to, and a lot of other things (memory store, disk block manager, Jetty server...).
- The AM will ask the RM for containers to run your Spark executors, with the driver's RPC url (something like spark://CoarseGrainedScheduler@ip:37444) specified in the executors' start command (a sketch of inspecting this endpoint follows).
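As a small, hedged sketch of that last point (assuming an active SparkContext named sc): Spark fills in the driver's host and RPC port when the context starts, and the executors are launched with this endpoint, so it can be inspected from the runtime configuration:
val driverHost = sc.getConf.get("spark.driver.host") // set by Spark when the SparkContext starts
val driverPort = sc.getConf.get("spark.driver.port") // the RPC port the executors connect back to
println(s"Executors talk back to the driver at $driverHost:$driverPort")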

-
Where is the YARN queue set? I found that if I use spark-submit in client mode, the queue name set in the application main (where it creates the SparkContext) takes effect... but if I do spark-submit in cluster mode without --queue 'queueName', it gets the default queue, i.e. setting the queue name in the application main does not take effect. – soMuchToLearnAndShare Dec 02 '19 at 15:10
A Spark driver is the process that creates and owns an instance of SparkContext. It is your Spark application that launches the main method in which the instance of SparkContext is created. It is the cockpit of job and task execution (using the DAGScheduler and Task Scheduler). It hosts the Web UI for the environment.
It splits a Spark application into tasks and schedules them to run on executors. A driver is where the task scheduler lives and spawns tasks across workers. A driver coordinates workers and the overall execution of tasks.
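A small illustration of that split (hypothetical RDD, assuming an active SparkContext sc): toDebugString prints the lineage that the driver's DAGScheduler cuts into stages at shuffle boundaries, and the action triggers the scheduling of tasks onto the executors:
val pairs = sc.parallelize(1 to 1000).map(x => (x % 10, x)).reduceByKey(_ + _)
println(pairs.toDebugString) // the shuffle introduced by reduceByKey is where a new stage starts
pairs.count()                // the action: the driver schedules each stage's tasks onto the executors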

-
Are you sure about that? Let's say, trivially, we have a four-node cluster with one namenode (nn1) and three data nodes (dn1, dn2, dn3); will the driver not only break up the work into tasks (t1, t2, t3, etc.), but also assign a task t1 to a specific node dn1? I don't think it has knowledge of who gets what task, and that the master controls that. – uh_big_mike_boi Jun 11 '16 at 19:38
-
@mike, yes it does. More specifically, Spark is able to manage data locality, and thus send task where the data resides. Have a look to http://spark.apache.org/docs/latest/configuration.html#scheduling, for instance. – mathieu Nov 08 '16 at 10:50
-
I think we are confusing the driver with spark in general. Maybe I wasn't clear. In this locality setting, the configuration you pointed out deals with picking another node if the wait is too long. It does not know that node A has data 1-10 and node B has data 11-20 and node C has data 21-30. It sends the direction to the cluster manager to pick the local node, and the cluster manager knows that data 21-30 is in node C. Essentially the cluster manager is a record keeper, not a decision maker. – uh_big_mike_boi Nov 08 '16 at 15:17
-
The driver is the application. The master process controls how to execute the driver's api requests - these two can run on separate machines or the same machine, but they are distinct. – ThatDataGuy Jan 06 '17 at 21:49
In simple terms, the Spark driver is the program that contains the main method (the main method is the starting point of your program). So, in Java, the driver will be the class that contains public static void main(String[] args).
In a cluster, you can run this program in either of two ways: 1) On any remote host machine. Here you have to provide the remote host machine's details while submitting the driver program to the remote host. The driver runs in the JVM process created on the remote machine and only comes back with the final result.
2) Locally from your client machine (your laptop). Here the driver program runs in a JVM process created locally on your machine. From here it sends the tasks to the remote hosts and waits for the results of each task.
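A minimal sketch of such a driver program in Scala (the names are illustrative, and the master/deploy mode would normally be supplied by spark-submit rather than hard-coded):
import org.apache.spark.{SparkConf, SparkContext}

object MyDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyDriver")
    val sc = new SparkContext(conf)
    val total = sc.parallelize(1 to 100).reduce(_ + _) // the work runs on the executors; the result comes back to this JVM
    println("sum = " + total)
    sc.stop()
  }
}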

If you set the deploy mode to cluster (for example spark-submit --deploy-mode cluster), then your driver will be launched on one of your worker hosts (ubuntu2 or ubuntu3).
If the deploy mode is client, which is the default, then the driver will run on the machine from which you submit your application.
And finally, you can see your application on the web UI at http://driverhost:driver_ui_port, where driver_ui_port defaults to 4040; you can change the port by setting the config spark.ui.port.
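For example, a hedged sketch of pinning the UI port instead of relying on the 4040 default (the app name and port number are just illustrations):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("UiPortExample")
  .set("spark.ui.port", "4050") // the driver then serves its web UI at http://<driver-host>:4050
val sc = new SparkContext(conf)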

The Spark driver is the node that allows the application to create the SparkContext; the SparkContext is the connection to the compute resources. The driver can run on the box the job is submitted from, or it can run on one of the nodes of the cluster when using a resource manager like YARN.
Both options, client and cluster mode, have some trade-offs, like:
- Access to the CPU/memory of one of the nodes of the cluster; sometimes this is good because a cluster node can be big in terms of memory.
- Driver logs end up on a cluster node rather than on the local box from which the job was submitted.
- For cluster mode you should have a history server, otherwise the driver-side logs are lost.
- Sometimes it is hard to install a dependency (i.e. some native dependency) on the executors, and running the Spark application in client mode comes to the rescue.
If you want to read more on the anatomy of a Spark job, then the post http://ashkrit.blogspot.com/2018/09/anatomy-of-apache-spark-job.html could be useful.

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
[Figure: Spark cluster components]
There are several useful things to note about this architecture:
- Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
- Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
- The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes (see the sketch after this list).
- Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
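One hedged sketch of making the driver reachable in such a setup (the address and port are assumptions based on the questioner's hosts file, not required values): fix the host and RPC port the driver listens on, so the matching firewall rules can be opened between the driver machine and the workers:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("AddressableDriver")
  .set("spark.driver.host", "192.168.0.1") // assumed: the LAN address of the machine running the driver
  .set("spark.driver.port", "51000")       // assumed: a fixed RPC port; by default Spark picks a random free port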
