
I'm running a Spark application in YARN-client or YARN-cluster mode, but it seems to take too long to start up: initializing the Spark context alone takes 10+ seconds.

Is this normal, or can it be optimized?

The environment is as follows:

  • Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6) (Tiny test cluster with 3 data nodes)
  • Spark: 1.3.1
  • Client: Windows 7, but similar result on CentOS 6.6

The following is the startup part of the application log. (Some private information has been edited out.)

‘Main: Initializing context’ on the first line and ‘MainProcessor: Deleting previous output files’ on the last line are logged by the application; everything in between comes from Spark itself. The application logic runs only after these lines have been printed.

15/05/07 09:18:31 INFO Main: Initializing context
15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
15/05/07 09:18:31 INFO Remoting: Starting remoting
15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
15/05/07 09:18:32 INFO Client: Source and destination file systems are the same. Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:34 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1430957906540
     final status: UNDEFINED
     tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
     user: myapp
15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
15/05/07 09:18:38 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: cluster02
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1430957906540
     final status: UNDEFINED
     tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
     user: myapp
15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files

Thanks.

UPDATE

I think I’ve found the (maybe partial, but major) reason.

It’s between the following lines:

15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1

When I read the logs on the cluster side, I found the following lines (the exact timestamps differ from those above, but only because of the clock difference between machines):

15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454

It seems that Spark deliberately sleeps for 5 seconds.

I read the Spark source code: in org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() contains the relevant code. It loops, calling allocator.allocateResources() and then Thread.sleep(). The sleep duration comes from the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, i.e. 5 seconds). According to the comment, “we want to be reasonably responsive without causing too many requests to RM”. So unless YARN fulfills the allocation request immediately, about 5 seconds are wasted.

When I changed the configuration variable to 1000, it waited for only 1 second.

Here are the log lines after the change:

15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454

4 seconds saved.

So, if you do not want to wait 5 seconds, you can lower spark.yarn.scheduler.heartbeat.interval-ms.
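For example, here is a minimal sketch of setting it when creating the context (it can also be passed via --conf on spark-submit); the object and application name just mirror my log above:

    // Minimal sketch (Spark 1.3.x API): lower the YARN AM heartbeat interval.
    // The value is in milliseconds; the default is 5000.
    import org.apache.spark.{SparkConf, SparkContext}

    object Main {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("MyApp")
          .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")
        val sc = new SparkContext(conf)
        // ... application logic ...
        sc.stop()
      }
    }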

I hope the additional overhead of the more frequent heartbeat is negligible.

UPDATE

A related JIRA issue has since been opened and resolved: https://issues.apache.org/jira/browse/SPARK-7533

zeodtr

4 Answers


This is pretty typical. My system takes about 20 seconds from running spark-submit until getting a SparkContext.

As it says in the docs in a couple of places, the solution is to turn your driver into an RPC server. That way you initialize once, and then other applications can use the driver's context as a service.

I am in the middle of doing this with my application. I am using http4s and turning my driver into a web server.
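To illustrate the pattern, here is a rough, self-contained sketch, not my actual http4s code: it uses the JDK's built-in HttpServer instead, and the /count endpoint plus the line-count job are placeholder assumptions. The point is that the SparkContext is created once at startup and every request reuses it.

    import java.net.InetSocketAddress
    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import org.apache.spark.{SparkConf, SparkContext}

    object DriverServer {
      def main(args: Array[String]): Unit = {
        // Pay the 10-20 second initialization cost exactly once.
        val sc = new SparkContext(new SparkConf().setAppName("DriverServer"))

        val server = HttpServer.create(new InetSocketAddress(8090), 0)
        server.createContext("/count", new HttpHandler {
          override def handle(exchange: HttpExchange): Unit = {
            // The request body carries an input path; the long-lived context runs the job.
            val path = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString.trim
            val body = sc.textFile(path).count().toString.getBytes("UTF-8")
            exchange.sendResponseHeaders(200, body.length)
            exchange.getResponseBody.write(body)
            exchange.close()
          }
        })
        server.start() // the driver now stays up, serving jobs as a service
      }
    }

Other applications then hit the endpoint instead of paying the spark-submit startup cost themselves.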

David Griffin
  • Thank you. With the RPC approach, it seems that Spark's dynamic resource allocation feature can be helpful, otherwise cluster usage can be poor. – zeodtr May 07 '15 at 11:55
  • But I'm just curious what causes a 10-20 second delay on a local network... while servers on the other side of the earth can be reached over HTTPS in a couple of seconds. – zeodtr May 07 '15 at 11:57
  • Yeah but -- if the HTTPS server on the other side of the earth had to start up with each request, and fire up its logging subsystem, and connect to HDFS, etc., it would take 20 seconds to get a response too! – David Griffin May 07 '15 at 12:46
  • I've looked into http4s for an RPC server, but was a bit disappointed that it's a Scala library, since I'm currently writing Spark applications in Java 8. I've searched for other solutions and found a very similar one written in Java. Interestingly, its name is also Spark. :) http://sparkjava.com/ – zeodtr May 08 '15 at 01:06
  • Ah yes, sorry about that. I actually learned Scala just because of Spark. – David Griffin May 08 '15 at 01:09
  • Please see my question update. I think I've found a major cause in my case, though probably not in yours: if the wait is about 20 seconds, it's likely that the cluster is actually slow to allocate the resources. But maybe you can save a couple of seconds. – zeodtr May 08 '15 at 04:27
  • If you are running a multitude of jobs, each potentially running for minutes and scheduled across tens or maybe hundreds of nodes, the startup overhead matters less. Resources not yet used become available to other jobs that have already started. – YoYo Nov 21 '15 at 05:39
  • Instead of building an "RPC server" from scratch, have a look at an existing project, Apache Livy, a REST API server for Apache Spark: http://livy.io/quickstart.html / https://github.com/cloudera/livy – Tagar May 25 '17 at 22:37

For faster creation of the Spark context (tested on EMR):

  1. cd /usr/lib/spark/jars/; zip /tmp/yarn-archive.zip *.jar

  2. cd path/to/folder/of/someOtherDependency/jarFolder/; zip /tmp/yarn-archive.zip jar-file.jar

  3. zip -Tv /tmp/yarn-archive.zip to test the archive's integrity with verbose output

  4. If yarn-archive.zip already exists on HDFS, remove it first with hdfs dfs -rm -r -f -skipTrash /user/hadoop/yarn-archive.zip and then run hdfs dfs -put /tmp/yarn-archive.zip /user/hadoop/; otherwise just run hdfs dfs -put /tmp/yarn-archive.zip /user/hadoop/

  5. Pass --conf spark.yarn.archive="hdfs:///user/hadoop/yarn-archive.zip" to spark-submit

The reason this works is that the master does not have to distribute all the jars to the slaves; they are available from a common HDFS path, here hdfs:///user/hadoop/yarn-archive.zip.

I found that it can save you about 3-5 seconds; the savings also depend on the number of nodes in the cluster: the more nodes, the more time you save.

desaiankitb
  • It barely makes a difference. With a lot of testing I could not measure any significant improvement. – Kiwy Feb 10 '22 at 14:19

You could check out Apache Livy, which is a REST API in front of Spark.

You could have one session and multiple requests to that one Spark/Livy session.
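As a rough sketch of what a client looks like (the host, the hard-coded session id, and the snippet sent as "code" are illustrative assumptions; the endpoints follow Livy's documented REST API):

    import java.io.OutputStreamWriter
    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object LivySketch {
      // Tiny helper: POST a JSON body and return the response body as a string.
      private def post(url: String, json: String): String = {
        val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setDoOutput(true)
        conn.setRequestProperty("Content-Type", "application/json")
        val out = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
        out.write(json); out.close()
        Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
      }

      def main(args: Array[String]): Unit = {
        val livy = "http://livy-host:8998"   // assumed host; 8998 is Livy's default port
        // Pay the Spark startup cost once, when the session is created...
        println(post(s"$livy/sessions", """{"kind": "spark"}"""))
        // ...then submit as many statements as you like to that same session.
        // (Session id 0 is assumed; a real client would parse the id from the
        // creation response and poll GET /sessions/{id} until its state is "idle".)
        println(post(s"$livy/sessions/0/statements",
          """{"code": "sc.parallelize(1 to 100).count()"}"""))
      }
    }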

Tagar

If you're running tasks in standalone mode on macOS, remember to enable remote SSH connections (Remote Login) under System Preferences -> Sharing. (I don't know why this is necessary.)

Before enabling it, it took me about 1 minute to run spark-submit xx.py.

After enabling it, it takes only about 3 seconds.

I hope this helps others who hit this kind of issue on macOS.

poc.hsu
  • Could you explain more about what it is or how to do it? Maybe a reference to a doc? Thanks – ldmtwo Jul 30 '19 at 22:52