
I am submitting my Spark jobs from a local laptop to a remote standalone Spark cluster (spark://IP:7077). The job is submitted successfully, but I get no output and it fails after some time. When I check the workers on my cluster, I find the following exception:

Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver@localhost:54561/]/user/CoarseGrainedScheduler]

When I run the same code on my local system (local[*]), it runs successfully and gives the output.

Note that I run it in Spark Notebook. The same application runs successfully on the remote standalone cluster when I submit it from a terminal using spark-submit.

Am I missing something in the notebook configuration? Are there any other possible causes?

The code is very simple.

Detailed exception:

Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver@localhost:54561/]/user/CoarseGrainedScheduler]
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
    at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
    at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
    at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
    at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
    at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
    at akka.actor.ActorRef.tell(ActorRef.scala:125)
    at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)
    at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
    at akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
    at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
    at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
    at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:203)
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
    at akka.actor.ActorCell.terminate(ActorCell.scala:338)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
    at akka.dispatch.Mailbox.run(Mailbox.scala:218)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Sample code

val logFile = "hdfs://hostname/path/to/file"
val conf = new SparkConf() 
.setMaster("spark://hostname:7077") // as appears on hostname:8080
.setAppName("myapp")
.set("spark.executor.memory", "20G")
.set("spark.cores.max", "40")
.set("spark.executor.cores","20")
.set("spark.driver.allowMultipleContexts","true")

val sc2 = new SparkContext(conf)
val logData = sc2.textFile(logFile)
val numAs = logData.filter(line => line.contains("hello")).count()
val numBs = logData.filter(line => line.contains("hi")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
Spark User
  • please post your SparkConf setup and the command you are running to submit, curious if you use --deploy-mode – JimLohse Jan 14 '16 at 16:28
  • and BTW local[*] runs everything in the same JVM so it's not a proper indicator of a good network setup :) – JimLohse Jan 14 '16 at 16:33
  • Sorry to pepper you with questions and I am sure someone smarter than me will come along with a more direct solution, when you go to spark://master:8080 (wherever the master webui is) what's actually on that page? Is it spark://hostname:7077, spark://localhost:7077, spark://ip.add.res.s:7077? That's controlled by SPARK_MASTER_IP and may affect this situation. Again, hopefully someone comes along who can really zero in on this problem, maybe I am barking up the wrong tree but it really seems like the worker can't connect back to the master and it's a network issue. – JimLohse Jan 18 '16 at 18:44

2 Answers


Update:

The above issue can be avoided by setting the driver's IP address (i.e., the local laptop's externally reachable IP) within the application code. This can be done by adding the following line to the SparkConf:

.set("spark.driver.host",YourSystemIPAddress)

However, there can be an issue if the driver's IP address is behind NAT; in that case the workers will not be able to reach the driver at that IP.
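For context, here is a minimal sketch of how that setting fits into the SparkConf from the question. The IP address and the use of spark.driver.port are placeholder assumptions for illustration, not values from the original post.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical values; replace with your laptop's reachable IP and your real master URL.
val driverIp = "192.168.1.5"

val conf = new SparkConf()
  .setMaster("spark://hostname:7077")
  .setAppName("myapp")
  .set("spark.driver.host", driverIp)  // address the executors use to call back to the driver
  .set("spark.driver.port", "51000")   // optional: pin the driver port so it can be opened in a firewall

val sc = new SparkContext(conf)

Pinning spark.driver.port is only useful if you need to open a specific port on the laptop's firewall; otherwise Spark picks a random port.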

Spark User

When you say "spark notebook" I am assuming you mean the github project https://github.com/andypetrella/spark-notebook?

I would have to look into the specifics of the notebook, but I notice your worker is trying to connect back to the driver at "localhost".

For normal Spark configuration, set SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh on the worker and see if that helps. Even if you are running on a single machine in standalone mode, set this. In my experience Spark doesn't always resolve hostnames properly, so starting from a baseline of all IPs is a good idea.
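As a rough sketch of that change (the IP below is a placeholder for illustration, not an address from the question), conf/spark-env.sh on each node would contain something like:

# $SPARK_HOME/conf/spark-env.sh -- placeholder IP for illustration
export SPARK_MASTER_IP=192.168.1.10   # the master's IP address, not a hostname
# leave SPARK_LOCAL_IP unset, per the baseline suggested below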

The rest is general info; see if it helps with your specific issue:

If you are submitting to a cluster from your laptop, you can use --deploy-mode cluster to tell your driver to run on one of the worker nodes. This creates an extra consideration for how you set up your classpath, because you don't know in advance which worker the driver will run on.
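For reference, a submission along those lines might look like this (the master IP, class name, and jar path are placeholders, not values from the question):

spark-submit \
  --master spark://192.168.1.10:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/myapp.jar

Note that in standalone cluster mode the jar has to be reachable from the worker that ends up running the driver (e.g., on HDFS or copied to every node).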

Here's some general info in the interest of completeness: there is a known Spark bug around hostnames resolving to IP addresses. I am not presenting this as the complete answer in all cases, but I suggest starting from a baseline of using only IP addresses and only the single config SPARK_MASTER_IP. With just those two practices my clusters work; all the other configs, or using hostnames, just seem to muck things up.

So in your spark-env.sh, get rid of SPARK_LOCAL_IP and set SPARK_MASTER_IP to an IP address, not a hostname.

I have treated this at greater length in this answer.

For completeness, here's part of that answer:

Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you do password-less ssh to the worker from the master box? Per the 1.5.2 docs, you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.

You can get into a situation where the worker can contact the master but the master can't get back to the worker, so it looks like no connection is being made. Check both directions. I think problems with the slaves file on the master node, or with password-less ssh, can lead to errors similar to what you are seeing.
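A quick way to sanity-check both directions (the worker IP below is hypothetical; adjust to your cluster):

# run these from the master box (192.168.1.21 is a placeholder worker IP)
ping -c 3 192.168.1.21            # basic reachability, master -> worker
ssh 192.168.1.21 hostname         # password-less ssh must succeed for sbin/start-slaves.sh
cat $SPARK_HOME/conf/slaves       # each worker should be listed here, one per line
# then repeat the ping/ssh checks from a worker back toward the master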

Per the answer I crosslinked, there's also an old bug but it's not clear how that bug was resolved.

JimLohse
  • Hi Jim, thanks for the answer and sorry for coming back late with my reply. As I mentioned in my post, I can submit my applications using spark-submit from the command line and they work successfully. The only problem comes when I try to submit using any notebook (Zeppelin or spark-notebook). It seems like there is a problem in communication with the web browser. (?) – Spark User Jan 18 '16 at 17:59
  • Hi @SparkUser, maybe I am missing something but it still sounds like you are dealing with two variables here. Case 1) When you submit from notebook running on your laptop to a remote cluster, you get the error. Case 2) When you log into the cluster directly (in a terminal using ssh I assume?) and *you don't use notebook,* it works. Have I got that right? So correct me if I am wrong, please, it seems you have variable 1) Notebook and variable 2) client vs. cluster deploy modes. I still see localhost in your initial error and I suspect that's the problem. – JimLohse Jan 18 '16 at 18:13
  • So until I see an actual IP address instead of localhost, I would suspect that. It would be nice if you could edit your question to include the simple code you are trying, and mention which OS. If it's Ubuntu/Debian there are interesting issues in the way /etc/hosts is set up, so it gets into your config; does /etc/nsswitch.conf look at files first or dns? Please clarify: is your call to master where it says IP really the IP, and SPARK_MASTER_IP is an ip address on the cluster's conf/spark-defaults.conf. Willing to admit maybe barking up the wrong tree, but in my experience these are IP issues. – JimLohse Jan 18 '16 at 18:17
  • A correction on case 2): I don't log into the cluster; I submit the Spark application directly from the command line of my laptop (instead of using the notebook), using the spark-submit command from the home folder of the project, and this works without any issues. So it seems that the workers have no issues sending the results back to my laptop. Should we say that there is only one variable now, and that is the notebook :-)? – Spark User Jan 19 '16 at 15:15
  • Yeah sorry misunderstood when you said "The same application runs successfully on the remote standalone cluster when i submit it via terminal using spark-submit" I thought you meant via ssh. Beats me then, please post whatever fixes this, perplexing, sorry. You could edit your question to include info about which OS and then I could eliminate the Debian/Spark IP vs. localhost vs. hostnames not resolving properly. I would still suggest you widen the attention you get by adding python to your tags on your question. thanks for your patience! – JimLohse Jan 19 '16 at 15:28
  • Also the specific command line you use with spark submit, your SparkConf and your Master setting inside Notebook would be good information :) But I think you are above my paygrade haha – JimLohse Jan 19 '16 at 15:31
  • The Spark context for the command line and the notebook is the same, except that with spark-submit on the command line I also pass the project jar file (that we get after packaging via sbt) in the arguments. BTW I use Mac OS 10.11.1 on my laptop. – Spark User Jan 19 '16 at 15:35