
I have a recurring problem: I start an AWS EMR cluster, log in via SSH, and run spark-shell to test some Spark code. Sometimes I lose my internet connection and PuTTY reports that the connection was lost.

But it seems the Spark-related processes are still running. When I reconnect to the server and run spark-shell again, I get a lot of these errors:

17/02/07 11:15:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1486465722770_0002_01_000003 on host: ip-172-31-0-217.eu-west-1.compute.internal. Exit status: 1. Diagnostics: Exception from container-launch.

Googling this error suggests problems with the allocated memory, but since I am using small nodes on a test cluster, I don't even want to allocate more memory; I just want to release the resources in use and restart spark-shell. However, I don't see any "Spark" processes running.

How can I fix this easily? Is there some other process I should try closing or restarting, such as Hadoop, MapReduce, or YARN? I wouldn't want to start a new cluster every time this happens.

V. Samma

1 Answer


You can use the YARN CLI for that. After SSH-ing to the master node, run this:

yarn application -list

to see if there are applications running. If there are, you can use this command to kill them:

yarn application -kill <application id>
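
If several stuck applications pile up, you can kill them all in one go. Here is a minimal sketch, assuming the user you SSH in as (hadoop on EMR by default) is allowed to kill the applications:

# List every application in the RUNNING state; data rows start with the
# application ID, so filter on that prefix and kill each one.
for app in $(yarn application -list -appStates RUNNING | awk '/^application_/ {print $1}'); do
  yarn application -kill "$app"
done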

You can also use the ResourceManager web UI (typically served on port 8088) for the same thing; it is available as a link on the cluster's page in the EMR console.

By the way, you can use Zeppelin to run the same things you run in spark-shell without worrying about disconnecting. It is available on EMR (you need to select it as one of the applications when setting up a cluster).

It takes some time to learn how to use and configure it properly, but it might help you.

Tal Joffe
  • Well, yes, there were active applications. I tried the kill command and also killed them from the ResourceManager. I also made sure I killed all Spark processes, and stopped and started the ResourceManager again with this command: `sudo /sbin/stop hadoop-yarn-resourcemanager`. But I still got the "Container marked as failed" error. – V. Samma Feb 08 '17 at 09:00
  • So I guess I did not fully understand your issue. Are you saying that there are Spark applications running on the cluster or not? BTW, I am not sure you should stop and start the ResourceManager; if you ran a kill, you can run `yarn application -list` again to make sure they were killed. – Tal Joffe Feb 08 '17 at 09:46
  • Yes, they were running and I killed them. I checked the list again and none were running. I also made sure all Spark processes were closed. But still, retrying `spark-shell` threw the exceptions mentioned above. Restarting the ResourceManager was a suggested solution I found when googling the problem, back when starting spark-shell kept endlessly repeating this message: `INFO Client: Application report for application_1462362812913_0001 (state: ACCEPTED)`. – V. Samma Feb 08 '17 at 10:05
  • You can simply try running `ps -ef | grep spark` and killing the processes you find (you may have already done this; it is just the only thing I could think of; there is a sketch of this below the comments). Did you try using Zeppelin? – Tal Joffe Feb 08 '17 at 11:22
  • Yeah, that was what I used to kill the processes. I didn't use Zeppelin yet, but I will try it soon. I would still think there should be a way to overcome this problem somehow though. – V. Samma Feb 08 '17 at 11:29
  • Might be, but I don't know it :). Did you initially kill them with `kill` or with `yarn application -kill`? Because killing the processes directly might have caused issues... – Tal Joffe Feb 08 '17 at 11:33
  • Which way? I tried `yarn application -kill` and then tried to run spark-shell again. It didn't work, so I tried to kill this new application from the ResourceManager. Then, just in case, I restarted the ResourceManager and made sure all the Spark processes were killed as well. Still the same problem. – V. Samma Feb 08 '17 at 15:15
  • Oh, OK. Then sorry, I have nothing smart to add beyond what I said. – Tal Joffe Feb 08 '17 at 15:19
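
For reference, the process cleanup discussed in the comments can be done in one pipeline. A minimal sketch, assuming every leftover process you want gone has "spark" somewhere in its command line (adjust the pattern if your setup differs):

# Find processes mentioning "spark" (case-insensitive), drop the grep itself,
# take the PID column, and send SIGTERM; `xargs -r` is a no-op if nothing matches.
ps -ef | grep -i spark | grep -v grep | awk '{print $2}' | xargs -r kill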