
I'm trying to launch a standalone Spark cluster using its pre-packaged EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k <key-pair> -i <identity-file>.pem -r us-west-2 -s 3 launch test
Setting up security groups...
Searching for existing cluster test...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 3 slaves in us-west-2c, regid = r-b_______6
Launched master in us-west-2c, regid = r-0______0
Waiting for all instances in cluster to enter 'ssh-ready' state..........

Yet I can SSH into these instances without complaint:

ubuntu@machine:~$ ssh -i <identity-file>.pem root@master-ip
Last login: Day MMM DD HH:mm:ss 20YY from c-AA-BBB-CCCC-DDD.eee1.ff.provider.net

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are 59 security update(s) out of 257 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2014.09 is available.
[root@ip-internal ~]$

I'm trying to figure out whether this is a problem with AWS or with the Spark scripts. I never had this issue until recently.

Greg Dubicki
nmurthy
  • 1. Where are you SSH-ing into the cluster from? 2. Where are you launching the cluster from? 3. Are you sure all the nodes in the cluster are accessible by SSH? 4. Does this happen consistently? – Nick Chammas Jan 17 '15 at 18:39

4 Answers


Spark 1.3.0+

This issue is fixed in Spark 1.3.0.


Spark 1.2.0

Your problem is caused by SSH silently failing because of conflicting entries in your SSH known_hosts file.

To resolve your issue, add -o UserKnownHostsFile=/dev/null to the SSH options in your spark_ec2.py script like this.
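As a sketch of where the option goes, assuming the script builds its SSH flags in a helper like Spark 1.2.0's ssh_args() (the exact function name and structure may differ between versions):

```python
# Hypothetical excerpt modeled on spark_ec2.py's SSH-argument helper.
# Pointing UserKnownHostsFile at /dev/null stops SSH from consulting
# (and silently failing on) stale known_hosts entries.
def ssh_args(opts):
    parts = ['-o', 'StrictHostKeyChecking=no',
             '-o', 'UserKnownHostsFile=/dev/null']  # <-- the added option
    if opts.identity_file is not None:
        parts += ['-i', opts.identity_file]
    return parts
```

With this change, every SSH invocation the script makes ignores known_hosts entirely, so a conflicting fingerprint from a recycled EC2 IP can no longer make the connection hang.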


Optionally, to clean up and avoid problems connecting to your cluster over SSH later on, I recommend that you:

  1. Remove all the lines from ~/.ssh/known_hosts that include EC2 hosts, for example:

ec2-54-154-27-180.eu-west-1.compute.amazonaws.com,54.154.27.180 ssh-rsa (...)

  2. Use this solution to stop checking and storing the fingerprints of the temporary IPs of your EC2 instances altogether
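The cleanup in step 1 can be scripted; a minimal sketch that drops every known_hosts line mentioning an EC2 hostname (it matches on compute.amazonaws.com, as in the example entry above; adjust the pattern or path for your setup):

```python
import os

def strip_ec2_hosts(path=os.path.expanduser('~/.ssh/known_hosts')):
    """Remove known_hosts entries for EC2 hostnames.

    Returns the number of lines dropped.
    """
    with open(path) as f:
        lines = f.readlines()
    kept = [line for line in lines if 'compute.amazonaws.com' not in line]
    with open(path, 'w') as f:
        f.writelines(kept)
    return len(lines) - len(kept)
```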
Greg Dubicki
  • I did not need to remove all of the known AWS hosts as setting the `UserKnownHostsFile` to `/dev/null` is enough to correct the problem where the ssh process fails silently and appears to hang. – cfeduke Jan 23 '15 at 21:34
  • @cfeduke thanks, I edited the answer to separate necessary and optional steps (and more :). – Greg Dubicki Jan 25 '15 at 12:43
  • 1
    I opened an issue in Spark's JIRA & a PR with my change: https://issues.apache.org/jira/browse/SPARK-5403. Please vote on it if you're affected! – Greg Dubicki Jan 25 '15 at 13:28
  • I followed all the steps, waited for 2+ hours and bang! cluster started. A lot of patience needed. – pcv Jan 27 '15 at 23:27
  • @pcv I am glad it works for you. :) But it's strange that so slow. For me it takes at most 10 minutes for a cluster with 10 quite big slaves. What kind of cluster are you launching? – Greg Dubicki Jan 28 '15 at 14:30
  • @GrzegorzDubicki surprisingly just a master and one slave for playing around. I agree it's strange. – pcv Jan 29 '15 at 02:27
  • Thanks @GrzegorzDubicki. Issue resolved in pull request 4196 (https://github.com/apache/spark/pull/4196), fixed in Spark 1.3.0 – nmurthy Mar 10 '15 at 04:52
  • I get this same issue even with the change (its now been pulled into the spark-ec2 scripts). Any other ideas folks? – Bob Apr 26 '15 at 01:12
  • I have the same issue, but I'm using 1.3.1. `Warning: SSH connection error. (This could be temporary.) Host: ec2-[deleted for privacy] SSH return code: 255 SSH output: ssh: connect to host ec2-[deleted for privacy] port 22: Connection refused . Cluster is now in 'ssh-ready' state. Waited 486 seconds.` – Frank B. Sep 09 '15 at 13:12
  • 1
    I think that it's another problem, @Frank B. See if http://stackoverflow.com/a/14885975/2693875 helps. – Greg Dubicki Sep 10 '15 at 08:17

I had the same problem and followed all the steps mentioned in this thread (mainly adding -o UserKnownHostsFile=/dev/null to the spark_ec2.py script), but it still hung at

Waiting for all instances in cluster to enter 'ssh-ready' state

Short answer:

Change the permissions of the private key file and rerun the spark-ec2 script:

[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% chmod 0400 /tmp/mykey.pem

Long Answer:

To troubleshoot, I modified spark_ec2.py to log the SSH command it used and tried executing that command at a prompt myself. The problem turned out to be bad permissions on the key:

[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/mykey.pem -o ConnectTimeout=3 root@52.1.208.72
Warning: Permanently added '52.1.208.72' (RSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for '/tmp/mykey.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /tmp/mykey.pem
Permission denied (publickey).
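This check can also be automated before launching. A sketch (hypothetical helper, not part of spark_ec2.py) that flags a key file readable by group or others, which SSH rejects as shown above:

```python
import os
import stat

def key_permissions_ok(path):
    """True if the private key is accessible only by its owner."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0

def fix_key_permissions(path):
    """Equivalent of `chmod 0400 mykey.pem`: owner read-only."""
    os.chmod(path, stat.S_IRUSR)
```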
spar128

I just ran into the exact same situation. I went into the Python script at def is_ssh_available() and had it dump out the return code and command.

except subprocess.CalledProcessError, e:
    print "CalledProcessError "
    print e.returncode
    print e.cmd

I had the key file location as ~/.pzkeys/mykey.pem. As an experiment, I changed it to the fully qualified path, i.e. /home/pete.zybrick/.pzkeys/mykey.pem, and that worked fine.
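That behavior is consistent with the script passing the -i path straight to a subprocess, where no shell is around to expand ~. A sketch of normalizing the identity-file path up front (a hypothetical helper, not taken from spark_ec2.py):

```python
import os

def normalize_identity_file(path):
    """Expand ~ and make the identity-file path absolute, since
    subprocess invocations do not perform shell tilde expansion."""
    return os.path.abspath(os.path.expanduser(path))
```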

Right after that, I ran into another error: I tried to use --user=ec2-user (I try to avoid using root) and got a permission error on rsync, so I removed --user=ec2-user to fall back to the default of root, made another attempt with --resume, and it ran to successful completion.

Greg Dubicki

I used the absolute (not relative) path to my identity file (inspired by Peter Zybrick) and did everything Grzegorz Dubicki suggested. Thank you.

000
nmurthy