We are using Amazon Elastic Container Service (ECS) to spin up a cluster with autoscaling groups. Until very recently this worked fine, and for the most part it still does, except that we can no longer connect to the underlying EC2 instances over SSH with our keypair. We get SSH "permission denied" errors. This started only within the last few weeks, and we have changed nothing. By contrast, we can spin up an EC2 instance directly and SSH into it with the same keypair without any problem.
What I have done to investigate:
- Drained the ECS cluster, detached the instance from it, and stopped it.
- Detached the instance's root volume and attached it to a different EC2 instance.
- Observed that /home/ec2-user/.ssh does not exist.
- Found the following error in the instance's /var/log/cloud-init.log:
Oct 30 23:23:09 cloud-init[3195]: handlers.py[DEBUG]: start: init-network/config-ssh: running config-ssh with frequency once-per-instance
Oct 30 23:23:09 cloud-init[3195]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-0e13e9da194d2624a/sem/config_ssh - wb: [644] 20 bytes
Oct 30 23:23:09 cloud-init[3195]: helpers.py[DEBUG]: Running config-ssh using lock (<FileLock using file '/var/lib/cloud/instances/i-0e13e9da194d2624a/sem/config_ssh'>)
Oct 30 23:23:09 cloud-init[3195]: util.py[WARNING]: Applying ssh credentials failed!
Oct 30 23:23:09 cloud-init[3195]: util.py[DEBUG]: Applying ssh credentials failed!
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh.py", line 184, in handle
ssh_util.DISABLE_USER_OPTS)
AttributeError: 'module' object has no attribute 'DISABLE_USER_OPTS'
Oct 30 23:23:09 cloud-init[3195]: handlers.py[DEBUG]: finish: init-network/config-ssh: SUCCESS: config-ssh ran successfully
- Examined the Python source code for /usr/lib/python2.7/site-packages/cloudinit. It looks OK to me; I see the reference in config/cc_ssh.py to ssh_util.DISABLE_USER_OPTS, and it looks like ssh_util.py does indeed contain DISABLE_USER_OPTS as a file-level variable. (But I am not a master Python programmer, so I might be missing something subtle.)
- Curiously, the compiled versions of ssh_util.py and cc_ssh.py date from October 16, which raises all sorts of red flags, because we had not seen any problems with SSH until recently. But I loaded uncompyle6 and decompiled those files, and the decompiled versions seem to be OK, too.
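To take my own Python reading skills out of the equation, the check on ssh_util.py could be done mechanically. The following is only a minimal sketch: the helper names are mine, not part of cloud-init, and it assumes the file parses under whichever Python interpreter runs the script (a Python 2-only source file would not parse under Python 3). It only looks at simple `NAME = ...` assignments at the top level of the file.

```python
import ast


def module_level_names(source):
    """Return the names bound by simple top-level assignments in `source`."""
    names = set()
    for node in ast.parse(source).body:
        # Only plain `NAME = value` statements at module level are considered;
        # assignments inside functions or classes are deliberately ignored.
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.add(target.id)
    return names


def defines(path, name):
    """True if the file at `path` assigns `name` at module level."""
    with open(path) as fh:
        return name in module_level_names(fh.read())
```

Called as `defines(".../cloudinit/ssh_util.py", "DISABLE_USER_OPTS")` against the copy on the mounted volume, this would confirm (or refute) what I think I see in the source. It says nothing about which copy of the module the interpreter actually imported at boot.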
Looking at cloud-init, it's pretty clear that if the reference to ssh_util.DISABLE_USER_OPTS throws an exception, the .ssh directory won't be configured for ec2-user, so I understand what's happening.
What I don't understand is why? Has anyone else experienced issues with cloud-init with recently-created EC2 instances under ECS, and found a workaround?
For reference, we are using AMI amzn2-ami-ecs-hvm-2.0.20190815-x86_64-ebs (ami-0b16d80945b1a9c7d) in us-east-1, and we certainly have not seen these issues as far back as August 15. I assume that some cloud-init change that the instance picks up via a yum update explains both the new behavior and the changed write dates of the compiled Python modules in cloud-init.
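The write dates mentioned above can be collected in one pass rather than by inspecting files one at a time. This is a rough sketch under the assumption of the Python 2 layout, where each `foo.py` has its compiled `foo.pyc` next to it; the function names are hypothetical and the root directory would be the cloudinit package directory on the mounted volume.

```python
import os


def compiled_file_report(root):
    """List (source_path, source_mtime, pyc_mtime) for every .py under root.

    pyc_mtime is None when there is no sibling .pyc file.
    """
    rows = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(".py"):
                continue
            src = os.path.join(dirpath, name)
            pyc = src + "c"  # Python 2 keeps foo.pyc beside foo.py
            pyc_mtime = os.path.getmtime(pyc) if os.path.exists(pyc) else None
            rows.append((src, os.path.getmtime(src), pyc_mtime))
    return rows


def newer_compiled(rows):
    """Rows whose .pyc was written after its .py source."""
    return [row for row in rows if row[2] is not None and row[2] > row[1]]
```

A .pyc that is much newer than its source is not proof of anything by itself, but a cluster of modules all recompiled on the same date would be consistent with a package update on that date.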
I should also add that the EC2 instance I spun up to mount the root volume of the ECS-created instance has subtly different cloud-init code. In particular, its cc_ssh.py module doesn't refer to ssh_util.DISABLE_USER_OPTS but rather to a local DISABLE_ROOT_OPTS variable. So this is all suspicious.
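For anyone who wants to compare the two cc_ssh.py versions precisely, a unified diff of the two files is probably the quickest route. A minimal sketch using the standard library's difflib follows; `source_diff` and its label arguments are names I made up for illustration, and the two texts would be read from the helper instance's own copy and from the copy on the mounted volume.

```python
import difflib


def source_diff(old_text, new_text, old_label="old", new_label="new"):
    """Return unified-diff lines between two source texts (empty if identical)."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",  # inputs were split without newlines, so emit none
        )
    )
```

Lines starting with `-` appear only in the first text and lines starting with `+` only in the second, which should make the switch from DISABLE_ROOT_OPTS to ssh_util.DISABLE_USER_OPTS easy to spot.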