This is an odd one. I have an ECS service using Fargate platform version 1.4 in a private subnet. Since the tasks don't have access to the Internet, I had to configure VPC endpoints so that tasks could load what they needed from AWS services (e.g. secrets from SSM, the image from ECR, etc.). This was all well and good and worked just fine, until it didn't. I'm not sure what changed, but one weekend I noticed my servers weren't running anymore and saw this error in the console:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secrets from ssm: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://ssm.us-ea...

That looked familiar from when I was configuring the VPC endpoints, so I went through the console to make sure nothing changed. As far as I can tell, the configuration looks right (security groups have the proper ingress/egress rules, proper endpoints are configured and connected to the VPC my servers are in, everything is in the same AZ, IAM roles have access to the secret).
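
For reference, the execution role is set up along these lines (a Pulumi TypeScript sketch; the parameter ARN and account ID are hypothetical placeholders, not my actual values):

```typescript
import * as aws from "@pulumi/aws";

// Execution role that ECS assumes to pull the image and fetch secrets.
const executionRole = new aws.iam.Role("task-execution-role", {
    assumeRolePolicy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "ecs-tasks.amazonaws.com" },
            Action: "sts:AssumeRole",
        }],
    }),
});

// Managed policy covering ECR pulls and CloudWatch Logs.
new aws.iam.RolePolicyAttachment("execution-role-managed", {
    role: executionRole.name,
    policyArn:
        "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy",
});

// Inline policy granting access to the SSM parameters used as secrets.
new aws.iam.RolePolicy("execution-role-secrets", {
    role: executionRole.id,
    policy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Action: ["ssm:GetParameters"],
            Resource: ["arn:aws:ssm:us-east-1:123456789012:parameter/my-app/*"],
        }],
    }),
});
```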

As an experiment, I removed the secrets I was trying to load from the task definition to see what would happen. When a new server spun up, I saw a similar error, but this time for loading the image from ECR:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr....

I also tried to delete and recreate all of the endpoints, just in case, and still no success.
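
For reference, the endpoint setup looks roughly like this Pulumi (TypeScript) sketch; the resource IDs are hypothetical placeholders:

```typescript
import * as aws from "@pulumi/aws";

// Hypothetical placeholder IDs standing in for the real VPC resources.
const vpcId = "vpc-0123456789abcdef0";
const privateSubnetIds = ["subnet-0123456789abcdef0"];
const privateRouteTableId = "rtb-0123456789abcdef0";
const endpointSgId = "sg-0123456789abcdef0";
const region = "us-east-1";

// Fargate 1.4 pulls images through ecr.api/ecr.dkr and secrets through ssm,
// so each of these needs an interface endpoint in the private subnets.
for (const svc of ["ssm", "ecr.api", "ecr.dkr"]) {
    new aws.ec2.VpcEndpoint(`${svc.replace(".", "-")}-endpoint`, {
        vpcId: vpcId,
        serviceName: `com.amazonaws.${region}.${svc}`,
        vpcEndpointType: "Interface",
        privateDnsEnabled: true,
        subnetIds: privateSubnetIds,
        securityGroupIds: [endpointSgId],
    });
}

// Image layers come from S3, which uses a gateway endpoint attached to the
// route table rather than an ENI in the subnets.
new aws.ec2.VpcEndpoint("s3-endpoint", {
    vpcId: vpcId,
    serviceName: `com.amazonaws.${region}.s3`,
    vpcEndpointType: "Gateway",
    routeTableIds: [privateRouteTableId],
});
```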

Other (potentially) useful information:

  • Region: us-east-1
  • I'm using the latest version of Pulumi
  • I'm using Application Auto Scaling to spin down the instances during the week

Any help/tips would be appreciated.

c1moore
  • No. Basically both SGs allow 443 ingress and all egress. – c1moore May 31 '20 at 01:24
  • Tasks are still in private subnets? They didn't relaunch in public ones? – Marcin May 31 '20 at 01:25
  • No, they're still in a private subnet. – c1moore May 31 '20 at 01:28
  • DNSHostnames and DNSSupport still enabled for the vpc? – Marcin May 31 '20 at 01:37
  • Yes, both are enabled. – c1moore May 31 '20 at 01:38
  • ECR also requires an S3 gateway endpoint. I guess it's also fine? – Marcin May 31 '20 at 01:39
  • Yes, the s3 endpoint is also configured (and the tasks have permission to pull from starport). – c1moore May 31 '20 at 01:48
  • Maybe we have to go back to basics. If you spin up an instance in the same subnet as your tasks and try to use the aws cli to get the ssm parameter or access ecr, does that also not work? Just want to check if the issue is limited to ecs, or it's at the vpc level. – Marcin May 31 '20 at 01:51
  • I'll try to get some time to set up a bastion tomorrow. – c1moore May 31 '20 at 03:18
  • I finally got a chance to spin up a bastion and test instance. The gateway endpoint to S3 works, but the interface endpoint to SSM is just hanging, so still looks like it's an issue with the endpoints. – c1moore May 31 '20 at 21:32
  • I ran `nslookup ssm.us-east-1.amazonaws.com` and the non-authoritative answer provided the IP address of the ENI associated with the VPC endpoint, so at least it's pointing to the right place. @Marcin (in case you haven't seen these comments yet) – c1moore Jun 01 '20 at 00:29
  • At least you know now where to focus on troubleshooting. Wonder what could have changed with ssm interface endpoint that it does not work? Its policy, SG? – Marcin Jun 01 '20 at 00:34
  • I keep going back to the policy, SG, and NACLs and nothing looks like it would restrict access. NACLs allow everything, SG allows 443 inbound from the subnet's cidr and all outbound. I enabled the logs for the ENI and nothing is showing up. I also turned on notifications for the vpc endpoint and it's not triggering. This is an interesting one for sure. – c1moore Jun 01 '20 at 00:43
  • In the aws cli you can manually specify custom interface endpoint for ssm, as shown for example [here](https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html#access-service-though-endpoint). If you use the endpoint dns, instead of standard one, do you observe any difference? – Marcin Jun 01 '20 at 00:46
  • Yes, I tried to use the IP address (instead of the custom domain) and got the same result. – c1moore Jun 01 '20 at 00:53
  • I checked docs for [ssm vpc endpoints](https://docs.aws.amazon.com/systems-manager/latest/userguide/setup-create-vpc.html#sysman-setting-up-vpc-create) and they create more than 1. Maybe you also need some of the other ones as well? – Marcin Jun 01 '20 at 00:56
  • I have all of those (I don't think I need the ec2 ones, but just to be safe). – c1moore Jun 01 '20 at 01:10
  • I think I figured it out. I guess it worked just by luck before. The CIDR range on the SG did not include the full CIDR range available in the subnet, so the newer instances that were being spun up were outside of the allowed range. Thanks for your help @Marcin, much appreciated. – c1moore Jun 01 '20 at 01:16
  • Nice. So the problem was SG after all :-) – Marcin Jun 01 '20 at 01:17
  • Can I provide the answer, or you would prepare to do it yourself for future reference? – Marcin Jun 01 '20 at 01:20
  • Sure, I plan on keeping this around for anybody who stumbles across it; our debugging steps will probably be helpful for somebody else (there's not a ton out there around VPC endpoints). – c1moore Jun 01 '20 at 01:27

3 Answers

Based on the discussion in the comments, the cause of the issue was determined to be an incorrect CIDR range on the security groups (SGs) for the SSM VPC interface endpoint: the SG's ingress rule did not cover the full CIDR range of the subnet, so newer tasks launched outside the allowed range.
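
One way to avoid this class of problem is to derive the allowed ingress range from the VPC itself rather than hardcoding a CIDR. A minimal Pulumi (TypeScript) sketch, assuming the VPC ID is known (the ID below is a placeholder):

```typescript
import * as aws from "@pulumi/aws";

// Look up the VPC so the SG rule can reference its actual CIDR block.
const vpc = aws.ec2.getVpcOutput({ id: "vpc-0123456789abcdef0" });

const endpointSg = new aws.ec2.SecurityGroup("endpoint-sg", {
    vpcId: vpc.id,
    ingress: [{
        protocol: "tcp",
        fromPort: 443,
        toPort: 443,
        // Using the VPC's own CIDR guarantees every subnet (and every task
        // ENI) is covered, avoiding the partial-range bug described above.
        cidrBlocks: [vpc.cidrBlock],
    }],
    egress: [{
        protocol: "-1",
        fromPort: 0,
        toPort: 0,
        cidrBlocks: ["0.0.0.0/0"],
    }],
});
```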

General troubleshooting recommendations for this issue are:

  • Check the ingress rules on the SGs for the VPC interface endpoint (port 443 must be open to the full CIDR range of the subnets the tasks run in).
  • Ensure that the S3 gateway endpoint is also available and working, as it is required by ECR to pull image layers.
  • Check that enableDnsHostnames and enableDnsSupport are enabled for the VPC.
  • Create an instance in the same subnet as the ECS service. Use the instance (after setting up its role with permissions to SSM) to check connectivity to the SSM interface endpoint. The aim of this is to verify whether the issue is at the VPC level or at the ECS level.
  • On the instance, the AWS CLI or an SDK can be used to call the SSM endpoint using either the endpoint-specific DNS name or the standard regional one; see the sketch below.
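
As a concrete version of that last check, here is a minimal sketch using the AWS SDK for JavaScript v3 in TypeScript; the endpoint DNS name and parameter name are hypothetical placeholders (the same test works with `aws ssm get-parameter --endpoint-url ...`):

```typescript
import { SSMClient, GetParameterCommand } from "@aws-sdk/client-ssm";

// Both the endpoint DNS name and the parameter name below are placeholders;
// substitute the values shown in your own VPC endpoint console.
const client = new SSMClient({
    region: "us-east-1",
    // Point directly at the interface endpoint instead of the standard
    // regional endpoint to separate DNS problems from connectivity problems.
    endpoint:
        "https://vpce-0123456789abcdef0-abcdefgh.ssm.us-east-1.vpce.amazonaws.com",
});

client
    .send(new GetParameterCommand({ Name: "/my-app/example-param" }))
    .then((result) => console.log(result.Parameter?.Value))
    .catch((err) => {
        // A timeout here (rather than an AccessDenied error) points at the
        // endpoint's security group or NACLs; AccessDenied points at IAM.
        console.error(err);
    });
```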

Marcin

Having Auto-assign public IP disabled when creating a Fargate task can cause this error too, so you need to enable Auto-assign public IP to make things work.

"If you're running a task using the Fargate launch type in a public subnet, then choose ENABLED for Auto-assign public IP when you launch the task. This allows your task to have outbound network access to pull an image." (Source)

I don't know the details, but I hope this helps anyone who comes here from a search engine.
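
As a rough illustration of what that setting corresponds to in infrastructure code, here's a minimal Pulumi (TypeScript) sketch; the cluster ARN, task definition, subnet, and SG values are hypothetical placeholders:

```typescript
import * as aws from "@pulumi/aws";

// Hypothetical placeholder identifiers; networkConfiguration is the point here.
const service = new aws.ecs.Service("app-service", {
    cluster: "arn:aws:ecs:us-east-1:123456789012:cluster/app",
    taskDefinition: "app-task:1",
    desiredCount: 1,
    launchType: "FARGATE",
    networkConfiguration: {
        subnets: ["subnet-0123456789abcdef0"], // public subnets
        // Without a public IP (or NAT / VPC endpoints), the task has no
        // route to ECR and the image pull fails with this same error.
        assignPublicIp: true,
        securityGroups: ["sg-0123456789abcdef0"],
    },
});
```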

KhoaHV
  • There are probably some security ramifications to this. – Nate Symer Mar 05 '21 at 20:31
  • I've been with an AWS support agent for half a day and this turned out to be the solution. It seems obscure, undocumented, and hard to understand why this option is required, since if you disable it Fargate doesn't work at all. I'll have to ask for further clarification. – Khoa Apr 13 '21 at 07:22

I had a similar error, which I fixed by adding an Internet Gateway to my VPC. I'm unsure if there was another way of fixing it.
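
For completeness, that fix in Pulumi (TypeScript) would look roughly like the sketch below (placeholder IDs); note the subnet must also assign public IPs, per the previous answer:

```typescript
import * as aws from "@pulumi/aws";

// Hypothetical placeholder IDs for the existing VPC and route table.
const igw = new aws.ec2.InternetGateway("igw", {
    vpcId: "vpc-0123456789abcdef0",
});

// A default route through the gateway gives the subnet a path to the Internet.
new aws.ec2.Route("default-route", {
    routeTableId: "rtb-0123456789abcdef0",
    destinationCidrBlock: "0.0.0.0/0",
    gatewayId: igw.id,
});
```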

Daniel Silva