This is an odd one. I have an ECS service running on Fargate (platform version 1.4) in a private subnet. Since the tasks don't have Internet access, I had to configure VPC endpoints so that they could load what they need from AWS services (e.g. secrets from SSM, the image from ECR, etc.). This was all well and good and worked just fine, until it didn't. I'm not sure what changed, but one weekend I noticed my servers weren't running anymore, and I found this error in the console:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secrets from ssm: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://ssm.us-ea...
That looked familiar from when I was first configuring the VPC endpoints, so I went through the console to make sure nothing had changed. As far as I can tell, the configuration looks right: the security groups have the proper ingress/egress rules, the proper endpoints are configured and attached to the VPC my servers are in, everything is in the same AZ, and the IAM roles have access to the secret.
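One way to double-check the points above beyond the console is to test name resolution and TCP reachability on port 443 from something in the same subnet and security group as the tasks. The following is only a minimal sketch under assumptions: the hostnames are the standard regional API names (the error messages above are truncated, so the exact URLs are not confirmed), and the helper names `endpoint_hosts`, `all_private`, and `check_endpoint` are made up for illustration. With private DNS enabled on an interface endpoint, the hostname should resolve to private addresses inside the VPC.

```python
import ipaddress
import socket


def endpoint_hosts(region):
    """Regional API hostnames for SSM and the ECR API (assumed standard naming)."""
    return {
        "ssm": f"ssm.{region}.amazonaws.com",
        "ecr.api": f"api.ecr.{region}.amazonaws.com",
    }


def all_private(ips):
    """True only if there is at least one address and every one is private."""
    return bool(ips) and all(ipaddress.ip_address(ip).is_private for ip in ips)


def check_endpoint(host, port=443, timeout=3.0):
    """Return (sorted resolved IPs, whether a TCP connection to host:port succeeded)."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # DNS resolution failed, so there is nothing to connect to.
        return [], False
    ips = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return ips, True
    except OSError:
        # Resolved, but the connection was refused, filtered, or timed out.
        return ips, False
```

Calling `check_endpoint(host)` for each value of `endpoint_hosts("us-east-1")` from, say, a temporary instance in the same subnet would show whether the names resolve to private addresses (`all_private(ips)`) and whether port 443 is reachable. Public addresses or a failed connection would point at DNS settings or security group rules rather than IAM.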
As an experiment, I removed the secrets I was trying to load from the task definition to see what would happen. When a new server spun up, I saw a similar error, but this time for loading the image from ECR:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr....
I also tried deleting and recreating all of the endpoints, just in case, but still no luck.
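For reference when comparing against the recreated endpoints, here is a small sketch of the endpoint service names a setup like this would be expected to need. As far as I understand it, Fargate platform version 1.4 pulls images through both the ECR API and ECR Docker interface endpoints plus an S3 gateway endpoint for the image layers. The function name `required_endpoint_services` and its flags are hypothetical, and whether the `logs` or `secretsmanager` endpoints apply depends on the task definition, which is an assumption here.

```python
def required_endpoint_services(region, use_ssm=True,
                               use_secrets_manager=False, use_awslogs=True):
    """Build the VPC endpoint service names expected for a private Fargate task."""
    # Image pulls need both ECR interface endpoints.
    interface = [
        f"com.amazonaws.{region}.ecr.api",
        f"com.amazonaws.{region}.ecr.dkr",
    ]
    if use_ssm:
        # Secrets stored as SSM parameters.
        interface.append(f"com.amazonaws.{region}.ssm")
    if use_secrets_manager:
        # Only if the task definition references Secrets Manager (assumption).
        interface.append(f"com.amazonaws.{region}.secretsmanager")
    if use_awslogs:
        # Only if the containers use the awslogs log driver (assumption).
        interface.append(f"com.amazonaws.{region}.logs")
    # ECR stores image layers in S3, reached through a gateway endpoint.
    gateway = [f"com.amazonaws.{region}.s3"]
    return {"interface": interface, "gateway": gateway}
```

Comparing the output of `required_endpoint_services("us-east-1")` with the endpoints actually attached to the VPC is a quick way to spot one that went missing when they were recreated.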
Other (potentially) useful information:
- Region: us-east-1
- I'm using the latest version of Pulumi
- I'm using Application Auto Scaling to scale the service down during the week
Any help/tips would be appreciated.