
I have made changes to the Docker for AWS CloudFormation template: I switched the AMI to https://aws.amazon.com/marketplace/pp/Amazon-Web-Services-Deep-Learning-AMI-Ubuntu-1604/B077GCH38C so that nvidia-docker is available, and changed the instance type to g3.4xlarge. I made a number of other tweaks as well.
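In case it helps, edits of that kind would look roughly like the following (a minimal YAML sketch; the parameter, mapping, and resource names and the AMI ID are placeholders, not taken from my actual template, which is linked below):

```yaml
# Minimal sketch only -- logical names and the AMI ID are placeholders,
# not copied from the real Docker for AWS template.
Parameters:
  InstanceType:
    Type: String
    Default: g3.4xlarge                        # GPU instance so nvidia-docker has a GPU to drive

Mappings:
  RegionToAmi:
    us-east-1:
      DeepLearningAmi: ami-xxxxxxxxxxxxxxxxx   # Deep Learning AMI (Ubuntu 16.04) ID for the region

Resources:
  ManagerLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !FindInMap [RegionToAmi, !Ref "AWS::Region", DeepLearningAmi]
      InstanceType: !Ref InstanceType
      # (key pair, security groups, user data, etc. omitted)
```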

When I create the stack, I can SSH into an instance; Docker Swarm is initialized and has access to all the nodes, and there are no error logs. However, the EC2 instances periodically get shut down, without any informative messages in the system logs of the terminated instances.

I was wondering if anyone has any idea why this may be happening.

Here is my cloudformation template:

pastebin.com/5465RgSN

Updated clarification: The stack is supposed to create 3 nodes (3 managers, 0 workers). A few minutes after the stack is created, the EC2 instances begin to shut down, and in their place new instances get created and join the swarm. When I SSH into an EC2 instance, I usually have 2-3 minutes before it gets shut down.
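For context on the terminate-and-replace behavior: if the managers' Auto Scaling group takes its instance health from the load balancer (HealthCheckType: ELB), then any instance that keeps failing the ELB health check is terminated and replaced by the group. A minimal sketch of such a configuration, with logical names as assumptions rather than the actual ones from the template:

```yaml
ManagerAsg:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "3"
    MaxSize: "3"
    DesiredCapacity: "3"
    LoadBalancerNames:
      - !Ref ExternalLoadBalancer
    HealthCheckType: ELB           # instance health is taken from the ELB health check
    HealthCheckGracePeriod: 300    # seconds after launch before failed checks count
    # With HealthCheckType: ELB, an instance that fails the load balancer's
    # health check is marked unhealthy, terminated, and replaced by the group.
    # (launch configuration, subnets, and other required properties omitted)
```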

  • Can you better describe what you mean by periodically? Are the instances replaced? I see lifecycle hooks defined and swarm cleanup in the CloudFormation template. Is it possible that what you're experiencing is the intended behavior? – TheClassic Dec 04 '19 at 23:15
  • I have edited the post with an updated clarification. I believe the intended purpose is that new instances should get created if one of the nodes goes down. However, I am not sure why the nodes automatically get shut down. – Sina Motevalli Bashi Dec 05 '19 at 03:00
  • Could you share the CloudFormation file? – regularlearner Dec 09 '19 at 16:51
  • You can find it here: pastebin.com/5465RgSN – Sina Motevalli Bashi Dec 10 '19 at 19:15
  • The issue is resolved. It was the ELB health check making HTTP requests to a port that was not open. I changed the ELB health check target from HTTP:44554 to TCP:22, and it works now. – Sina Motevalli Bashi Dec 10 '19 at 19:17
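For completeness, a minimal sketch of that health-check change on the classic load balancer resource (the logical ID here is an assumption, and the actual template may be JSON rather than YAML):

```yaml
ExternalLoadBalancer:
  Type: AWS::ElasticLoadBalancing::LoadBalancer
  Properties:
    HealthCheck:
      # Before: an HTTP probe against a port nothing was listening on, so
      # every instance eventually failed the check and got replaced:
      # Target: HTTP:44554
      # After: only verify that the instance accepts TCP connections on the
      # SSH port, which is open on every node:
      Target: TCP:22
      HealthyThreshold: "2"
      UnhealthyThreshold: "4"
      Interval: "10"
      Timeout: "5"
    # (listeners, subnets, security groups, etc. omitted)
```

With the TCP:22 target the load balancer only checks that the SSH port accepts connections, so healthy nodes stop being marked unhealthy and the Auto Scaling group stops cycling them.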

0 Answers