
Say nginx on an EC2 instance crashes. The instance itself is healthy and its CloudWatch metrics look fine, but every domain hosted on the server now returns "Connection refused".

This seems like a very basic function: monitoring to ensure a website is returning a 200. Is this somewhere in CloudWatch? I would think something could just run curl -s -o /dev/null -w "%{http_code}" http://www.example.org/ and, if it doesn't receive a status code of 200, say, 5 times in a row, trigger an instance restart and an SNS notification.
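
A minimal sketch of the check described above, assuming bash and curl are available. The names URL, THRESHOLD, probe, next_failures and watch_url are illustrative, and the alert step is only a placeholder echo, not an actual restart or SNS call.

```shell
#!/usr/bin/env bash
# Sketch: probe a URL and flag it after N consecutive non-200 responses.
# URL, THRESHOLD and all function names are illustrative assumptions.

URL="${URL:-http://www.example.org/}"
THRESHOLD="${THRESHOLD:-5}"

# Print the HTTP status code; curl prints 000 when the connection fails.
probe() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" || true
}

# Given the current failure count and the latest status code,
# print the new consecutive-failure count (reset to 0 on a 200).
next_failures() {
  local count="$1" code="$2"
  if [ "$code" = "200" ]; then
    echo 0
  else
    echo $((count + 1))
  fi
}

# Poll once a minute; after THRESHOLD failures in a row, run the alert step.
watch_url() {
  local failures=0
  while true; do
    failures="$(next_failures "$failures" "$(probe "$URL")")"
    if [ "$failures" -ge "$THRESHOLD" ]; then
      # Placeholder: a restart or notification would go here.
      echo "ALERT: $URL failed $failures checks in a row"
      failures=0
    fi
    sleep 60
  done
}
```

Calling watch_url starts the loop. Keeping the counting logic in next_failures makes the "5 times in a row" rule easy to check without touching the network.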

Perhaps there is something I should be running on the EC2 instance that would restart nginx if a site is unreachable? Either way, I'd love to know how to do this with an AWS resource, so I could even monitor any site and kick off an SNS notification.
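
For the on-instance option, something like the following could be run from cron. This is a sketch under assumptions: nginx is managed by systemd, the AWS CLI is installed with permission to publish to SNS, and CHECK_URL, TOPIC_ARN, run and remediate are hypothetical names and placeholder values.

```shell
#!/usr/bin/env bash
# Sketch: restart nginx and notify via SNS when a health check fails.
# CHECK_URL and TOPIC_ARN are placeholder values, not real resources.

CHECK_URL="${CHECK_URL:-http://localhost/}"
TOPIC_ARN="${TOPIC_ARN:-arn:aws:sns:us-east-1:123456789012:example-topic}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Do nothing on a 200; otherwise restart nginx and publish a notification.
remediate() {
  local code="$1"
  if [ "$code" = "200" ]; then
    return 0
  fi
  run systemctl restart nginx
  run aws sns publish \
    --topic-arn "$TOPIC_ARN" \
    --subject "nginx restarted" \
    --message "Health check for $CHECK_URL returned $code; nginx was restarted."
}
```

A cron entry could then call remediate "$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$CHECK_URL")". Setting DRY_RUN=1 prints the commands instead of running them, which is handy for a first test.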

Sorry if I'm missing something easy here. It just seems this would be something easily searched, but I have spent hours across months trying to figure this out.

Neal

2 Answers


This is typically a job for a load balancer (ALB or ELB), which can detect whether the web server on the instance is responding. If it isn't, you can trigger some action through CloudWatch. Again, typically, that action is an instance replacement through an Auto Scaling Group.

It’s perfectly normal to use ASG and ALB even if you need only a single instance.
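
As a rough sketch of that setup with the AWS CLI, assuming the ALB target group and the Auto Scaling Group already exist: TARGET_GROUP_ARN and ASG_NAME below are placeholders, and the interval, threshold and grace period are example values rather than recommendations.

```shell
#!/usr/bin/env bash
# Sketch: have the ALB health-check "/" and let the ASG replace unhealthy instances.
# TARGET_GROUP_ARN and ASG_NAME are placeholder values, not real resources.

TARGET_GROUP_ARN="${TARGET_GROUP_ARN:-arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/0123456789abcdef}"
ASG_NAME="${ASG_NAME:-example-asg}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

configure_health_checks() {
  # Mark a target unhealthy after 5 consecutive non-200 responses on "/".
  run aws elbv2 modify-target-group \
    --target-group-arn "$TARGET_GROUP_ARN" \
    --health-check-path / \
    --health-check-interval-seconds 30 \
    --unhealthy-threshold-count 5 \
    --matcher HttpCode=200

  # Make the Auto Scaling Group act on the load balancer's health status.
  run aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$ASG_NAME" \
    --health-check-type ELB \
    --health-check-grace-period 300
}
```

Running DRY_RUN=1 configure_health_checks prints both commands for review before anything is changed.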

Alternatively, you can create custom CloudWatch metrics using the CloudWatch agent installed on the instance. Then you can report anything you want.
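
One way to get a custom metric without configuring the agent is to push it with the AWS CLI's put-metric-data, which is a different route from the agent mentioned above. This is a sketch, assuming the instance has credentials allowing cloudwatch:PutMetricData; the namespace Custom/Nginx, the metric name HttpOk and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: publish 1 (healthy) or 0 (unhealthy) as a custom CloudWatch metric.
# NAMESPACE and METRIC_NAME are illustrative names, not AWS defaults.

NAMESPACE="${NAMESPACE:-Custom/Nginx}"
METRIC_NAME="${METRIC_NAME:-HttpOk}"

# Map an HTTP status code to a metric value: 1 for 200, 0 for anything else.
code_to_metric() {
  if [ "$1" = "200" ]; then
    echo 1
  else
    echo 0
  fi
}

# Probe the given URL once and publish the result as a single data point.
publish_health() {
  local url="$1" code value
  code="$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url" || true)"
  value="$(code_to_metric "$code")"
  aws cloudwatch put-metric-data \
    --namespace "$NAMESPACE" \
    --metric-name "$METRIC_NAME" \
    --value "$value" \
    --unit Count
}
```

Run publish_health http://localhost/ every minute from cron, and a CloudWatch alarm on the metric can then notify an SNS topic when the value stays at 0 for several periods.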

Hope that helps :)

MLu
  • Thanks! I actually was thinking this in bed. I unfortunately treat that instance as a pet instead of an immutable environment. I'm dying for job experience with AWS... where I'd learn how things are done 'better' or the right way. – Neal Jan 08 '20 at 22:57
  • Right now, the AWS environment and instance are created using terraform and the ubuntu instance is configured with a BASH script. I haven't figured out how to automate that yet, regarding TLS certificates. – Neal Jan 08 '20 at 23:00
  • I will do some research on the custom CloudWatch metrics. Thanks for that tip! – Neal Jan 08 '20 at 23:01
  • @Neal If you use ALB you can use AWS Certificate Manager (ACM) to issue and renew your certs and ALB can handle the TLS termination. That will simplify your cert management and ease the load on your instance. – MLu Jan 08 '20 at 23:06
  • Good point, I just did some reading on ACM (studying the AWS Security Specialty curriculum). The potential cost was my only hesitation. Either way, it would be really good experience for me to practice using the ACM resource instead of certbot. Thx again! – Neal Jan 09 '20 at 00:18

IMHO, replacing an instance because Nginx has stopped responding isn't a good engineering solution. Instance replacement can take several minutes, so relying on AWS to do this means your service is offline during that time, whereas a simple Nginx reload takes less than a second.

Nginx is a very, very robust technology. If it's failing to the point where you're looking at AWS solutions for reliability, you probably need to go back and look at your Nginx setup. I appreciate you want to learn about AWS, but I don't think this is a good use case.

To answer the question: there are myriad ways to do site reliability in AWS. If you want to do it with a single instance and no extra cost, I would recommend Elastic Beanstalk as a turnkey solution. It will apply the reliability mechanisms you need based on a health check you provide. You can also leverage Docker in Elastic Beanstalk, which is the ultimate destination of all SRE operations.

Garreth McDaid