Common AWS failures - Handling AZ failover

Question

Specifically, I would like to know the recommended way to organize AZ failover in an AWS environment. It would also help to understand typical AWS failures in order to design application HA (high availability). The application architecture (the AWS services used) is a more or less typical web application architecture in AWS:

Route 53 resolves to the IP of an ELB.
A public subnet holds that ELB, which routes traffic to web servers in a private subnet.
In the private subnet, traffic flows: Web Servers -> ELB -> Application Servers.
The application servers write data to a Multi-AZ RDS instance.

The main drawback of this deployment is that services are active in only one AZ, because in a Multi-AZ deployment Amazon RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. So the master is in only one AZ, and services in the other AZ are not able to write to RDS because the instance there is a standby.

Two questions:

What is the best way to implement HA for such a deployment?
What are the common AWS failures? When one AZ is unavailable, does that usually affect only some services (e.g. VPC, EC2, EBS), or are all services in that AZ typically unavailable?

Considerations about HA for this approach:

RDS. From the AWS docs: "In the event of a planned or unplanned outage of your DB instance, Amazon RDS automatically switches to a standby replica in another Availability Zone if you have enabled Multi-AZ. The time it takes .....". So AWS will switch the RDS master automatically.
Active/inactive AZ. Health checks can be added to Route 53 to make another AZ active. But how can this be synchronized with RDS, so that an AZ becomes active only after RDS has promoted the master in that AZ?

Update: Another reason to maintain one active and one passive AZ is that our application servers must support stickiness by device IP address (i.e. a session is kept based on the user's or device's IP). We have one EC2 web server instance in each AZ that maintains this, so we can't allow requests to go to different AZs.

score 3 · Accepted Answer · answered Aug 25 '17 at 15:19

I think you misunderstand how availability zones work. Services in one AZ can connect to the RDS master in a different AZ. You should have all services running in at least 2 AZs.

For RDS, when the master fails or the AZ the master is in goes down, the RDS service will promote the standby to master and update the DNS record for the RDS endpoint so that the endpoint points to the new master.

All your code needs to do in order to handle an RDS failover is to gracefully handle sudden DB disconnects with a retry.
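As a rough illustration of "handle sudden DB disconnects with a retry", here is a minimal Python sketch. The names `TransientDBError` and `with_retry` are hypothetical and not part of any AWS or database library; in real code you would catch whatever "connection lost" exception your database driver raises.

```python
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on TransientDBError with exponential backoff.

    `operation` is any zero-argument callable. The `sleep` parameter exists
    only so the delay can be replaced in tests.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDBError:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller see the failure
            # Wait 0.5s, 1s, 2s, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

For this to help during a failover, `operation` would presumably need to open a fresh connection using the RDS endpoint hostname on each attempt, so that a retry can pick up the updated DNS record rather than reusing a dead connection.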

Mark B


What do you mean by "IP based" stickiness exactly? – Mark B Aug 25 '17 at 16:12
Requests that come from a particular client IP (not based on a cookie) go to the same application server instance. – user1459144 Aug 25 '17 at 16:15
How are you achieving IP stickiness with an ELB? Are you bypassing the ELB? Do you only have one EC2 instance per ELB? – Mark B Aug 25 '17 at 16:21
Why do you even care about IP stickiness? You want to make sure all users that share an internet connection always use the same app server? Why is cookie stickiness not an option? I think you will need to rethink your session handling if you want to achieve proper failover support in the cloud. – Mark B Aug 25 '17 at 16:29
Yeah, it's true. The main issue is that the clients are not only browsers, and they don't handle cookies the way browsers do (that's why cookie stickiness is not an option). – user1459144 Aug 25 '17 at 17:00
Perhaps you should replicate sessions across servers instead of using sticky sessions? – Mark B Aug 25 '17 at 17:16
Thx for suggestions (appreciate your help)! Even this discussion is too big for one post :) – user1459144 Aug 25 '17 at 20:33
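
One possible reading of the suggestion above to stop relying on sticky sessions, sketched in Python. Note that this uses a shared session store rather than true server-to-server replication, and that the explicit session token sent by the client is an assumption, since the thread does not say how non-browser clients would identify themselves. `SharedSessionStore` and `AppServer` are hypothetical names, and the in-memory dict merely stands in for an external store reachable from every AZ.

```python
import uuid


class SharedSessionStore:
    """In-memory stand-in for an external store shared by all app servers."""

    def __init__(self):
        self._sessions = {}

    def create(self, data):
        # The token is returned to the client, which sends it with each request.
        token = uuid.uuid4().hex
        self._sessions[token] = dict(data)
        return token

    def get(self, token):
        return self._sessions.get(token)


class AppServer:
    """Hypothetical app server that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, token):
        session = self.store.get(token)
        if session is None:
            return f"{self.name}: unknown session"
        return f"{self.name}: hello {session['device']}"
```

With this shape, a request carrying the same token can land on a server in either AZ and still find its session, so the load balancer no longer has to pin a client IP to one instance.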
