Our web server cluster runs in AWS's US-WEST-2 region and reads from and writes to a Postgres RDS instance in the same region. As per AWS's SLA, a region can be down for 22 minutes in a month.

To mitigate this 22-minute downtime whenever it happens, I am setting up another cluster in the US-EAST-1 region with its own RDS instance.

To keep both clusters in sync, I want PUT requests to be relayed to both clusters, in US-WEST-2 and US-EAST-1. Is there a web proxy or AWS service that can help me with this?

KunalC
  • *"As per AWS's SLA a region can be down for 22 mins in a month."* This is the number of minutes of downtime per month at 99.95% availability, and is the first threshold at which AWS makes financial concessions for a lack of availability... but you seem to be treating it as though it were a meaningful estimate of anticipated downtime. **22 minutes is not a meaningful number in any sense for setting actual uptime expectations.** It has essentially nothing to do with how reliable the services are. The us-west-2 region has, in my memory, never had a region-wide outage of *any* duration. – Michael - sqlbot Aug 02 '17 at 23:26
  • @Michael-sqlbot Even though us-west-2 has never had a region-wide outage, the purpose of this project is to plan for one in case it happens, as I want my service to be 99.99% available. – KunalC Aug 02 '17 at 23:35
  • I am not saying it can't have an outage. S3 never had a 5 hour outage in us-east-1 until, one day, it did. My point is that an outage is statistically most likely to be much shorter or much longer. 22 minutes has no meaning. – Michael - sqlbot Aug 02 '17 at 23:43
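For reference, the arithmetic behind the 22-minute figure discussed in the comments can be sketched as follows. The 30-day month and the function name `downtime_minutes` are assumptions made for illustration; the result is only the downtime budget implied by an availability percentage, not a prediction of how long an outage will last.

```python
def downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Downtime budget, in minutes, for a period of `days` at the given availability."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)


for target in (99.95, 99.99):
    print(f"{target}% over 30 days -> {downtime_minutes(target):.2f} min")
# 99.95% over 30 days -> 21.60 min
# 99.99% over 30 days -> 4.32 min
```

Note that the 99.99% target mentioned above leaves a budget of only about 4.3 minutes per month, which is why the time needed to fail over matters as much as having a second site.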

1 Answer

Architecturally, the solution you seem to have decided on (relaying every write to two independent clusters) is difficult to implement and isn't typically the best way to approach things.

A better approach is often to have more than one set / cluster of web servers that aren't in the same building, and to keep your database synchronized.

Availability Zones

The easiest way to achieve this is to take advantage of availability zones (AZs). AZs are independent data centers in the same area with very low latency between them, and each AZ has independent network and power feeds. It's rare for multiple AZs to fail at one time, but it can happen; my feeling is it might happen every couple of years.

The advantage of using AZs is that AWS makes it easy to increase application reliability within a single region. You can use a load balancer to distribute traffic to web / application servers in multiple AZs, each serving traffic. RDS offers synchronous replicas across AZs in a region (Multi-AZ deployments), so any data written in AZ A is synchronously replicated to the standby in AZ B. If AZ A fails, the RDS instance in AZ B takes over within seconds to a small number of minutes.
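As a minimal sketch of turning on the synchronous standby described above, the function below calls the RDS `modify_db_instance` API through a boto3 client that the caller passes in. The instance identifier `web-db` and the function name are hypothetical, and the change is deferred to the maintenance window by default, which is an assumption about what you would want.

```python
def enable_multi_az(rds_client, instance_id: str, apply_immediately: bool = False):
    """Ask RDS to convert an existing instance to a Multi-AZ deployment.

    `rds_client` is expected to be a boto3 RDS client, for example
    boto3.client("rds", region_name="us-west-2").
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        MultiAZ=True,  # provision a synchronous standby in another AZ
        ApplyImmediately=apply_immediately,  # False: wait for the maintenance window
    )


# Usage (requires boto3 and AWS credentials; "web-db" is a hypothetical name):
# import boto3
# enable_multi_az(boto3.client("rds", region_name="us-west-2"), "web-db")
```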

Multi Region

Multi-region architectures are more difficult. You load balance or fail over with Route53, which can either distribute load across regions or send all traffic to a secondary region when the primary fails.
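A Route53 failover setup of this kind could be sketched as a pair of record sets, one marked `PRIMARY` with a health check and one marked `SECONDARY`. The helper below only builds the `ChangeBatch` structure accepted by `change_resource_record_sets`; the host names, the health check ID, the 60-second TTL, and the helper name are all hypothetical placeholders.

```python
def failover_change_batch(name, primary_dns, secondary_dns, health_check_id, ttl=60):
    """Build a Route53 ChangeBatch with PRIMARY and SECONDARY failover records."""

    def record(role, target, extra):
        rrset = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        rrset.update(extra)
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Changes": [
            # Traffic goes to the primary while its health check passes.
            record("PRIMARY", primary_dns, {"HealthCheckId": health_check_id}),
            # Route53 answers with the secondary once the primary is unhealthy.
            record("SECONDARY", secondary_dns, {}),
        ]
    }


# Usage (requires boto3 and AWS credentials; every value here is a placeholder):
# import boto3
# batch = failover_change_batch("app.example.com.", "lb-west.example.com",
#                               "lb-east.example.com", "hypothetical-health-check-id")
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="HYPOTHETICALZONEID", ChangeBatch=batch)
```

A short TTL keeps clients from caching the failed region's address for long, at the cost of more DNS queries.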

You can't have synchronous replicas with RDS across regions. You can create read replicas across regions, and you could manually promote a replica to master if required. Once the outage resolves, getting things running in the original configuration again can be problematic, as the old master DB is out of sync with the new master. If you want multi-master databases across regions, I suspect you'll have to run them yourself on EC2 instances.
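The cross-region replica and the manual promotion step could look roughly like this with boto3. Both functions take an RDS client created in the secondary region; the ARN, the identifiers, and the function names are hypothetical, and the sketch ignores encryption, parameter groups, and waiting for the replica to become available.

```python
def create_cross_region_replica(rds_secondary, source_arn, replica_id, source_region):
    """Create an asynchronous read replica in the secondary region.

    `rds_secondary` is expected to be a boto3 RDS client for the secondary
    region, for example boto3.client("rds", region_name="us-east-1").
    """
    return rds_secondary.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_arn,  # cross-region sources are given by ARN
        SourceRegion=source_region,
    )


def promote_replica(rds_secondary, replica_id):
    """Promote the replica to a standalone, writable instance during an outage.

    Promotion is one-way: the old master does not become a replica of the
    promoted instance, so failing back needs a separate resync.
    """
    return rds_secondary.promote_read_replica(DBInstanceIdentifier=replica_id)


# Usage (requires boto3 and AWS credentials; every value here is a placeholder):
# import boto3
# east = boto3.client("rds", region_name="us-east-1")
# create_cross_region_replica(
#     east, "arn:aws:rds:us-west-2:123456789012:db:web-db", "web-db-east", "us-west-2")
# promote_replica(east, "web-db-east")  # only when failing over
```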

TLDR

Use AZs to increase reliability for standard failures. If you need exceptional availability then you can use another region as either a hot or a cold site, depending on your RTO (recovery time objective).

Tim
  • We are already using multi-AZs to increase reliability within the region. The plan is to add another hot site in different region to account for region failure. – KunalC Aug 02 '17 at 23:15
  • In that case I think your primary concern should probably be data synchronization, both to the second region (easy) and then how to make the first region primary again (more difficult). I'd look into the database engine's own replication within RDS, rather than RDS-managed replication. I'd also really want to consider RPO/RTO and look at how often failures occur before I did this, because a hot site could increase your costs significantly, though auto scaling may mitigate that somewhat. If it's "hot" you need an RDS instance sized for the full production load. – Tim Aug 02 '17 at 23:24