
I'm working on a client-server system where all clients currently submit their transactions to essentially a single west-coast IP address to reach what is called the "gateway" application. The gateway does some accounting and dispatches each transaction to any one of a number of database servers for final processing. The servers return their results directly to the client (not back out through the gateway).

The plan is to add a second gateway on the east coast, for redundancy and failover. It will normally only be on stand-by, designed to take over and become the actual gateway should the working gateway fail: essentially the classic active/stand-by failover configuration.

Some participants are arguing that having only one stand-by gateway is insufficient, and we should also implement a second stand-by gateway, say in the midwest. Others are arguing that the extra cost, complexity, and management of two stand-bys is unnecessary, and that the simultaneous unavailability of gateways on both coasts is so unlikely as to not be a concern.

What is considered best practice? Just how much redundancy (in terms of physically separate access points available to clients) is typically considered nominal? Are dual failures common enough that having only one stand-by is frequently regretted?

EDIT: Regarding "calculating" cost vs. benefit for the amount of redundancy I need or want, I guess it's better to rephrase my question as:

Where are statistics indicating the frequency with which a geographically separate collection of IP addresses are simultaneously unreachable?

In other words, a table like

On average, 1 west coast IP + 1 east coast IP
are simultaneously unreachable 1 day/year.
On average, 1 west IP + 1 east IP + 1 southern IP
are simultaneously unreachable 1 hr/year.
On average, 1 west IP + 1 east IP + 1 southern IP + 1 northern IP
are simultaneously unreachable 1 minute/year.
etc.

makes it fairly easy to choose the amount of desired redundancy, because there's an actual basis from which to calculate costs vs. performance. (I guess "simultaneously unreachable" has to mean "to a substantial number of clients randomly scattered around the country", since a single client could be unable to reach any servers regardless of how many there are because of her own local network failure.)
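
For concreteness, this is the arithmetic such a table boils down to, sketched in Python. The per-site unavailability figure is made up, and the independence assumption is precisely what real-world data would have to confirm or refute (backbone cuts and regional disasters are correlated failures, which is why I want measured numbers rather than this idealized model):

    # Idealized model: sites fail independently, each unreachable the same
    # (made-up) fraction of the year. Real statistics would replace both.
    HOURS_PER_YEAR = 365 * 24

    def simultaneous_downtime_hours(per_site_unavailability, sites):
        """Expected hours/year during which ALL the sites are unreachable at once."""
        return HOURS_PER_YEAR * per_site_unavailability ** sites

    per_site = 0.01  # hypothetical: each site unreachable ~1% of the year (~88 hours)

    for n in range(1, 5):
        hours = simultaneous_downtime_hours(per_site, n)
        print(f"{n} site(s): ~{hours:g} hours/year simultaneously unreachable")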

However, without such a table, any redundancy vs. performance calculations would just be guesswork. So: are there any sources of real life availability data on which such calculations can be based? Or does everyone just guess what they'll need, and expand as necessary once they find out they guessed low, or cut back if they guessed high?

It would seem companies offering fault-tolerant products would want to collect and promote such data. On the other hand, maybe the data would show 99.99% of fault-tolerant customers don't really need much redundancy at all. For example, if I can go for a full year and my east and west IP addresses are never simultaneously unreachable, I'm not going to bother considering adding a midwest IP.

I also realize there's a distinction between an IP address being unreachable due to forces external to my site, and an IP address being down because my site has failed internally. Internal failures (on my side of the IP address) are relatively easy to deal with. External failures (on the client side of the IP address, such as California going offline due to earthquakes, or New York going offline during a hurricane) I can deal with only by having extra IP addresses in some other geographic location. That is the probability I'm hoping to quantify. For now, I'm leaning toward the camp that claims the likelihood that east and west IP addresses are simultaneously unreachable is too small to be concerned with.

  • It really boils down to a function of how much redundancy you need, and how much you're willing to pay for the privilege. No one can answer that but you guys - it's just a cost-benefit analysis, really. – HopelessN00b Feb 10 '14 at 23:06
  • Are the database services redundant as well, or would a given request always need to communicate to a specific one of the database servers? – Shane Madden Feb 10 '14 at 23:06
  • @Shane: all the data servers are identical, redundant, and essentially read-only, so a given request can go to any of them. They're not an issue at all. – Witness Protection ID 44583292 Feb 11 '14 at 00:52

3 Answers


What @HopelessN00b said. You have to weigh up the raw Cost VS Benefit for yourself.

  • Some customers will literally turn a computer off for a specific period of time to save costs, because they get no traffic at all during those hours.
  • Some customers will need a load balanced cluster, with a failover instance in a separate datacentre, plus a third network in another datacentre to act as a witness, and a guarantee from their providers for 100% 24/7/365 uptime with no exceptions.

You have to calculate:

  • How many hours out of the day do I need to be online?
  • How much $$$ do we lose if we are offline for X hours/minutes?
  • Is it worth spending another $5000 per month for DR if I am only losing $250 per hour, and I only anticipate 5 hours of downtime per month (roughly 99.3% availability)? See the rough break-even sketch after this list.
  • Et cetera
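
To make that last bullet concrete, here is a back-of-the-envelope break-even sketch in Python. The dollar and downtime figures are the hypothetical ones from the bullet, not recommendations; plug in your own:

    # Hypothetical figures from the bullet above -- substitute your own.
    dr_cost_per_month = 5000.0      # monthly cost of the DR / standby setup
    loss_per_hour = 250.0           # revenue lost per hour of downtime
    downtime_hours_per_month = 5.0  # downtime you expect per month without DR

    expected_loss = loss_per_hour * downtime_hours_per_month
    availability = 1 - downtime_hours_per_month / 730.0  # ~730 hours in a month

    print(f"Expected loss without DR: ${expected_loss:.0f}/month")
    print(f"DR cost:                  ${dr_cost_per_month:.0f}/month")
    print(f"Availability without DR:  {availability:.2%}")  # about 99.32%
    print("Worth it on pure dollars?", expected_loss > dr_cost_per_month)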

There is no best practice for this.


Where are statistics indicating the frequency with which a geographically separate collection of IP addresses are simultaneously unreachable?

This, too, depends. For instance, are we talking about statistics for customers that don't have a UPS, or their own generator? Or even two independent power lines coming from separate substations?

That comes into the equation too. Our company once went offline because of a power outage so lengthy that our UPS ran out of juice.
We proceeded to purchase a generator for our entire datacentre which lasts X hours, with the ability to refuel via fuel drop-off during emergencies, so that even if the local substation is completely knocked out, we can keep going almost indefinitely.

maybe the data would show 99.99% of fault-tolerant customers don't really need much redundancy at all.

Totally.
I have customers that run critical ($$$) systems on a single server, in a single location, and their server is rock-solid because it just performs one function. The less complication, the better.

It's the old ironic situation where you add a DR solution, and then you experience more outages than ever before.

Vasili Syrakis

As has already been said, there is no generic best practice here at the technical level, aside from the obvious list of things not to do.

A lot will be informed by any SLAs that you explicitly have with your clients, or that are likely to be assumed in their industry - essentially, you need to make sure you can support those under all but the most exceptional circumstances, and can afford any recompense you need to make should such an exceptional circumstance happen. For instance, with some of our clients we have a four-hour recovery window with 24 hours' data loss being "acceptable" (which is very easy to ensure); for another project that is far more real-time, those timings are ten and thirty minutes; and I can imagine mission-critical and/or safety services having far stricter expectations than that.

The only generic advice I can think of is to make sure you have the basics of everything covered to a certain level before spending time and money on one specific point. Having the most redundant failsafe database layer on the planet doesn't help you when the one public link to your web farm dies. So try not to overly protect one part of the system at the expense of others.

David Spillett

Our first web server began in city X in 1995 on a Centrex connection, which converted to ISDN in 1998, and then to DSL in 2001, when we also started a second static address in city Y a few miles away for backup. Although we were using two different ISPs, the underlying network was all PacBell, now AT&T. Our city X facility was vacated in 2003 and only city Y ran our server until 2009, when we started another static address in city Z, again just a few miles from city Y, and both Y and Z are now even using the same ISP.

In all those years, our IP addresses were never "externally" (as you put it) unreachable, as far as we could ever tell. Apparently PacBell/AT&T and our ISP have always had sufficient redundancy that they could always at least deliver our packets. "Internally" the only problems we've had were power failures, not even machine failures, and just temporarily switching DNS pointers between the two locations during those kinds of incidents (for a few days maybe once every couple of years) has been sufficient for our purposes.

If you get a west coast IP and an east coast IP, I predict your clients (as a group) will probably never see those addresses be simultaneously unreachable. If both locations are unreachable (in other words packets can't even be sent there), then Armageddon has probably arrived and you'll have bigger problems anyway. Just make sure you have policies and procedures in place (and tested) to get back up ASAP should you have an internal failure at either site, and don't worry about getting a third midwest IP until circumstances somehow prove it's truly necessary.
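
As a minimal illustration of the "policies and procedures" point (and only that - the hostname, port, and failover action below are placeholders, not anything a particular site actually runs), a reachability probe for the primary gateway might be as simple as:

    # Hypothetical reachability probe for the primary gateway. Run it from a
    # vantage point outside the gateway's own network; what you do on failure
    # (paging someone, repointing DNS to the standby) is site-specific.
    import socket
    import time

    PRIMARY = ("gateway-west.example.com", 443)   # placeholder host and port
    CHECK_INTERVAL_S = 30                         # seconds between probes
    FAILURES_BEFORE_ACTION = 3                    # tolerate brief blips

    def reachable(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    failures = 0
    while True:
        if reachable(*PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_ACTION:
                print("Primary gateway unreachable -- start the failover procedure")
        time.sleep(CHECK_INTERVAL_S)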

joe snyder