I'm working on a client-server system where all clients currently submit their transactions to essentially a single west-coast IP address to reach what is called the "gateway" application. The gateway does some accounting and dispatches each transaction to one of several database servers for final processing. The servers return their results directly to the client (not back out through the gateway).
The plan is to add a second gateway on the east coast, for redundancy and failover. It will normally only be on stand-by, designed to take over and become the actual gateway should the working gateway fail, essentially the classic configuration illustrated here.
Some participants are arguing that having only one stand-by gateway is insufficient, and we should also implement a second stand-by gateway, say in the midwest. Others are arguing that the extra cost, complexity, and management of two stand-bys is unnecessary, and that the simultaneous unavailability of gateways on both coasts is so unlikely as to not be a concern.
What is considered best practice? Just how much redundancy (in terms of physically separate access points available to clients) is typically considered nominal? Are dual failures common enough that having only one stand-by is frequently regretted?
EDIT: Regarding "calculating" cost vs. benefit for the amount of redundancy I need or want, I guess it's better to rephrase my question as:
Where can I find statistics indicating how often a collection of geographically separate IP addresses is simultaneously unreachable?
In other words, a table like
On average, 1 west coast IP + 1 east coast IP
are simultaneously unreachable 1 day/year.
On average, 1 west IP + 1 east IP + 1 southern IP
are simultaneously unreachable 1 hr/year.
On average, 1 west IP + 1 east IP + 1 southern IP + 1 northern IP
are simultaneously unreachable 1 minute/year.
etc.
makes it fairly easy to choose the amount of desired redundancy, because there's an actual basis from which to calculate costs vs. performance. (I guess "simultaneously unreachable" has to mean "to a substantial number of clients randomly scattered around the country", since a single client could be unable to reach any servers regardless of how many there are because of her own local network failure.)
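To make the arithmetic behind such a table concrete, here is a rough Python sketch. It converts a yearly downtime figure into an availability fraction, and shows what the numbers would look like under the strong (and probably optimistic) assumption that sites become unreachable independently of one another. The function names and the 1% per-site unavailability are invented placeholders for illustration, not measured data.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(downtime_hours_per_year: float) -> float:
    """Fraction of the year during which at least one address is reachable."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


def simultaneous_downtime_hours(per_site_unavailability: float, sites: int) -> float:
    """Expected hours/year with ALL sites unreachable at once.

    ASSUMPTION: failures are independent, so the probabilities multiply.
    Correlated failures (shared backbone, DNS, a bad deploy) make the
    real figure worse than this estimate.
    """
    return (per_site_unavailability ** sites) * HOURS_PER_YEAR


if __name__ == "__main__":
    # The three rows of the hypothetical table above, as availability.
    for label, hours in [("1 day/year", 24.0), ("1 hr/year", 1.0), ("1 minute/year", 1 / 60)]:
        print(f"{label}: availability {availability(hours):.6%}")

    p = 0.01  # hypothetical: each site unreachable 1% of the time
    for n in (1, 2, 3):
        print(f"{n} site(s): {simultaneous_downtime_hours(p, n):.4f} h/year all down")
```

The independence assumption is exactly what real statistics would confirm or refute, so the output is only a baseline to compare measured data against.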
However, without such a table, any redundancy vs. performance calculations would just be guesswork. So: are there any sources of real life availability data on which such calculations can be based? Or does everyone just guess what they'll need, and expand as necessary once they find out they guessed low, or cut back if they guessed high?
It would seem companies offering fault-tolerant products would want to collect and promote such data. On the other hand, maybe the data would show 99.99% of fault-tolerant customers don't really need much redundancy at all. For example, if I can go for a full year and my east and west IP addresses are never simultaneously unreachable, I'm not going to bother considering adding a midwest IP.
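Absent published statistics, one way to get such a figure for a specific deployment would be to log the outage windows of each address as seen by external probes, then measure how much those windows overlap. A minimal sketch, assuming each site's outages are available as sorted, non-overlapping (start, end) pairs in hours; the names `intersect` and `simultaneous_outage` and the sample data are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in hours, with start < end


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def simultaneous_outage(sites: List[List[Interval]]) -> float:
    """Total hours during which every listed site was unreachable."""
    if not sites:
        return 0.0
    common = sites[0]
    for outages in sites[1:]:
        common = intersect(common, outages)
    return sum(end - start for start, end in common)


if __name__ == "__main__":
    # Invented sample data: outage windows in hours since the log began.
    west = [(0.0, 2.0), (10.0, 12.0)]
    east = [(1.0, 3.0), (11.0, 11.5)]
    midwest = [(1.5, 1.75)]
    print(simultaneous_outage([west, east]))           # both coasts down
    print(simultaneous_outage([west, east, midwest]))  # all three down
```

Comparing the two-site and three-site totals over a long enough log would show directly how much a third location buys.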
I also realize there's a distinction between an IP address being unreachable due to forces external to my site, and an IP address being down because my site has failed internally. Internal failures (on my side of the IP address) are relatively easy to deal with. External failures (on the client side of the IP address, such as California going offline due to earthquakes, or New York going offline during a hurricane) I can deal with only by having extra IP addresses in some other geographic location. That is the probability I'm hoping to quantify. For now, I'm leaning toward the camp that claims the likelihood that the east and west IP addresses are simultaneously unreachable is too small to be concerned with.