3

There was an outage of Authorize.Net's websites in July 2009 because of a local fire. If you went to their website during that time, a notice or redirect pointed you to status updates on their Twitter account. That seemed like a good solution.

That got me thinking. For the websites I manage, in their current setup, if my host lost its internet connection entirely, the user would see a 'Server not found' error in their browser. I'd hate to have visitors think the company was no longer in business. I'd prefer that the visitor see some kind of 'Unplanned outage' page.

Currently I'd have to:

  1. Notice the site was down (IP monitoring)
  2. Update the nameserver's DNS records to point to another host (hopefully already set up)
  3. Wait for the new DNS records to propagate (25 mins - 48hrs)
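
Step 1 can be as simple as a periodic TCP connection attempt from a machine outside the hosting location. A minimal sketch, assuming a plain TCP connect is a good enough signal that the site is up; the function name `is_reachable` and its defaults are made up for illustration:

```python
import socket


def is_reachable(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A successful connect only shows that something is listening on the port, so a real monitor would probably also request a page and check the response.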

This seems like a horrible solution. I know there has to be a better way of doing this.

Question #1: What is a solution to avoid this?

An idea I had would be to point Nameservers 1 & 2 at nameservers physically located where the website is hosted, and Nameservers 3 & 4 at another host where an 'Unplanned outage' page can be viewed.

Question #2: Would this solution work?

Question #3: Can I rely on the nameservers being queried in order (1,2,3,4)?

Question #4: Is this a horrible idea or frowned upon?

Byran Zaugg

3 Answers

3

Your assumptions under "Currently I'd have to" are sound. Note that the DNS record propagation time is governed by the TTL values you set on your nameservers (the zone's SOA record carries the related timers), so you can make it much shorter. Look at the records for any prominent site and you'll see that they generally use short TTLs.

However, your solution wouldn't work because DNS servers aren't ordered. There's no 1,2,3,4.
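
To illustrate why the lack of ordering matters, here is a toy simulation. It assumes, purely for illustration, that a resolver picks uniformly at random among the four listed nameservers. Real resolvers use their own selection logic (many favor whichever server answers fastest), but none of them walk the list in order. The nameserver labels are hypothetical.

```python
import random
from collections import Counter

# Hypothetical labels: two nameservers answer with the real site's address,
# two answer with the address of the outage page.
NAMESERVERS = [
    "ns1 (real site)",
    "ns2 (real site)",
    "ns3 (outage page)",
    "ns4 (outage page)",
]


def simulate_queries(nameservers, n_queries, seed=0):
    """Count how often each nameserver is chosen under uniform random choice."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return Counter(rng.choice(nameservers) for _ in range(n_queries))


N_QUERIES = 10_000
counts = simulate_queries(NAMESERVERS, N_QUERIES)
outage_share = (counts["ns3 (outage page)"] + counts["ns4 (outage page)"]) / N_QUERIES
print(f"share of lookups answered by the outage-page nameservers: {outage_share:.2f}")
```

Under that assumption, roughly half of all lookups would be sent to the outage page even while the real site is healthy.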

One way I've handled this for a large website in the past was similar to what you described, with a failover component: DNS servers in the primary datacenter, DNS servers in a secondary hot-spare datacenter, and when the primary datacenter failed, the DNS was updated to point WWW at the secondary datacenter. There were commercial products to handle this automatically (BigIP 3DNS, hah) but it wasn't hard to script.

You could do something very similar on-the-cheap.

  • Get an inexpensive VPS and configure it as a secondary nameserver for your domain(s), and update your records with your registrar to make sure everybody knows about that nameserver.

  • Host a site outage page on your new DNS server.

  • Tweak TTL/Retry/Refresh numbers in your DNS SOA record to correspond to desired failover window.

  • If your primary site fails, update your DNS manually...(or automatically, if you can detect the failure reliably and script it...)
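
The scripted version of the last two bullets might look roughly like the sketch below. Everything here is an assumption for illustration: `probe` stands in for whatever health check you trust, `point_dns_at` stands in for however you push a record change to your DNS server, and the IP addresses are documentation placeholders. Requiring several consecutive failures is one simple way to avoid flipping DNS on a single dropped check.

```python
from typing import Callable


def worst_case_failover_seconds(check_interval: int, failures_needed: int,
                                record_ttl: int) -> int:
    """Rough upper bound for a resolver that honors the TTL:
    time to detect the failure plus time for the old record to expire."""
    return check_interval * failures_needed + record_ttl


class FailoverMonitor:
    """Switch DNS to a backup IP after N consecutive failed checks."""

    def __init__(self, probe: Callable[[], bool],
                 point_dns_at: Callable[[str], None],
                 primary_ip: str, backup_ip: str, failures_needed: int = 3):
        self.probe = probe                # hypothetical health check
        self.point_dns_at = point_dns_at  # hypothetical DNS update hook
        self.primary_ip = primary_ip
        self.backup_ip = backup_ip
        self.failures_needed = failures_needed
        self.active_ip = primary_ip
        self._failures = 0

    def tick(self) -> str:
        """Run one check; update DNS only when the target actually changes."""
        if self.probe():
            self._failures = 0
            target = self.primary_ip  # fail back as soon as a check passes
        else:
            self._failures += 1
            if self._failures >= self.failures_needed:
                target = self.backup_ip
            else:
                target = self.active_ip  # not enough evidence yet
        if target != self.active_ip:
            self.point_dns_at(target)
            self.active_ip = target
        return self.active_ip
```

With a 30-second check interval, 3 failures required, and a 60-second TTL, `worst_case_failover_seconds(30, 3, 60)` gives 150 seconds. That figure leaves out the time the DNS update itself takes and any resolver that caches longer than the TTL says it should.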

I'm sure others will have some suggestions on the (many) ways you could handle this.

quadruplebucky
  • On the subject of TTL it's important to be aware that an ever increasing number of systems are ignoring it and caching the data for whatever period that system is configured for. This of course makes it harder to get a failover system using DNS to work as we might like. – John Gardeniers Mar 02 '10 at 21:18
  • I've heard (and suspected) that for a while - do you know of any studies to confirm that or any specific systems? – quadruplebucky Mar 02 '10 at 21:33
  • AOL has been the biggest offender in this area for YEARS. – Zypher Mar 03 '10 at 04:59
  • A concrete example: http://techblog.wikimedia.org/2010/03/global-outage-cooling-failure-and-dns/ – Bill Weiss Mar 24 '10 at 22:18
2

Take a look at AutoFailover.com

Snip from their offering:

Autofailover

The mainstay of TZO-HA and the foundation for the high availability option is the unique capability of maintaining extraordinarily low cache times. This allows for near real time traffic redirection.

When TZO-HA detects a failure it automatically updates the DNS record for your domain so that the server requests are sent to the IP address of your alternate server or server cluster.

Unprecedented failover time

The maximum time to re-direct server requests is 2-1/2 minutes including failure detection, DNS record changes, and DNS propagation time through other DNS servers. Typically, this all occurs within 1 minute. Competitive offerings can only deliver time frames of 10 to 30 minutes or more. TZO-HA also Includes Multiple Failover modes.

Richard West
1

Doing that via DNS is a horrible idea. Not only will it take forever for your clients to get the hint that your IP has changed, but they'll then cache that you're down, even after you come back up.

What the big guys do is have a second site available (hosting the "we're down" page, or maybe just another copy of the site), and have some routers doing BGP in front of them. If one site goes down, packets magically go to the other site. When it comes back up, it has priority, and there you go.

That's expensive. You probably don't need it. If you do, well... get spending :)

Another option would be to host your main page off of a CDN (that presumably won't go down). If your site is hosed, flip them over to your "hey, things are bad, but they'll get better" page while you make your fixes.

Bill Weiss
  • Let me add a little data: I moved a domain a few months ago. Weeks before the move I dropped the TTL on the domain to something tiny. After the move, I was seeing connections going to the old IP for at least a month. Diminishing amounts, for sure, but some of them were clients who wouldn't have liked to get the "we're down" page for that time. – Bill Weiss Mar 03 '10 at 15:59
  • I'm not suggesting that DNS alone is the best way, but it's certainly an essential component of anything "the big guys" do. – quadruplebucky Mar 04 '10 at 01:53
  • No, it's not, at least not in the scenario I'm talking about. BGP moves the _IPs_ around, so DNS doesn't have to change. – Bill Weiss Mar 04 '10 at 14:20
  • I must not understand your scenario. BGP doesn't do anything but communicate routes. It does not magically move IPs. – quadruplebucky Mar 12 '10 at 05:55
  • I have two sites, A and B. They both have a router that speaks BGP, and advertise the same IP range, with a higher priority on A. That route gets used, so packets go to A. Now A goes away due to some issue. Those same packets go to site B, which is my DR site. – Bill Weiss Mar 12 '10 at 17:17
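
The two-site scenario in the last comment can be pictured with a toy model. This is not real BGP and uses no routing library: the `Advertisement` type, the single `priority` number, and the prefix are all made up for illustration, and actual BGP path selection weighs several attributes. It only shows the idea that the same address range is announced from both sites and traffic follows the best announcement still being made.

```python
from typing import List, NamedTuple, Optional


class Advertisement(NamedTuple):
    site: str      # which site announces the route
    prefix: str    # the shared IP range
    priority: int  # higher wins in this toy model
    up: bool       # whether the site is still announcing


def best_site(ads: List[Advertisement], prefix: str) -> Optional[str]:
    """Pick the highest-priority site still announcing the prefix."""
    live = [a for a in ads if a.prefix == prefix and a.up]
    if not live:
        return None
    return max(live, key=lambda a: a.priority).site


PREFIX = "203.0.113.0/24"  # documentation range, placeholder only
both_up = [Advertisement("A", PREFIX, 200, True),
           Advertisement("B", PREFIX, 100, True)]
a_down = [Advertisement("A", PREFIX, 200, False),
          Advertisement("B", PREFIX, 100, True)]

print(best_site(both_up, PREFIX))  # site A carries the traffic
print(best_site(a_down, PREFIX))   # site B takes over, DNS unchanged
```

The IP addresses never change in this model, so no DNS record has to be updated and no cached record can go stale.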