Best option for HA between remote datacentres?

Question

We're reviewing the system requirements that a new client has brought to us. They work in the medical field and the system they want us to host must be at a minimum highly available, and preferably fault tolerant.

We're looking at licensing VMware Enterprise to get its HA and FT features, running on FT-compatible hardware. No biggie - 10Gbps Ethernet is coming down in price and 1Gbps Ethernet is a non-issue.

However, one of the client's requirements is that we have an HA site (it doesn't have to be FT, but FT would be nice) in a totally different city.

Bandwidth in Australia is crazy expensive, and I don't even know if it's possible to get a 1Gbps link between Sydney and Melbourne (approx 1000km/600 miles apart). I'd probably struggle to get a 100Mbps link between the two cities.

What are my options for providing an HA system? It doesn't have to be with VMware, but if one site goes down I need to be able to log into the 2nd site, hit the Start button and be up and running.

@Mark if I still lived in Toowoomba (moved south 15 years ago) I could have it by now. But then I'd be living in Toowoomba... — Mark Henderson, Sep 28 '10 at 06:13
Is dark fiber an option? The optics won't be cheap, but you'd have all the bandwidth you can think of. — Joris, Sep 28 '10 at 07:00
@Joris - not really. Australia never really had the dot-com boom that the US experienced, so there's no excess capacity just lying around, and running your own that distance is prohibitively expensive — Mark Henderson, Sep 28 '10 at 11:50
eaischh... toowoomba. I think not having the NBN is a better deal :) — Mark, Sep 29 '10 at 02:41

score 5 · Accepted Answer · answered Sep 28 '10 at 04:55

I'm currently in the initial planning/implementation phases of getting a DR/HA remote datacenter set up for $WORK. Our current environment is already highly-virtualized, both in terms of server virtualization (VMware) and storage virtualization (Compellent SAN). As such, we thought it made most sense to leverage both of those technologies and the HA solutions they have available.

All of our VMFS volumes currently live on the Compellent SAN, and we'll be using its async volume replication to keep things relatively in sync (probably every 15 mins) with another SAN at the remote site. Then, for the VMs themselves, we're using VMware's SRM. Pretty cool product that I'm guessing you've heard of. If you want it to, it can pretty much reduce failover to your secondary datacenter to a single button click. Pretty slick.

Fortunately for us, we have a GigE link between sites that we'll be using for SAN replication (among other things). When syncing every 15 minutes, the volume deltas in our case will not be that large. Depending on how much churn your client's systems have, it may not be all that difficult to keep things in sync over a 100Mbps link (or even smaller). I know of other Compellent customers that are syncing over a single T1. Obviously, there's not a whole lot of data change happening there...
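To get a rough feel for whether a given link can keep up, here is a minimal back-of-the-envelope sketch. The function names, the 5 GB delta and the 80% link efficiency are all assumptions for illustration, not measurements from our environment; plug in your own change rate.

```python
def replication_seconds(delta_gb, link_mbps, efficiency=0.8):
    """Seconds needed to ship one replication delta across the WAN link.

    delta_gb   -- data changed since the last sync, in decimal gigabytes
    link_mbps  -- nominal link speed in megabits per second
    efficiency -- assumed usable fraction of the link (a guess, not a measurement)
    """
    delta_megabits = delta_gb * 8 * 1000  # 1 GB = 8000 megabits
    return delta_megabits / (link_mbps * efficiency)


def fits_in_window(delta_gb, link_mbps, interval_minutes=15, efficiency=0.8):
    """True if the delta finishes copying before the next sync is due."""
    return replication_seconds(delta_gb, link_mbps, efficiency) <= interval_minutes * 60


if __name__ == "__main__":
    # Hypothetical 5 GB of changed blocks per 15-minute interval.
    for name, mbps in [("T1", 1.544), ("100Mbps", 100), ("GigE", 1000)]:
        secs = replication_seconds(5, mbps)
        print(f"{name}: 5 GB delta takes {secs:.0f}s, "
              f"fits in 15 min: {fits_in_window(5, mbps)}")
```

This ignores compression, deduplication and competing traffic on the link, so treat the result as a sanity check rather than a sizing guarantee.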

Anyway, here are a few things to take into consideration:

Be careful with your swap LUNs. They most likely don't need to be replicated. Sure, systems on the far end will need swap LUNs, but you can probably get those VMs re-mapped to local-only LUNs. This way, you won't need to worry about the overhead of replicating useless data.
Make sure that your SAN vendor has some plan for failback. You don't want to be "stuck" with your production environment running at the failover site for months because you can't figure out how to get things back in sync at the main site.
This goes without saying, but test, test, test. VMware SRM makes this very easy to perform and can give you nice failover test reports to hand to the PHBs.
IP addressing. In our case, we have 802.1QinQ running between sites, so re-addressing servers when running at the failover site will (fortunately) not be necessary. This does require, though, BGP advertisements (with appropriate weights) from each site, as well as firewall rules to be maintained at each site so that when traffic swings over to the failover site, things will work as expected.

That's all the advice I have for now. In six months when I'm (hopefully) close to finalizing our DR system, I'm sure I will have learned many more things. :) Good luck and have fun!

Wow, awesome, thanks. Gives me plenty of food for thought. I had considered replicating at the SAN level but wasn't sure if it was going to be viable. I think we could get away with a 100Mb link after the 2nd site is initialised. Thankfully we're at *least* 6 months away from having to even place the first purchase order. — Mark Henderson, Sep 28 '10 at 06:10
I've seen VMWare SRM being mentioned in my VMWare bulletins, etc but I haven't really read into it much. It looks pretty darn good. — Mark Henderson, Sep 28 '10 at 06:11

score 2 · Answer 2 · answered Sep 28 '10 at 07:17

At $WORK we are evaluating VMware SRM on an HP P4000 SAN (AKA LeftHand).

marcoc


Do you have the SAN doing the replication for you? – Mark Henderson Sep 28 '10 at 11:50
Correct. The link between the main and DR sites is 100Mbps, but the scheduled replica only uses a fraction of it. For disaster recovery purposes, you first have to define your Recovery Point Objective (RPO) and Recovery Time Objective (RTO) in order to design the solution properly. – marcoc Sep 28 '10 at 11:56
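
As a minimal sketch of how RPO and RTO targets can be checked against a scheduled-replication design: the class name, field names and the numbers below are hypothetical, and the model assumes each replica finishes transferring before the next one starts.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPlan:
    interval_minutes: float  # time between scheduled replicas
    transfer_minutes: float  # time for one replica to cross the link
    failover_minutes: float  # time to bring services up at the DR site

    def worst_case_rpo(self) -> float:
        # If the primary fails just before a replica finishes transferring,
        # the newest usable copy at the DR site is one interval plus one
        # transfer old.
        return self.interval_minutes + self.transfer_minutes

    def meets(self, rpo_target: float, rto_target: float) -> bool:
        """True if the plan satisfies both targets (all values in minutes)."""
        return (self.worst_case_rpo() <= rpo_target
                and self.failover_minutes <= rto_target)


# Hypothetical figures: 15-minute schedule, 5-minute transfer, 45-minute failover.
plan = ReplicationPlan(interval_minutes=15, transfer_minutes=5, failover_minutes=45)
print(plan.worst_case_rpo(), plan.meets(rpo_target=30, rto_target=60))
```

If the plan misses the RPO target, the levers are a shorter interval or a faster link; if it misses the RTO target, the fix is on the failover side rather than the replication side.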
