I'm currently in the initial planning/implementation phases of getting a DR/HA remote datacenter set up for $WORK. Our current environment is already highly virtualized, both in terms of server virtualization (VMware) and storage virtualization (Compellent SAN). As such, we thought it made the most sense to leverage both of those technologies and the HA solutions they have available.
All of our VMFS volumes currently live on the Compellent SAN, and we'll be using its async volume replication to keep things relatively in sync (probably every 15 mins) on another SAN at the remote site. Then, for the VMs themselves, we're using VMware's SRM. Pretty cool product that I'm guessing you've heard of. If you want it to, it can pretty much reduce failover to your secondary datacenter to a single button click. Pretty slick.
Fortunately for us, we have a GigE link between sites that we'll be using for SAN replication (among other things). When syncing every 15 minutes, the volume deltas in our case will not be that large. Depending on how much churn your client's systems have, it may not be all that difficult to keep things in sync over a 100Mbps link (or even smaller). I know of other Compellent customers that are syncing over a single T1. Obviously, there's not a whole lot of data change happening there...
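If you want to sanity-check a link before committing to a replication interval, the arithmetic is simple enough to script. This is only a back-of-the-envelope sketch: the function names are made up for illustration, the 70% efficiency factor is my assumption (protocol overhead, plus other traffic sharing the link), and it ignores any compression or dedup your SAN might do on the wire.

```python
# Rough replication sizing: how much changed data fits in one sync window?
# Assumption: only `efficiency` (default 70%) of the nominal link rate is usable.

def max_delta_bytes(link_mbps: float, interval_seconds: float,
                    efficiency: float = 0.7) -> float:
    """Largest delta (in bytes) that can be shipped within one interval."""
    usable_bits_per_sec = link_mbps * 1_000_000 * efficiency
    return usable_bits_per_sec * interval_seconds / 8


def transfer_seconds(delta_bytes: float, link_mbps: float,
                     efficiency: float = 0.7) -> float:
    """Time (in seconds) needed to ship a delta of the given size."""
    usable_bits_per_sec = link_mbps * 1_000_000 * efficiency
    return delta_bytes * 8 / usable_bits_per_sec


INTERVAL_SECONDS = 15 * 60  # the 15-minute sync interval mentioned above

# Nominal line rates in megabits per second.
LINKS = {"T1": 1.544, "100 Mbps": 100.0, "GigE": 1000.0}

if __name__ == "__main__":
    for name, mbps in LINKS.items():
        mib = max_delta_bytes(mbps, INTERVAL_SECONDS) / 2**20
        print(f"{name:>8}: about {mib:,.0f} MiB of change per 15-minute window")
```

If the churn you measure per interval is consistently above what the link can carry, replication falls further behind on every cycle, so either the interval or the link has to change.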
Anyway, here are a few things to take into consideration:
- Be careful of your swap LUNs. They most likely don't need to be replicated. Sure, systems on the far end will need swap LUNs, but you can probably get those VMs re-mapped to local-only LUNs. This way, you won't need to worry about the overhead of replicating useless data.
- Make sure that your SAN vendor has some plan for failback. You don't want to be "stuck" with your production environment running at the failover site for months because you can't figure out how to get things back in sync at the main site.
- This goes without saying, but test, test, test. VMware SRM makes this very easy to perform and can give you nice failover test reports to hand to the PHBs.
- IP addressing. In our case, we have 802.1QinQ running between sites, so re-addressing servers when running at the failover site will (fortunately) not be necessary. This does require, though, BGP advertisements (with appropriate weights) from each site, as well as firewall rules to be maintained at each site so that when traffic swings over to the failover site, things will work as expected.
That's all the advice I have for now. In six months when I'm (hopefully) close to finalizing our DR system, I'm sure I will have learned many more things. :) Good luck and have fun!