
We're a small consulting shop that hosts some public-facing websites and web applications for clients (apps we've either written or inherited). Our strength lies in coding, not necessarily in server management. However, a managed hosting solution is out of our budget (the monthly costs would exceed any income we derive from hosting these applications).

Yesterday we experienced a double hard drive failure in one of our servers running RAID 5. It's rare that something like this happens. Luckily we had backups and simply migrated the affected databases and web applications to other servers. We got really lucky: only one drive failed completely, the other was just marked as pending failure, so we were able to live-move almost everything [one db had to be restored from backup] and each client only had about 5 minutes of downtime while we took their database offline and moved it.

However we're now worried that we've grown a bit... organically... and now we're attempting to figure out the best plan for us moving forward.

Our current infrastructure (all bare metal):

  • pfSense router [old repurposed hardware]
  • 1U DC [no warranty, old hardware]
  • 2U web & app server (Server 2k8 R2, IIS, MSSQL, 24 GB RAM, dual 4C/8T Xeon) -- this is the one that had the drive failures -- [warranty good for another year, drives being replaced under the warranty]
  • 4U inherited POS server (128 GB RAM, but 32-bit OS only, Server 2k3) [no warranty]
  • (2) 1U web servers (2k8, IIS, 4C/8T Xeon, 4 GB RAM) in a load-balanced cluster (via pfSense) [newish with warranty]
  • 1U database server (2k8, MSSQL, 4C/8T Xeon, 4 GB RAM) [new with warranty]
  • NAS running unRAID with 3 TB of storage (used for backups and for serving web app files to the 2 load-balanced web servers)

Our traffic is fairly light, however downtime is pretty much unacceptable. Looking at the CPU monitors throughout the day, we have very, very little CPU usage.

We've been playing with ESXi as a development server host and it's been working reasonably well. Well enough for me to suggest we run something like that in our production environment.

My initial inclination is to build a SAN (following a guide such as this: http://www.smallnetbuilder.com/nas/nas-howto/31485-build-your-own-fibre-channel-san-for-less-than-1000-part-1) to host the VMs. I'd build it in RAID 1+0 to avoid the nasty hard drive failure issue we had yesterday.

We'd run the majority of the VMs on the server that currently has the failed hard drives, as it is the most powerful; run other VMs on the 1U servers that we've currently got load balanced; P2V the old out-of-warranty hardware (except pfSense, which I prefer to keep on physical hardware); and continue running the unRAID box for backups.

I've got a bunch of questions, but the infrastructure based ones are as such:

  • Is this a reasonable solution for mitigating physical hardware issues? I think the SAN becomes a massive SPOF, and the large server (the one that would be hosting the VMs) is another. I've read that the paid versions of VMware support automatic failover of VMs, which might be something we look into to mitigate the risk of a VM host failure.
  • Is there another option that I'm missing? I've considered basically "thin provisioning" the applications where we'd use our cheaper 1U server model and run the db and the app on one box (no VM). In the case of hardware failure, we'd have a smaller segment of our client base affected. This increases our hardware costs, rackspace costs, and system management costs ("sunk" cost in employee time).
  • Would Xen be more cost effective if we did need the VM failover solution?
Sean

1 Answer


I have identified a number of questions.

  • What virtualization makes sense for us (Xen?)?

You are pretty much bound to W2K8R2. IMHO you should take a deeper look at Microsoft Hyper-V (and I am a Linux/Unix guy!). The licensing model might be attractive: buy one, get three virtual servers free (if I remember correctly what our Windows guys say about it).

I am using Xen for paravirtualized Linux based on SLES 10 SP4. It works great and I really like it. But I have a W2K3 server running there as well (fully virtualized) and I want to get rid of it (-> Hyper-V).

  • How can we prevent such failures in the future?

Hard disk failures are the most common failures. Try to avoid using hard disks from a single vendor, and try to avoid disks from the same vendor with the same production month (i.e., from the same batch).

There are not many vendors left on the market, though. So you should also live-replicate your data: build more HA into your systems!
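For example, if you want to check whether the disks in an array already come from the same batch, smartmontools will show the model, firmware and serial number of each drive. The device names below are just placeholders for your actual disks:

    # Print vendor/model, serial number and firmware for each array member,
    # so you can spot drives from the same batch before they fail together.
    smartctl -i /dev/sda
    smartctl -i /dev/sdb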

  • How do we setup HA properly?

KISS (keep it simple, stupid!). I believe in simple two-node clusters. Divide your services into two halves: each box hosts one half of the services, but each physical server should be able to host all services (maybe with a small performance degradation). In your setup, try to replicate each VM's disk data online to the other side.

  • How do we setup fast, cheap HA storage?

Use two DAS boxes that can each accept two controller connections (so both servers can connect to each box). Do host-based mirroring between those boxes.
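On a Linux host, a minimal sketch of that host-based mirroring with mdadm could look like this (the device names are assumptions, one LUN from each DAS box):

    # Build a RAID 1 mirror across one LUN from each DAS box,
    # so losing an entire box only degrades the mirror.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.ext3 /dev/md0    # then put the VM images on /dev/md0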

OR:

Put enough local storage into your servers, and mirror this local storage host-based between those servers over the network (no clue if this works with Hyper-V, but I use Linux/DRBD 8 for that purpose).
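A minimal DRBD 8 resource definition for that kind of two-node mirror might look roughly like this (hostnames, disk devices and IP addresses are placeholders):

    # /etc/drbd.d/r0.res -- sketch only
    resource r0 {
      protocol C;                 # synchronous replication, no acknowledged write is lost
      on nodeA {
        device    /dev/drbd0;     # the device the VMs actually use
        disk      /dev/sdb1;      # local storage being mirrored
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on nodeB {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

The VMs then run on whichever node is currently primary for /dev/drbd0, and the other node always holds an up-to-date copy of the disk.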

  • Is it a problem using unsupported hardware?

No, as long as you have enough replacement parts on site: hard disks, RAM, power supplies, CPUs, network cards, in that order. Hard disks and RAM are the most common failures.

Nils
  • Thanks Nils. I looked pretty heavily at Hyper-V and honestly felt that VMware was a better solution overall (definitely more mature). Our dev servers on ESXi 5 have been rock solid. I dug more into HA setups and I think our front-end approach is reasonable, but we need to figure out a strategy for the database layer (which likely won't include a SAN, but maybe virtualization). – Sean Mar 11 '12 at 02:40