
Can anyone tell me whether it is possible to pool several physical servers to run a resilient virtualization environment? Our servers are becoming more and more critical to our clients, and we want to do everything we can to improve resiliency in the event of a hardware failure. I have used desktop VMs, but I am not familiar with what is possible in enterprise-level virtualization.

The ideal would be to have a few physical servers in our datacenter. A few VMs would be shared among these to run a web server, application server, and database server. If one physical server failed, the VMs should switch to one of the other servers and continue running without any interruption.

Can this be accomplished? I realise that even Google goes down from time to time, so I am not looking for perfection; just an optimal solution.

Kev

4 Answers


This is an excellent reason to virtualize. As application availability, rather than individual (physical) server uptime, becomes more important to businesses, many organizations find that they can attain a higher level of reliability through virtualization.

I'll use VMware and Xen as examples: with some form of shared storage that's visible to two or more host systems, virtualized guests can be distributed and load-balanced across physical servers. The focus then shifts to the quality of the shared-storage solution, the management tooling, and the networking/interconnects in the environment.
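To make the failover behaviour concrete, here's a toy sketch (plain Python, not any hypervisor's API; the host names and RAM figures are made up) of what an HA-style restart amounts to: when a host dies, its guests are re-placed on whichever surviving hosts have spare capacity. This is also why clusters are sized with spare (N+1) headroom:

```python
# Toy model of VM placement with shared storage: when a host dies,
# its guests are restarted on the remaining hosts. Host names and
# capacities below are illustrative only.

def place(vm_ram, hosts):
    """Pick the host with the most free RAM that can fit the VM."""
    host = max(hosts, key=lambda h: h["free"])
    if host["free"] < vm_ram:
        raise RuntimeError("cluster lacks spare capacity for failover")
    host["free"] -= vm_ram
    host["vms"].append(vm_ram)
    return host["name"]

def fail_host(dead, hosts):
    """Re-place every guest from the dead host onto the survivors."""
    survivors = [h for h in hosts if h["name"] != dead["name"]]
    return [(vm, place(vm, survivors)) for vm in dead["vms"]]

hosts = [
    {"name": "esx1", "free": 16, "vms": [8, 8]},  # web + app guests
    {"name": "esx2", "free": 24, "vms": [8]},     # db guest
    {"name": "esx3", "free": 32, "vms": []},      # spare headroom
]

moved = fail_host(hosts[0], hosts)
print(moved)  # each 8 GB guest lands on the emptiest survivor
```

If the surviving hosts had no headroom, `place` would raise instead — which is the toy version of HA admission control.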

However, one bit of caution... You should evaluate what types of hardware and environmental situations actually pose a threat. Quality server-class equipment includes many redundancies (fans, power supplies, RAID, even RAM)... modern hardware simply does not fail that often. So avoid overreacting and building an unnecessarily complex environment when spec'ing higher-end servers could eliminate 90% of the potential issues.

ewwhite
  • I just don't see how that answers the question that was asked. – John Gardeniers Aug 18 '12 at 23:06
  • @JohnGardeniers Businesses virtualize to reduce the impact of hardware failure on application availability. But having a single server "fail" totally and completely is also a rare occurrence... so some analysis is necessary to understand the pain points of the OP's environment. – ewwhite Aug 18 '12 at 23:37
  • Not sure I want to rely on hoping the servers won't just fail. I just want to be able to recover from a failure as quickly as possible. If virtualization is the best way to do that at the moment, I'll go with it. If not, I'm open to other ideas. For the record, I would never spec a server without RAID and dual power supplies. – Kev Aug 19 '12 at 19:58
  • @kryptonite Virtualization abstracts the hardware, so your guest operating systems/virtual machines become more portable. For example, in a [2- or 3-host VMware setup with shared storage and vMotion](http://www.vmware.com/files/pdf/products/vsphere/VMware-vSphere-Essentials-DataSheet-DS-EN.pdf), your worst-case outage in the event of an individual server failure is that the virtual machines running on that server would quickly restart on a healthy host. Contrast that with having to find spare hardware and recover or restore data. – ewwhite Aug 19 '12 at 20:22

It sounds like VMware FT might be what you're looking for. It keeps a "shadow instance" of each virtual machine running in lockstep with the source VM and allows for instantaneous failover between the two instances. More here:

http://www.vmware.com/products/fault-tolerance/overview.html
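To illustrate the lockstep idea behind a "shadow instance" (this is a toy model in Python, not VMware's actual record/replay mechanism — class and variable names are made up): both instances receive the exact same inputs in the same order, so their states never diverge, and the shadow can take over instantly with identical state:

```python
# Toy model of fault tolerance via lockstep replication: every input
# is applied to both the primary and the shadow, so their states never
# diverge and the shadow can take over instantly on host failure.

class Replica:
    def __init__(self):
        self.state = 0

    def apply(self, op):
        self.state += op  # stand-in for executing one deterministic input

primary, shadow = Replica(), Replica()

for op in [5, -2, 7]:      # the hypervisor feeds identical inputs...
    primary.apply(op)
    shadow.apply(op)       # ...to both instances, in the same order

assert primary.state == shadow.state  # lockstep: states are identical
active = shadow                       # primary host dies: instant failover
print(active.state)                   # shadow resumes with the same state
```

The catch, as noted elsewhere in this thread, is that keeping the two in sync costs money and network bandwidth.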

joeqwerty
  • Looks good. I guess it would only make sense if each mirror was on a different physical box? – Kev Aug 19 '12 at 19:53
  • Yes, which is essentially "pooling" your physical servers. By maintaining consistent shadow copies of your VMs across multiple physical servers, the applications running on those VMs become resilient and immune to failure of any one physical host. – joeqwerty Aug 19 '12 at 20:14

The "without any interruption" part is quite an ask, especially since today you appear to be coming from standard servers with no resiliency at all.

Virtualisation is an option, but for the sake of full disclosure you should make an informed decision between the following:

  1. A small interruption, on the order of a few minutes.
  2. No interruption (we're talking milliseconds).

(2) is normally both:

  1. Expensive - you need N+N hardware capacity, i.e. for every server you're running, you have a full standby server running the exact same software, ready to take over in case of a hardware failure.
  2. Restrictive - the software you use for this keeps the machines "in sync", normally over Ethernet. That means that if your network slows down, it will slow your application down to ensure things remain in lockstep. To get any kind of performance, those machines have to be in the same datacentre.

Virtualisation with VMware FT is one solution; Xen has its equivalent with everRun, and there are bare-metal equivalents (no hypervisor).

(1) may well be all you need (Clustering)

  1. Depending on the application, this can offer failover equivalent to (2). E.g. NFS servers like NetApp filers can offer seamless failover, where clients continue with no failures and only a brief interruption.
  2. "Slightly" more tolerant of software failures. Because non-deterministic CPU instructions are not kept in lockstep, a number of bugs like race conditions won't be triggered on both nodes.
  3. Could allow you to run different versions of the software. For example: upgrade node 1 of the cluster to Service Pack 1 of Windows Server 2008, confirm it's OK, then upgrade node 2 to Service Pack 1.
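To illustrate where option (1)'s short interruption comes from, here's a toy heartbeat monitor in Python (the timeout, interval, and node names are made up): a standby is promoted only after several consecutive missed heartbeats, so the failover gap is the detection window plus whatever the service restart takes.

```python
# Minimal heartbeat-based failover sketch (option 1, clustering).
# The standby is promoted after MISS_LIMIT consecutive missed beats,
# so there's a short, bounded interruption rather than none at all.

MISS_LIMIT = 3        # missed beats tolerated before declaring death
BEAT_INTERVAL_S = 5   # heartbeat period in seconds (illustrative)

def failover_monitor(heartbeats, active="node1", standby="node2"):
    """heartbeats: iterable of booleans (True = beat received)."""
    missed = 0
    for beat in heartbeats:
        missed = 0 if beat else missed + 1
        if missed >= MISS_LIMIT:
            # Worst-case detection time before promotion begins:
            detect_s = MISS_LIMIT * BEAT_INTERVAL_S
            return standby, detect_s
    return active, 0

# node1 answers twice, then goes silent:
owner, gap = failover_monitor([True, True, False, False, False])
print(owner, gap)  # node2 takes over after a ~15 s detection window
```

Tuning the interval and miss limit trades detection speed against false failovers on a briefly congested network.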

I don't mean to sell clustering over fault tolerance, or bare metal over hypervisors, but when it comes to high availability, hopefully the above illustrates the number of questions you need to answer before implementing anything:

  1. What is the maximum downtime tolerated by users (be realistic)?
  2. What are the outage domains you will tolerate? Physical server? Software? Layer 2 network? Layer 3? Datacentre?
  3. What are the performance requirements of the application? Virtualisation is not for everything, and only very recently have clock-sensitive applications like Active Directory been accepted on virtual machines (and it is certainly not common practice). Regardless of whether you use the latest hypervisor and chipsets, virtualisation will still mean a hit on performance, throughput, and latency.
  4. What budget do you need to work within?

These requirements can be translated into figures like MTTF, and depending on the budget and the skill sets of your team, some solutions will simply be a no-go.
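For question 1 in particular, it helps to turn an availability target into concrete downtime numbers before comparing solutions. This is straightforward arithmetic, nothing vendor-specific:

```python
# Convert an availability target ("nines") into allowed downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_minutes_per_year(availability_pct):
    return (1 - availability_pct / 100) * HOURS_PER_YEAR * 60

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
# 99.0%  -> ~5256 min/year  (~3.7 days)
# 99.9%  -> ~525.6 min/year (~8.8 hours)
# 99.99% -> ~52.6 min/year
```

A target of "a few minutes per failure" with a handful of failures a year already lands you near three nines, which clustering can usually deliver; four or five nines is where the expensive option (2) territory begins.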

M Afifi

It's doable, and we do something similar, just without the automatic part.

As @ewwhite pointed out, the key is having a shared storage pool that's visible to multiple host servers, so if one host goes down, it doesn't matter much, because another host can take over. Setting up the kind of unnoticeable, interruption-free automatic failover you're asking about is not easy (or cheap), and frankly a lot more trouble than it's worth, at least for the vast majority of use cases out there. Modern hardware doesn't fail a lot unless it's set up really badly, so you'll get more mileage out of making sure it's set up right and kept in an environment that's within the operational ranges of the equipment.

We use the failover and high-availability functions of our systems for only two things, really. The first is disaster recovery (if our main site loses power or explodes, or what have you, the critical parts are mirrored at a second facility) and the second is avoiding maintenance windows. We use blade servers and ESX/vSphere, and between the ability to fail over to a secondary site and the ease of using vMotion to move VMs between hosts, there's very little that we can't do without a service interruption.

I would focus on getting that set up first. Once you're able to (manually) fail things over to wherever you need, you may decide that making it work automatically is more expensive and difficult than it's worth. It sounds easy enough and great in theory, but in practice it can be a real pain to get everything working properly in clusters or in a distributed-guest setup.

HopelessN00b
  • This is the sort of information I'm looking for. I'm glad you have mentioned specific technologies. I will look into vMotion. I wonder how cloud providers do it - how do they split virtual OSes between many physical servers? That is what I would like - a stack of servers, with a storage array, and some distributed management software. I have thought about moving it to the cloud, but there are some issues with that at the moment and I don't know if it would work with our setup. – Kev Aug 19 '12 at 19:36
  • Most of the virtualization suites have some form of "live migration", similar to [VMware's vMotion](http://www.vmware.com/products/vmotion/overview.html). That's a pretty key feature. So what is [the cloud](http://en.wikipedia.org/wiki/Cloud_computing#Characteristics)? It's just virtualization on a larger scale, with some self-service, management and automation aspects. – ewwhite Aug 19 '12 at 21:50
  • VMware's services are a bit confusing to me. For example, what is the difference between HA and vMotion? I have also marked this post as the answer as it was the first one that gave specific info, although the others have been very useful too. I think I will start a new topic about comparisons between different services. – Kev Aug 20 '12 at 19:31
  • @Kryptonite Basically, vMotion allows you to move a guest VM from one host to another (manually) without interruption. HA/High Availability is an ESX option that will start up another instance of your VM in the event that the host it's on fails. (It uses vMotion to do this.) And then there's FT/Fault Tolerance, which keeps a second copy of your VM running on a separate host (and keeps everything synced up), so that if one host fails, the secondary VM can take over instantly with the same system state and memory contents as the failed VM. – HopelessN00b Aug 20 '12 at 19:45