If you can't afford or don't need a cluster or a spare server waiting to come online in the event of a failure, it seems like you might split the services provided by one beefy server onto two less beefy servers. Thus, if Server A goes down, clients might lose access to, say, email, and if Server B goes down, they might lose access to the ERP system.
While at first this seems like it would be more reliable, doesn't it simply increase the chance of hardware failure? So any one failure isn't going to have as great an impact on productivity, but now you're setting yourself up for twice as many failures.
When I say "less beefy", what I really mean is lower component spec, not lower quality. So one machine spec'd out for virtualization vs. two servers spec'd out for less load each.
Oftentimes a SAN is recommended so that you can use either clustering or migration to keep services up. But what about the SAN itself? If I were to put money on where a failure is going to occur, it's not going to be the basic server hardware; it's going to have something to do with storage. If you don't have some sort of redundant SAN, then those redundant servers wouldn't give me a great feeling of confidence. Personally, for a small operation it would make more sense to me to invest in servers with redundant components and local drives. I can see the benefit in larger operations where the price and flexibility of a SAN are cost effective, but for smaller shops I'm not seeing the argument, at least not for fault tolerance.

9 Answers
This all boils down to risk management. Doing a proper cost/risk analysis of your IT systems will help you figure out where to spend the money and what risks you can or have to live with. There's a cost associated with everything...this includes HA and downtime.
I work at a small place, so I understand this struggle. The IT geek in me wants no single points of failure anywhere, but the cost of doing that at every level is not a realistic option. Here are a few things I've been able to do without having a huge budget. This doesn't always mean removing the single point of failure, though.
Network Edge: We have two internet connections, a T1 and Comcast Business. We're planning to move our firewall over to a pair of old computers running pfSense, using CARP for HA.
Network: Getting a couple of managed switches for the network core and using bonding to split the critical servers between the two switches prevents a single switch failure from taking out the entire data closet.
Servers: All servers have RAID and redundant power supplies.
Backup Server: I have an older system that isn't as powerful as the main file server, but it has a few large SATA drives in RAID 5 and takes hourly snapshots of the main file server. I have scripts set up so it can switch roles and become the primary file server should the main one go down (a rough sketch of the snapshot job is below).
Offsite Backup Server: Similar to the onsite backup, we do nightly backups to a server over a VPN tunnel to one of the owners' houses.
Virtual Machines: I have a pair of physical servers that run a number of services inside virtual machines using Xen. These run off an NFS share on the main file server, and I can do live migration between the physical servers if the need arises (see the migration one-liner below).
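The live migration itself is a one-liner with the classic Xen xm toolstack, assuming the domU's disks live on storage both hosts can reach (the NFS share) and that relocation is enabled on the target host; the domain and host names here are made up:

```
# Move the running "mail" domU to the other physical server without downtime.
# Requires shared storage for the domU's disks (the NFS export) and
# xend relocation enabled on xenhost2.
xm migrate --live mail xenhost2
```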
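And for the hourly snapshots on the backup server, I won't paste my actual scripts, but the core of the job is an rsync hard-link rotation along these lines (host and path names are made up; the role-switch logic is separate):

```
#!/bin/sh
# Hourly cron job on the backup server, e.g.:  0 * * * * /usr/local/bin/snapshot-mainfs.sh
# Pulls a snapshot of the main file server into a timestamped directory,
# hard-linking unchanged files against the previous snapshot to save space.
SRC="mainfs:/srv/data/"
DEST="/backup/snapshots"
STAMP=$(date +%Y%m%d-%H%M)

rsync -a --delete --link-dest="$DEST/latest" "$SRC" "$DEST/$STAMP" \
  && rm -f "$DEST/latest" \
  && ln -s "$DEST/$STAMP" "$DEST/latest"
```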

- Thanks! But I'm really asking about using two servers over one without clustering or replication...essentially just splitting services across two servers. And if a NAS or SAN is used for storage, doesn't that just re-create the single point of failure? From a component standpoint I'll certainly always have redundancy (drives, etc.), but that doesn't help when the RAID controller freaks out and breaks the array. – Boden Jan 29 '10 at 21:45
- Yeah, I once lost a RAID 5 array to a misbehaving circuit in the hot-swap chassis screwing up the entire chain. That shouldn't be as much of an issue with the modern serial equivalents as it was with the old parallel buses. Eliminating the single points of failure is not going to be cost effective at the scale you're talking about, unless the cost of a failure is extremely high, which isn't likely. I do have one suggestion though...but I'll put that in another comment. – 3dinfluence Jan 29 '10 at 21:54
- If you only have 2 servers you can do this: assuming both servers have enough storage capacity/RAM and support virtualization, you can set up Xen on both of them. Set up cron jobs on each to save the state of the virtual machines and copy the resulting file to the other physical machine nightly. That way, if you do have a system failure, you can get things back up and running quickly on the remaining hardware, minus whatever changes happened that day (a rough sketch appears after these comments). – 3dinfluence Jan 29 '10 at 22:01
- That's an interesting suggestion. However, that's likely to increase the cost of the servers dramatically; each will have to be capable of running the load of the other (although perhaps with degraded performance). If you're going to spend that kind of money, then why not just have two identical servers with one as a hot standby? – Boden Jan 29 '10 at 22:13
- This all goes back to the cost/risk management. You are in the best position to answer questions like: Is running your services with degraded performance better than them being down? Are you willing to lose all the changes since the last snapshot? You may be able to get around that with some backup strategy. Getting to a point of no single points of failure is tough without economy of scale working in your favor; Amazon's cloud may be an option. Virtualization is changing this, but it's not quite there yet, and maybe not with 2 servers. Projects like Sheepdog look interesting. – 3dinfluence Jan 29 '10 at 22:41
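For what it's worth, the nightly save-and-copy job suggested in the comments above might look roughly like this with the classic xm toolstack (domain names, paths, and the target host are invented for illustration):

```
#!/bin/sh
# Nightly cron job on server A: checkpoint each VM and copy the state file to
# server B so the VMs can be brought back there with "xm restore" if A dies.
for dom in mail erp; do
    xm save "$dom" "/var/xen/saves/$dom.chk"    # suspends the domU and writes its state to disk
    xm restore "/var/xen/saves/$dom.chk"        # bring it straight back up locally
    scp "/var/xen/saves/$dom.chk" serverB:/var/xen/saves/
done
# NB: this captures memory/CPU state only; the disk images must also be copied
# to server B or live on storage both machines can reach.
```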
I think this is a question with many answers, but I would agree that in many smaller shops the several-server approach works and, as you say, at least something keeps going if there is a failure. But it depends on what fails.
It's very hard to cover all bases, but redundant power supplies, good quality power, and good backups can help.
We have used Backup Exec System Recovery for some critical systems, not so much for daily backup but as a recovery tool. We can restore to different hardware, if available, and we also use the software to convert the backup image to a virtual machine. If a server fails and we need to wait for hardware repairs, we can start the VM on a different server or workstation and limp along. Not perfect, but it can be up and running quickly.

Regarding SANs: almost anything you use will be redundant. Even if it's a single enclosure, inside will be dual power supplies, dual connectors, and dual 'heads', each with links to all disks. Even something as simple as an MD3000 sold by Dell has all these features. SANs are designed to be the storage core for your boxes, so they're built to survive just about any random hardware failure.
That being said, you have a point that redundancy isn't always the best option, especially if it increases complexity (and it will). A better question to ask is: "How much downtime will the company accept?" If the loss of your mail server for a day or two isn't a big deal, then you probably shouldn't bother with two of them. But if a web server outage starts losing you real money every minute, then maybe you should spend the time building a proper cluster for it.

The more servers you have, the more chances of something breaking; that's one way of looking at it. Another is that if your only server breaks, you're up the creek 100%, just as you're saying.
The most common hardware failure is hard drives, as you noted above. However you decide to split operations, you need to be RAIDing your storage (a sketch follows below).
I would vote for a couple of servers (RAIDed, of course) instead of one massive one, both for operational stability and for performance: less software bumping into each other asking for resources, less clutter, more disks to be read from and written to, and so on.
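If you're on Linux without a hardware controller, software RAID is one way to cover that; a minimal sketch with mdadm and hypothetical device names:

```
# Build a 3-disk RAID 5 array out of example devices and put a filesystem on it.
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
mkfs.ext4 /dev/md0

# Watch the initial sync and check array health afterwards.
cat /proc/mdstat
mdadm --detail /dev/md0
```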

I would personally opt for multiple servers. I don't think equipment failure is more likely in this scenario. Yes, you have more equipment that could fail, but the odds of any given unit failing should be constant.
What having multiple servers in a non-redundant/non-HA configuration gives me is the ability to off-load some of the work to another server in the event of a failure. So, say my print server goes down. If I can map a few printers to the file server while I'm fixing the print server, the impact to operations is lessened. And that's where it really matters. We often tend to talk about hardware redundancy, but the hardware is only a tool for continuity of operations.

- Well, your odds of winning the lottery are greater if you buy two tickets, even though it doesn't make much difference really. One server with a 6-hour call to repair might be less expensive than two, even when factoring in losses from six hours of full downtime. While I agree that some services can be moved quickly to a second server, the time required to move larger services might be greater than the time to repair the failed server. "Might" being the key word. It's an interesting problem. Thanks for responding! – Boden Jan 31 '10 at 03:42
I work in a small shop (one-man IT department) and wouldn't swap my multiple servers for a single one under any circumstances. If any one of the servers goes down I have the option of either adding the now-missing services to another machine or even just setting them up on a spare PC. We can live with an outage of an hour or two for most things, but we can't live with a complete outage of all systems. While I can replace any of our servers with a PC, at least temporarily, I don't have, nor can I readily get hold of, anything anywhere near powerful enough to replace all the servers at once.

Your original post hypothesizes that you can't afford a cluster, but you consider solutions with two servers (not including backups). That implies you most likely have three servers on hand, enough to start a cluster.
There are intermediate solutions that avoid a SPoF and are still appropriate for small/medium-sized businesses: node-to-node replication without SAN storage.
This is supported, for example, by Proxmox (and I think it is also supported by XCP-ng/XenServer and probably by ESXi).
Let's consider a three-node setup, all with RAID, redundant PSUs, and redundant networking.
- Nodes A and B have beefy CPUs and lots of RAM.
- Node C is more modest in CPU/RAM but has lots of storage; it provides quorum to the high-availability watchdog and hosts backups.
Then there are two options:
- All VMs normally run on node A and are replicated to node B (which therefore needs decent CPU specs).
- VMs are split between nodes A and B and replicated mutually, some from node A to node B and some from node B to node A.
This kind of setup can tolerate a network failure or a total node failure (any of the three) with a downtime of about one minute (roughly the time needed for a VM to boot up). The downside is the loss of data since the last replication, which, depending on your settings and hardware performance, can be as low as one minute or as high as a few hours.
With the second option (VMs normally split between nodes A and B), you have to prioritize which VMs are allowed to come back online: since your VM load is usually split between two servers, having all of them running on a single node might exhaust its RAM or congest its CPU.
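On Proxmox, for instance, the replication and HA pieces boil down to a couple of CLI calls (or a few clicks in the GUI). The VM ID, node name, and schedule below are placeholders, and the built-in storage replication requires ZFS-backed storage on both nodes, so check the pvesr and ha-manager documentation for your version:

```
# Replicate VM 100's disks from its current node to nodeB every 15 minutes.
pvesr create-local-job 100-0 nodeB --schedule "*/15"

# Let the HA manager restart VM 100 on a surviving node if its node fails.
ha-manager add vm:100 --state started
```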

"While at first this seems like it would be more reliable, doesn't it simply increase the chance of hardware failure?"
- From a hardware standpoint, I don't see how it practically increases the chances of failure. There are far too many variables here, and I have never studied probability, but to oversimplify: let's say Dell makes 1 bad server for every 100,000 they build. Your chances have changed from 1 in 100,000 to roughly 2 in 100,000, or 1 in 50,000 (strictly, 1 - (1 - 1/100,000)^2, which is just shy of 2/100,000). So yes, twice the chance, but at that scale the chances practically are not that different.
- I think perspective is key here. "You're setting yourself up for twice as many failures." Maybe from your perspective, but in both scenarios you gave, email is running on one server and ERP is running on one server. So from the perspective of email or ERP (which is what the business cares about), it is really the same. Unless they get lonely, or like their space ;-)
- I think you should also look at it from a people standpoint. Failure due to human error is arguably more likely, and this way someone would probably only screw up one server at a time. It also makes it easier to identify problems with things like load: if both email and a website run on one server, it takes extra time to find out where the problem is.
It is never this simple. Big beefy servers may be better made or worse made; they may have higher-quality parts, but maybe they generate more heat and aren't cooled properly. A beefy server has more RAM, more CPUs, etc., so in the end you may have just as many CPUs in both scenarios, and maybe a server is not the right unit to think about.
Because of the complexity of the odds, whatever is most cost effective wins, I think. If you have to pay for licenses, one big server may be cheaper than a few smaller ones, depending on the licensing structure.

- I think it does increase the chances of a hardware failure: 1/2 the MTBF, assuming both servers are the same and run the same amount of hours and load... – Scott Lundberg Jan 29 '10 at 21:29
- Scott: Updated to explain a little more; I meant practically. Also, I really do think it is about perspective. – Kyle Brandt Jan 29 '10 at 21:39
- It does increase the chance of failure. A RAID 0 with two drives is more likely to fail early than a single drive. Of course in that case you lose everything, so it's not completely analogous to the situation I'm describing: splitting your services onto two servers instead of running them all on one. The result of a single failure isn't as bad, but I now have more hardware that can fail. – Boden Jan 29 '10 at 21:57
- Thanks for the update! I'm sorry, I should have qualified my question a little better, at least in terms of "beefy". What I'm talking about here is choosing between, say, one HP DL380 with dual processors, a ton of RAM, and 8 hard drives vs. two DL380s with single processors, less memory, fewer hard drives, less controller memory, etc. (just an example...but assume the build quality of the "less beefy" servers is the same as the single "beefy" server). Yes, it costs more for two servers this way, but when does it become worth it? – Boden Jan 29 '10 at 22:07
- I think the HD analogy works well, but on the servers I have managed, hard drives fail pretty often; everything else, not so much. That is why RAID is always used on servers. So in my opinion, software interaction, people screwing up, etc., are the more important influences. In the end my vote is generally for more servers (but it depends on the details): it is better for the business to have one thing fail than both, and it is easier to rebuild or recover one thing than both if you have to. Also, it is less stress for you on that day :-) – Kyle Brandt Jan 29 '10 at 22:14
My default approach is to avoid any centralized infrastructure. For example, this means no SAN, no Load Balancer. You can also call such a centralized approach "monolithic".
As a software architect, I'm working with the customer's infrastructure. That might mean using their own private data-center, or using something like AWS. So I don't usually have control over whether they use a SAN or not. But my software usually spans multiple customers, so I build it as if it will be run on individual machines in isolation on a network.
The Email Example
Email is weird because it's a legacy system (that works). If email were invented today, it would probably use RESTful APIs on web servers, and the data would live in a database that could be replicated using normal tools (transactional replication, incremental backups).
The software architecture solution is that a web application would connect to one of a list of available nodes (at random), and if that node is unavailable it would try another node (at random). A client might get kicked off a server if it's too busy. Here there's no need for a load balancer in front of a web farm, and no need for a SAN for high availability. It's also possible to shard the database per department or per geography.
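A tiny sketch of that client-side failover idea as a shell function; the hostnames and the /health endpoint are invented for illustration:

```
# Try the application nodes in random order until one answers; no load balancer
# or SAN in the path, just more than one place to connect to.
pick_node() {
    for host in $(printf '%s\n' app1.example.com app2.example.com app3.example.com | shuf); do
        if curl -fs --max-time 2 "https://$host/health" >/dev/null; then
            echo "$host"
            return 0
        fi
    done
    return 1
}

NODE=$(pick_node) && echo "talking to $NODE"
```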
Commodity means...
So instead of having 1 or 2 expensive servers and a SAN with internal redundancy measures, you can use several commodity low-power, low-cost machines.
Simplicity - redundancy comes purely from the number of devices. You can easily verify your redundancy by counting machines, and you can more accurately account for their higher individual chance of failure and prepare for it.
Redundancy percentage - If you have 2 servers, if one fails you have 1 left (50%). If you have 10 commodity servers and one fails you have 9 left (90%)
Inventory - a commodity device is readily available from any nearby shop for a great price.
Compatibility - with Fibre Channel and all the various standards for disk volume formats, commodity devices and a software-level architecture mean you are not locked into a single device model or brand.
Performance - with 2 devices on a SAN, they need to be in the same room. With the commodity-machine approach, if you have 5 offices you can have 2 machines in each office, with VPN WAN redundancy between offices. This means software and communication stay on the LAN at <1 ms access times.
Security - building on the high level of redundancy, you can rebuild nodes as a routine, commodity process. Want to rebuild a monolithic 2-server cluster? Get out the manual. By rebuilding machines often (with automation) you keep software up to date and prevent any hacker or virus from gaining a foothold on your network.
Note: you would still need redundant switches and gateway routers.
