2

Looking at using DRBD or a clustered file system to improve uptime when hardware failure strikes in a small business environment.

We currently use a single server box running Linux and Samba as a file server, with the web server and database running in a VM. I was looking at adding a second server and putting the files and the VM onto the distributed file system. The base OS is more static and can easily be managed manually (copying config files when they change, restoring the base OS from full backups if needed, etc.).

My question is about the failover scenario if done manually. If server 1 goes down, is failover completed simply by assigning server 1's static IP to server 2 (server 1 being down and in a state of needing repair), starting Samba, starting the VM (which would keep the same static IPs it had when running on server 1), and starting the backup services?

This sounds like a quick and simple process, almost too simple. Am I missing something? It could also easily be automated with a script that someone with little technical proficiency could be directed to run in the event of a failure.
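For illustration, the manual failover steps described above could be sketched as a script like the following. All names here are assumptions, not taken from the question: a DRBD resource called `r0`, the service IP `192.168.1.10` on `eth0`, Samba managed by systemd, and a libvirt VM named `webdb`. Treat it as a sketch of the procedure, not a tested implementation.

```shell
#!/bin/sh
# Manual failover sketch, to be run on server 2 after server 1 has failed.
# Resource names, IP, interface, and VM name are hypothetical examples.
set -e

# 1. Promote the local DRBD resource to primary and mount the replicated volume.
drbdadm primary r0
mount /dev/drbd0 /srv/data

# 2. Take over server 1's service IP on this host.
ip addr add 192.168.1.10/24 dev eth0

# 3. Start the file server, the VM, and the backup services.
systemctl start smb nmb
virsh start webdb
systemctl start backup.service
```

One caveat a script like this cannot handle on its own: it must never run while server 1 is still alive, or you end up with a duplicate IP and two DRBD primaries (split brain). That is exactly the gap that proper fencing, mentioned in the answer below, closes.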

Without a second server, downtime after a hardware failure could easily stretch to days while we wait for on-call IT support and the parts needed. With the second server, downtime would be at most a matter of hours (if no one in the office is proficient enough to perform such operations; minutes if someone is).

Damon
  • 429
  • 2
  • 12

1 Answer

3

The failover process you're describing is as simple as it is correct. Using DRBD is the key step in creating redundancy, since it eliminates a single point of failure such as shared storage.

The manual failover you mention can easily be automated with Pacemaker/Corosync, so that there's no need for manual intervention. I would prefer this over self-written scripts, as it also takes care of fencing failed nodes, so that you don't run into a split-brain scenario (which could corrupt all your data).
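A minimal sketch of what such a Pacemaker configuration might look like, using the `crm` shell. The resource names, IP address, device, and mount point are examples chosen to match nothing in particular from the question; adapt them to your setup:

```shell
# Hypothetical Pacemaker configuration (crm shell). Assumes a DRBD
# resource "r0" backing /dev/drbd0, mounted at /srv/data, with Samba
# run as a systemd service and a floating service IP.

# DRBD as a master/slave resource: one primary, replicated to two nodes.
crm configure primitive p_drbd ocf:linbit:drbd \
    params drbd_resource=r0 op monitor interval=20s
crm configure ms ms_drbd p_drbd \
    meta master-max=1 clone-max=2 notify=true

# Filesystem, floating IP, and Samba, grouped so they move together.
crm configure primitive p_fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/srv/data fstype=ext4
crm configure primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.10 cidr_netmask=24
crm configure primitive p_smb systemd:smb
crm configure group g_services p_fs p_ip p_smb

# Run the service group only on the node where DRBD is primary,
# and only after DRBD has been promoted there.
crm configure colocation col_services inf: g_services ms_drbd:Master
crm configure order ord_services inf: ms_drbd:promote g_services:start
```

On top of this you would still configure STONITH/fencing devices; without fencing, Pacemaker cannot safely promote the surviving node after a failure.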

Keep in mind that "real" HA requires complete (or at least the maximum achievable) separation of systems: a separate room (or at least rack), separate UPSes, redundant switching, etc. Single points of failure usually undermine your whole effort to optimise availability.

Henrik
  • 698
  • 5
  • 19
  • Definitely would like to move to a true HA setup, but we are small enough that a little downtime is OK, just not days. If we have a second server ready to go that can carry us, we do not need to make rash emergency calls or buy parts from a high-priced source in town when we could get a deal elsewhere, even if we have to wait. Also, some things take a bit of time to diagnose. This would allow us to keep costs down during such times and still stay up and running. I am sure we will eventually use Heartbeat, Pacemaker, etc., but for now we only have time for the next baby step. – Damon May 15 '15 at 02:45