Server / service reboot order

Question

I have a couple of servers that run several services. These services depend on the availability of services on other servers.

Example

ServerA/webservice depends on ServerB/sqlservice and ServerC/blobservice

When scheduling updates / reboots I want to make sure that the servers and services are started in the correct order.

In this example

ServerA, B and C can all reboot at the same time, but the services must be started in this order:

ServerC/blobservice
ServerB/sqlservice
ServerA/webservice

I know that a service can be made dependent on other local services, which makes sure they start in the correct order. How can this be achieved across multiple servers?

score 2 · Accepted Answer · answered Mar 06 '19 at 10:35

There are no out-of-the-box tools for doing this. Everyone I know who does this rolls their own. At Stack Overflow we have our own software called (very creatively) "patcher" which takes care of this for us.

The reason you're unlikely to find a generic turnkey solution is that everyone's needs are so different. For example:

Do you need to check the health of the service before restarting a computer?
- What if you took out the last computer in the cluster?
- What if the service was not healthy before you rebooted it? When it comes back, you may not know whether it is broken because of the recent patches or was broken already.
- What if you have hosts that are disabled, but will automatically re-enable after a reboot due to startup scripts?
- How do you validate the health of the service and the server post-reboot?
Are there special steps that need to be taken prior to a reboot?
- What if a load balancer needs to be adjusted?
- Do you need to flip a virtual IP to another host before the reboot?
- Is there a specific process for removing a server from a cluster safely?
What do you do if you patch one machine in the cluster and it doesn't come up nicely?
- Do you blindly continue patching and cross your fingers it wasn't a patch that broke it?
- Do you stop all patching, even if it was an isolated incident?
- Do you raise an exception and ask for input?
What operating systems are you working with?
- Do you need heterogeneous patching schedules?
- What kind of patch release are you doing? Do you patch as soon as the patches are on the market? Do you wait to see if any are withdrawn?
- Do you need to exclude certain packages/software from being patched?
What happens if the patching server needs patching and rebooting?
- Do you stop patching altogether if the patcher server stops running?
- How do you self-test?
Which machines belong in which clusters/cadences/cohorts?
- Some machines can be patched in parallel. Some need to be patched serially.
- Some can be patched fairly soon after the others.
- Other services can take several hours to rebalance (Elasticsearch), so you patch fewer machines per day.
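
To make a few of the questions above concrete (was the host healthy before the reboot, is it the last healthy one in the cluster, do you stop when one doesn't come back), here is a minimal Python sketch of a serial patch loop. This is not how our patcher works; patch_cohort, is_healthy, patch_and_reboot and min_healthy are hypothetical names, and you would supply the health check and the reboot step for your own environment.

```python
from typing import Callable, Iterable


def patch_cohort(
    hosts: Iterable[str],
    is_healthy: Callable[[str], bool],        # hypothetical: your own health probe
    patch_and_reboot: Callable[[str], None],  # hypothetical: blocks until the host is back
    min_healthy: int = 1,
) -> list[str]:
    """Patch hosts one at a time and return the hosts that were patched.

    Raises RuntimeError as soon as a host fails its post-reboot check,
    so the remaining hosts are left untouched for a human to review.
    """
    hosts = list(hosts)
    patched: list[str] = []
    for host in hosts:
        # Skip hosts that were already unhealthy: a failure after the
        # reboot could not be attributed to the patches.
        if not is_healthy(host):
            continue
        # Never take out the last healthy member of the cluster.
        healthy_peers = sum(1 for h in hosts if h != host and is_healthy(h))
        if healthy_peers < min_healthy:
            raise RuntimeError(f"refusing to reboot {host}: too few healthy peers")
        patch_and_reboot(host)
        if not is_healthy(host):
            raise RuntimeError(f"{host} unhealthy after reboot; stopping")
        patched.append(host)
    return patched
```

Stopping on the first failure is only one of the policies listed above; continuing blindly or asking for input would replace the second RuntimeError with something else.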

This is just a short list of the problems we worked to overcome with our patching/rebooting solution, and someone else's list is going to look completely different.
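
For the start-order part of the question, a home-grown approach could look roughly like the Python sketch below: describe the cross-server dependencies as a graph, sort it topologically, and wait for each service to report healthy before starting the next one. The start and is_up callables, the timeout values and the DEPENDS_ON map are assumptions for illustration. The map only encodes the dependency stated in the question (webservice needs sqlservice and blobservice), so the relative order of the other two is not fixed; if sqlservice really has to wait for blobservice, add that edge.

```python
import time
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Mapping

# Each service maps to the set of services it depends on.
DEPENDS_ON = {
    "ServerA/webservice": {"ServerB/sqlservice", "ServerC/blobservice"},
    "ServerB/sqlservice": set(),
    "ServerC/blobservice": set(),
}


def start_in_order(
    depends_on: Mapping[str, set],
    start: Callable[[str], None],   # hypothetical: e.g. a remote "start service" call
    is_up: Callable[[str], bool],   # hypothetical: e.g. a TCP or HTTP health probe
    timeout: float = 300.0,
    poll: float = 5.0,
) -> list[str]:
    """Start every service after its dependencies and return the order used."""
    started: list[str] = []
    # static_order() yields dependencies before the services that need them.
    for service in TopologicalSorter(depends_on).static_order():
        start(service)
        deadline = time.monotonic() + timeout
        # Block until the service answers its health check or we give up.
        while not is_up(service):
            if time.monotonic() > deadline:
                raise TimeoutError(f"{service} did not come up within {timeout}s")
            time.sleep(poll)
        started.append(service)
    return started
```

A cycle in the map makes TopologicalSorter raise graphlib.CycleError, which is a useful early warning that the dependency description itself is wrong.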
