3

I recently took over managing a small startup. Like most small startups, I would think, we have been doing whatever we wanted in production virtually whenever we thought it was okay. People are careful and things have worked very well. We have also been able to resolve issues very quickly, which the clients are very grateful for.

However, yesterday we had an issue where an admin, on their own, decided to change a server name and update software to get it more in line with everything else. The devs were notified; however, the name change killed our message queue system, which in turn basically shut us down for hours. From there, a series of cascading failures followed, and the VM hosting the message queue actually had to be killed and a new VM created. No one was pleased.

This should have been verified in a non-production environment first.

I was wondering: what maintenance is allowed in production during business-critical times? Some, I would imagine, but how much?

Telavian
  • `I was wondering what maintenance is allowed in production during business-critical times?` - Usually none, unless it can be done with no impact to users and customers. Also, this wasn't maintenance, this was a change. This should have gone through some type of change management process or approval chain. – joeqwerty Jan 22 '16 at 16:28
  • "During critical times" is the key phrase here. Lots of companies (including a previous employer) would have a total change freeze during critical year-end times because of the critical processing that had to happen. If an admin is confident that a change won't impact anything, then it does, it's time to 1) educate the admin, or 2) replace the admin. Repeat offenders have no place in a production support environment. – Tim S. Jan 22 '16 at 18:01
  • Why do you only have one server for that service? Single Point of Failure elimination should be very high on your priority list. – Tom O'Connor Jan 22 '16 at 18:53
  • This is actually a great question. Sure enough, there isn't a definitive answer, but different approaches will work for different settings. Let's hear them! I for one completely disagree with @joeqwerty: I strongly recommend doing as much maintenance as possible during "production hours". Only then can you know that your system is fail-safe and actually exercise the task. Also, you usually have all the people ready to help if required, which is usually not the case off-hours. I do agree, though, that it should have been tested properly (whatever that means for your organization). – serverhorror Jan 28 '16 at 17:56
  • @ServerHorror Thanks for the reminder... we need to push for a downvote comments feature. – HopelessN00b Jan 28 '16 at 19:14

2 Answers

4

Maintenance can be done at any time so long as it doesn't impact business systems.

In your case, where a change caused a critical failure, the issue wasn't that it couldn't be done; it's that either you have no change notification process or the admin didn't follow it. The fact that there was a name change was not communicated to the people responsible for the uptime of the service. If the admin is the service owner (and in a small business that's very likely), then his suitability for that role needs to be examined, as it's his job to determine the impact of any change affecting his service.

Test environments are fine, but unless rigorously maintained they are not going to prove out every issue. While testing changes in a test environment is certainly a best practice, it's no substitute for a back-out plan (which should also be tested).

Lastly, another lesson to be learned here is that developers are not admins. I suspect that, as you said, the "devs were notified". I'll bet a nickel they were not asked, "What happens if the machine name changes?" I would have at least had an email in hand from the devs stating that changing the machine name would have no bearing on the app.

Jim B
  • Totally agree here. Test environments are worth gold if they're maintained and actually used properly, and "telling a dev" doesn't mean anything, because they are devs, not admins, and usually can't/don't see the big picture and aren't familiar with back-end systems that may be critical. – Tim S. Jan 22 '16 at 18:04
2

You learn from the mistake and take steps to analyze the impact of environment changes before making them.

Documentation goes a long way here, but also try to evaluate why this type of change would have such a wide-reaching impact. Were there things hard-coded in applications? Is there room for improvement in how the system functions?

A hostname change isn't a small thing, but isn't something that should completely break you either.
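
The question doesn't say which queue system or language is involved, but one common way to make a rename survivable is to have clients pick up the broker address from configuration and a DNS alias rather than a hard-coded machine name, so a rename only means repointing the alias. A minimal sketch in Python, using only the standard library; the names `MQ_HOST`, `mq.internal.example.com`, and port 5672 are hypothetical placeholders:

```python
import os
import socket

# Hypothetical example: resolve the message-queue host from configuration
# (an environment variable here) rather than a hard-coded machine name.
# "mq.internal.example.com" stands in for a DNS alias (CNAME) that can be
# repointed when the underlying server is renamed or replaced.
MQ_HOST = os.environ.get("MQ_HOST", "mq.internal.example.com")
MQ_PORT = int(os.environ.get("MQ_PORT", "5672"))  # 5672 is the usual AMQP port


def check_broker_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Cheap pre-flight check that the configured broker endpoint answers at all."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if check_broker_reachable(MQ_HOST, MQ_PORT):
        print(f"Broker reachable at {MQ_HOST}:{MQ_PORT}")
    else:
        print(f"Cannot reach broker at {MQ_HOST}:{MQ_PORT} - check the DNS alias or config")
```

A check like this can also run as part of a pre-change or post-change verification step, so a rename that breaks the alias is caught immediately rather than hours later.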

ewwhite
  • Thanks for the ideas. It was more to do with updating the message queue system. However, the problems all seemed to combine to cause major pain. – Telavian Jan 22 '16 at 16:08
  • That can happen. Really, I've definitely seen fragile environments. But this is how companies define policy. I've had some places that are too conservative to allow changes... I've had others that allow it, but assume a certain level of caution. This is about communication and knowledge sharing. – ewwhite Jan 22 '16 at 16:09