
In my current job, an error has been identified in the live code base of one of our services.

We have identified a relatively small code change which would fix the issue, and it has been confirmed to work in a test environment.

But, as this service is quite old, with plans to phase it out over the next 12 months or so and migrate everything towards the newer services, an architectural decision has been made to make no more changes to the current service (with exceptions for extreme cases involving minor config changes, but our fix is classed as a bigger change).

The alternative fix is to migrate and redevelop the existing code in the new service. However, this is a much larger chunk of work, which will need to be more extensively tested, and it also means that the live production errors will remain until this work is done.

I'm trying to understand: has anyone encountered something like this before, and what reasons would there be, in architectural terms, not to fix code which is currently in your live system?

KeithC

2 Answers


The time spent working on the fix may be weighed against the time it takes to implement and solve the problem in the new service.

The architects may decide that the time would be better spent developing a new service which is more robust (as you say, it will be migrated soon anyway), rather than solving the same problem twice in two different ways.

Another factor to take into consideration: if the current code base is old and difficult to work with, is there anything to suggest that the fix you mention, without a full suite of regression testing being completed (which also means more time and effort spent on something that will be phased out soon), may actually end up breaking even more of your system?

Pat co

If the risk of implementing a fix outweighs the reward, it doesn't make sense - e.g. if the error affects 1% of the users 1% of the time, but a fix risks hours of downtime that affects 100% of users. Unless no-one is using it anyway, in which case the deployment will be wasted effort.
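
To make that trade-off concrete, here is a back-of-envelope comparison. The figures below (user count, outage odds and duration) are purely illustrative assumptions, not anything from the question:

```python
# Hypothetical figures for a rough risk/reward comparison.
total_users = 10_000

# Cost of leaving the bug in place: 1% of users hit it 1% of the time,
# measured over the next 30 days (30 * 24 hours).
bug_user_hours = total_users * 0.01 * 0.01 * 30 * 24   # 720 affected user-hours

# Cost of a risky deployment: assume a 5% chance of 2 hours of downtime for everyone.
deploy_user_hours = 0.05 * 2 * total_users              # 1,000 expected user-hours

print(f"Expected impact of leaving the bug: {bug_user_hours:,.0f} user-hours")
print(f"Expected impact of a risky deploy:  {deploy_user_hours:,.0f} user-hours")
```

With these made-up numbers, the risky deployment actually costs more than the bug, which is exactly the kind of calculation that can justify leaving an error in place.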

However, provided that a couple of things are in place, I see no reason to leave broken code in a production environment. These things, in my opinion, are:

  1. Automated deployments through all environments - so the exact sequence of steps to deploy working code to the test environment can be executed in production. Anything manual introduces the possibility of mistakes.
  2. Continuous integration pipeline with decent test coverage - this means that you know the fix doesn't break anything else, so again, minimizing the risk of deploying it.
  3. A smoke test in the production environment to ensure that everything works after changes are deployed (see the sketch after this list).
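
As a rough illustration of point 3, a production smoke test can be as small as a script that hits a few key endpoints after each deployment and exits non-zero if anything misbehaves, so the pipeline can alert or roll back. This is only a minimal sketch; the URLs and expected status codes are hypothetical placeholders, not anything from your system:

```python
import sys
import urllib.request

# Hypothetical endpoints to check after a deployment; replace with your own.
CHECKS = [
    ("https://example.com/health", 200),
    ("https://example.com/api/orders/ping", 200),
]

def smoke_test():
    """Hit each endpoint and collect anything that doesn't respond as expected."""
    failures = []
    for url, expected_status in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status != expected_status:
                    failures.append(f"{url}: got {response.status}, expected {expected_status}")
        except Exception as exc:  # connection errors, timeouts, HTTP errors
            failures.append(f"{url}: {exc}")
    return failures

if __name__ == "__main__":
    problems = smoke_test()
    if problems:
        print("Smoke test FAILED:")
        for problem in problems:
            print(f"  - {problem}")
        sys.exit(1)  # non-zero exit lets the pipeline alert or roll back
    print("Smoke test passed.")
```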

I'm sure there's a good reason for the architectural freeze (or it might just be political) - but if a team is afraid of deploying changes because of the risk involved, that should trigger warning bells. Again, I'm not saying it's the case here - just a general comment - but if it comes down to a lack of faith in the quality of the system and the deployment process, there are likely some things that need to be revisited. Some of the big players in the industry (think Facebook, Twitter, and similar companies) deploy multiple times a day - because they have a solid process that allows them to do that safely.

Riaan Nel