I'm working with a complex system that has a couple of decades of history behind it. It started as a client/server dispatching app, but it's gotten more complicated. Originally, each customer had his own instance, running on his own servers. Some customers are still running in that mode, but others are running in a Software-as-a-Service mode, where all the applications run on our servers. And we've added web interfaces, so we now have hundreds of customers who access their systems solely through the web.
As it currently exists, each installation of the system consists of:
- a database: in which nearly every record has a primary key beginning with "customerid", so multiple customers can run against the same database. (See the first sketch after this list.)
- an installation directory: a directory on the SAN whose subdirectories hold the executables, log files, configuration files, disk-based queues, and pretty much everything else involved in the system that isn't a website
- background apps: a bunch of applications, located in a subdirectory of the installation directory but potentially running on one or more application servers, which are responsible for communicating with various off-site systems, mobile users, etc. They can be configured to run as Windows Services, or run from the command line. (See the dual-mode sketch after this list.)
- client apps: another bunch of applications, located in the same subdirectory but running on any number of user machines, with which managers and dispatchers interface with the system: dispatching work to the various mobile users, running reports on the work done, etc.
- web apps: a couple of web sites/applications/services that allow dispatch users to perform certain dispatching functions, and allow mobile users to complete their assigned work from any web browser. Generally, there's a many-to-one relationship between web sites and system installations: we'll have a number of sites on a number of server platforms configured to run against any given installation of the system, and use a load balancer to distribute incoming users across them.
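To make the database bullet concrete, here's a minimal sketch of the "customerid"-first key pattern. The table and column names (WorkOrders, OrderId, Status) are invented for illustration, not our actual schema:

```csharp
// Minimal sketch (illustrative names only) of the shared-schema pattern:
// every table's primary key begins with CustomerId, so one database can
// host many customers, and every query is scoped to exactly one of them.
using System.Data.SqlClient;

static class WorkOrderStore
{
    public static string GetOrderStatus(
        string connectionString, int customerId, int orderId)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Status FROM WorkOrders " +
            "WHERE CustomerId = @customerId AND OrderId = @orderId", conn))
        {
            cmd.Parameters.AddWithValue("@customerId", customerId);
            cmd.Parameters.AddWithValue("@orderId", orderId);
            conn.Open();
            return (string)cmd.ExecuteScalar();
        }
    }
}
```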
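And the "Windows Service or command line" bit on the background apps is, conceptually, the standard dual-mode pattern. This isn't our literal code, just a sketch of the idea, with an invented service name:

```csharp
// Sketch of a dual-mode background app: runs as a Windows Service under the
// service control manager, or in the foreground when launched from a console.
// "DispatchBridge" is an invented name.
using System;
using System.ServiceProcess;

class DispatchBridge : ServiceBase
{
    protected override void OnStart(string[] args) { /* start worker threads */ }
    protected override void OnStop() { /* shut down cleanly */ }

    static void Main(string[] args)
    {
        if (Environment.UserInteractive)
        {
            // Command-line mode: run the same logic in the foreground.
            var app = new DispatchBridge();
            app.OnStart(args);
            Console.WriteLine("Running; press Enter to stop.");
            Console.ReadLine();
            app.OnStop();
        }
        else
        {
            // Service mode: hand control to the service control manager.
            ServiceBase.Run(new DispatchBridge());
        }
    }
}
```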
We have upwards of a dozen different installations, each serving from one to several hundred customers. (And from a handful to several hundred users per customer.)
The older background apps are written in unmanaged C++, the newer ones in C#. The client apps are written in VB6, running against a COM object written in unmanaged C++. The websites and services are ASP.NET and ASP.NET MVC, written in C#.
Clearly, it's gotten quite complicated over the years, with a lot of parts and a lot of inter-relationships. That it still works, and works well, surprises me, and makes me think we didn't do too badly when we first architected the beginnings, 20 years ago. But...
At this point, our biggest problem is the effort needed to install updates and upgrades. Much of the system is decoupled, so we can change one communications program, or fix a web page, etc., without much difficulty. But any change to the database schema pretty much mandates a system-wide change. And that takes significant time, affects many customers, and involves real risk. So the implementation of fixes gets delayed, which makes the risk when we do do an upgrade even higher, which results in more delay, and it's generally hurting our responsiveness.
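To give a feel for that coupling: this isn't literally our mechanism, but the effect is as if every app carried a startup check like the one below, where the SchemaVersion table and the expected value are hypothetical. Once the schema moves, everything that touches the database has to move with it:

```csharp
// Illustrative only - not our actual code, but it captures the effect:
// every app is effectively built against one schema version, so a schema
// change forces a lockstep upgrade of everything that touches the database.
using System;
using System.Data.SqlClient;

static class SchemaGuard
{
    const int ExpectedSchemaVersion = 42; // baked into every build (hypothetical)

    public static void EnsureSchemaVersion(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT TOP 1 Version FROM SchemaVersion", conn))
        {
            conn.Open();
            int actual = (int)cmd.ExecuteScalar();
            if (actual != ExpectedSchemaVersion)
                throw new InvalidOperationException(string.Format(
                    "Schema is v{0} but this build expects v{1}; refusing to start.",
                    actual, ExpectedSchemaVersion));
        }
    }
}
```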
So, I'm looking for advice as to architectural changes we might make, that would make upgrades less risky and less expensive.
In my ideal world, we'd never upgrade a running installation. We'd install an upgrade in parallel, test it, and once we were confident that it was working, move customers from the old system to the new one, at first one at a time and later in bulk as we grew confident, with the ability to roll a customer back to the old system if things didn't work. But I see some problems with that:
- We don't know what customer a user belongs to until after he's logged in. (See the routing sketch after this list.)
- Moving a customer from one system to another involves copying hundreds of thousands of database records, and applying schema changes in the process. (A sketch of the sort of copy I mean also follows this list.)
- Moving a customer from one system to another also involves copying who knows how many files in our disk-based queues, and other assorted supporting files.
- Rolling back is, I think, necessary. But it's going to be even more difficult.
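For the first bullet, what I imagine is a small post-login router that looks up which installation currently owns a customer's data and sends the session there. All the names and URLs below are hypothetical:

```csharp
// Hypothetical post-login router: we only learn the customer at login, so a
// small mapping (presumably a shared lookup table, hard-coded here for the
// sketch) would send each session to the installation that owns the data.
using System.Collections.Generic;

class InstallationRouter
{
    readonly Dictionary<int, string> customerToInstallationUrl =
        new Dictionary<int, string>
        {
            { 1001, "https://install-old.example.com" }, // not yet migrated
            { 1002, "https://install-new.example.com" }, // moved to the upgrade
        };

    // Called after authentication, once we finally know the customer.
    public string UrlForCustomer(int customerId)
    {
        string url;
        if (!customerToInstallationUrl.TryGetValue(customerId, out url))
            throw new KeyNotFoundException(
                "No installation registered for customer " + customerId);
        return url;
    }
}
```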
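And for the data move itself, this is the sort of per-customer copy I have in mind: stream every record for one customer out of the old installation, reshape it to the new schema in flight, and insert it into the new one. All names, and the "split one column into two" transform, are invented for illustration:

```csharp
// Hypothetical per-customer move, applying a schema change in flight.
// Here the new schema splits the old Address column into Street and City.
using System.Data.SqlClient;

static class CustomerMover
{
    public static void MoveWorkOrders(
        string oldConnStr, string newConnStr, int customerId)
    {
        using (var oldConn = new SqlConnection(oldConnStr))
        using (var newConn = new SqlConnection(newConnStr))
        {
            oldConn.Open();
            newConn.Open();

            var select = new SqlCommand(
                "SELECT OrderId, Address FROM WorkOrders WHERE CustomerId = @cust",
                oldConn);
            select.Parameters.AddWithValue("@cust", customerId);

            using (var reader = select.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Apply the schema change in flight.
                    string[] parts = reader.GetString(1).Split(',');

                    var insert = new SqlCommand(
                        "INSERT INTO WorkOrders (CustomerId, OrderId, Street, City) " +
                        "VALUES (@cust, @order, @street, @city)", newConn);
                    insert.Parameters.AddWithValue("@cust", customerId);
                    insert.Parameters.AddWithValue("@order", reader.GetInt32(0));
                    insert.Parameters.AddWithValue("@street", parts[0].Trim());
                    insert.Parameters.AddWithValue("@city",
                        parts.Length > 1 ? parts[1].Trim() : "");
                    insert.ExecuteNonQuery();
                }
            }
        }
    }
}
```

Rolling a customer back would mean doing that copy in reverse, down-converting the schema, and flipping the router mapping back, which is why the last bullet worries me.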
What we have is working, but it's not working well. And I was hoping for some advice.
I'm not looking for answers, exactly; I'm looking for ideas on where to look. Anyone have any ideas on where I could find information on how to deal with structuring systems of this scale?