Problem:

I should start by saying that this is for a software house, and it's internal. None of the people involved are "users"; they are all staff.

  • We test on servers, including upgrading existing installations to prove that the upgrade process works etc.
  • People sometimes log into these servers to do testing of changes.
  • They don't always put the server back to how it is expected to be in production, so the environment is then considered "dirty".
    • Checking this involves considerably more than "this app is installed correctly". It is much more about ensuring that the following, as a subset, match production:
      • network interfaces and routing
      • configuration files
      • packages deployed to the server
      • the scripts on the server
      • VM config
      • disk usage, permissions, locations etc.
      • stuff I've not thought of
  • This costs time and money when we then have to find out why a test or upgrade didn't work as expected.
  • For those who've helpfully suggested CI/CD to solve this (which I'm a big fan of and agree with for every other use case), a "wipe and redeploy" takes around 4 hours.

Question:

  • Is there a method of easily verifying that an installation is "as it should be"?
  • Note that things like md5summing the disk aren't going to help, as there are time-dependent files on there.
  • Note that if someone monkeys around with the server, I don't care, provided they put it back. This means file timestamps are not going to help either.

Before I get my hands dirty scripting an endless list of hash checks of all "essential" files (of which I am going to miss at least one or two, and someone will naturally change those and only those ones, cue all of the angry emojis): is there a better way of doing this that I can append to a build or upgrade script and that will tell me whether the installation is reliable or not?

Tom Newton
  • What is your understanding of "as it should be"? That the system is configured for a specific purpose, the packages are not changed, the system is up to date...? – Romeo Ninov May 11 '23 at 14:38
  • Can you give a concrete example? If you configured an app on a server, that app config should not be changed by the users; they should only be able to use it "as is". – yield May 11 '23 at 15:07
  • Although it comes with a big learning curve when you start from scratch, you want both your test and production systems to become more like cattle and less like pets. That means introducing automated deployments and centralised configuration management. That makes it quick and easy to wipe a test server, removing any and all test artefacts, and to re-install it to a particular baseline and a consistent desired state. It also allows you to prepare for automated upgrades and to achieve a new desired state in your production environments. – HBruijn May 11 '23 at 15:44
  • Fully agree, you are absolutely right. However, in this case a fresh deploy takes around 4-6 hours, whereas upgrades take around a quarter of that. – David Boshton May 11 '23 at 15:51
  • Are btrfs/ZFS snapshots something that may be considered? You would merely have to revert to a previous snapshot which would be a matter of minutes. – Ginnungagap May 11 '23 at 22:15

1 Answer

To verify that an installation is "as it should be", you could try AIDE (Advanced Intrusion Detection Environment): https://aide.github.io/

In my experience, it takes a fair amount of configuration to identify the files and paths that should be ignored because they change often, and you get a lot of false positives before everything is set up right. But it's definitely better than trying to roll your own system integrity checker, which would run into the same problems.
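To make the baseline-and-compare idea concrete, here is a minimal Python sketch of what such a check does in principle. It is not AIDE and not a substitute for it: the function names `snapshot` and `diff`, the exclude patterns, and the choice of attributes (content hash, permission bits, owner, group) are all assumptions made for illustration. It also covers only regular files, not packages, network settings or VM config. Timestamps are deliberately left out, so a file that was changed and then put back compares as unchanged.

```python
import fnmatch
import hashlib
import os
import stat


def snapshot(root, exclude=()):
    """Map each regular file under root to (sha256, permission bits, uid, gid).

    exclude is a list of glob patterns matched against the path relative
    to root (hypothetical convention for this sketch). Timestamps are
    not recorded on purpose.
    """
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if any(fnmatch.fnmatch(rel, pattern) for pattern in exclude):
                continue  # volatile path, e.g. logs or caches
            st = os.lstat(full)
            if not stat.S_ISREG(st.st_mode):
                continue  # this sketch ignores symlinks, sockets, devices
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            state[rel] = (
                digest.hexdigest(),
                stat.S_IMODE(st.st_mode),
                st.st_uid,
                st.st_gid,
            )
    return state


def diff(baseline, current):
    """Return the paths that were added, removed or changed since baseline."""
    base_paths, cur_paths = set(baseline), set(current)
    return {
        "added": sorted(cur_paths - base_paths),
        "removed": sorted(base_paths - cur_paths),
        "changed": sorted(
            p for p in base_paths & cur_paths if baseline[p] != current[p]
        ),
    }
```

A build or upgrade script could take a `snapshot` on a known-good server, store it (for example as JSON), and later fail if `diff` reports anything. The exclude list is where the tuning effort described above goes: every path that legitimately changes has to be found and listed, or it shows up as a false positive.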

divestoclimb