As far as I know, Ansible offers no way to safely exit a running playbook, so I am wondering which strategy would be wise to avoid leaving hosts in inconsistent states:
Let's picture an infrastructure of ~300 database hosts and a weekly downtime of only a few hours. On Wednesday we want to update some of them by running a playbook (note the serial!):
- name: patchDB.yml
  hosts: dbservers
  serial: 10
  tasks:
    # updatedbs [...]
Running this against all hosts would take too much time, so at some point I would have to abort the play when the maintenance downtime runs out.
I picture 3 options:
- include a pause with a prompt after each completed batch (serial: 10), to either continue with the next batch (if there is time left) or abort; the playbook is no longer fully automated
- a timed pause (fully automated playbook, but if you miss the pause window you have to sit through another batch)
- a small custom inventory/group per maintenance window so that the total runtime becomes predictable, e.g. 3 h available, estimated patch duration per batch = 30 min, serial: 10 as the maximum possible (load on controller): 180 min / 30 min = 6 batches * 10 hosts = 60 hosts total per maintenance window, defined in a custom inventory
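To make the first two options concrete, here is a rough sketch of a pause at the end of each batch. It assumes the pause task is simply placed last in the task list; the play name, the debug placeholder task and the prompt wording are made up for illustration and stand in for the real patch tasks. As far as I understand it, pause runs once per batch rather than once per host, so with serial: 10 the prompt should appear after every 10 hosts (including after the last batch).

```yaml
- name: Patch database hosts in batches (illustrative sketch)
  hosts: dbservers
  serial: 10
  tasks:
    # Placeholder only: the real patch tasks would go here.
    - name: Stand-in for the actual patch tasks
      ansible.builtin.debug:
        msg: "patching {{ inventory_hostname }}"

    # Option 1: wait for an operator after each batch.
    # Enter continues with the next batch; Ctrl+C then 'a' aborts the run.
    - name: Ask whether to continue with the next batch
      ansible.builtin.pause:
        prompt: "Batch finished. Press Enter to continue, or Ctrl+C then 'a' to abort"

    # Option 2 (timed pause) would instead add a timeout, e.g.:
    #   ansible.builtin.pause:
    #     seconds: 120
    # During the countdown, Ctrl+C then 'c' continues early and 'a' aborts.
```

Either way the abort only happens between batches, so no host is left half-patched, but somebody has to be watching the terminal when the pause comes up.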
Of course I am aware that Ansible offers plenty of custom failure conditions and meta actions, but simply having limited maintenance time can't really be tied to any checks within the playbook. I do miss a built-in Ansible feature like "aborting via Ctrl+C tells the play to finish the running batch, then exit gracefully".
I am heavily leaning towards (3), since creating custom inventories isn't much of a hassle (just grep whatever you want from the main inventory of all hosts). Does anyone have better practical ideas or experience with large-scale patch deployment? Cheers!