Safely abort Ansible playbooks when server downtime runs out

Question

AFAIK Ansible offers no way to safely exit a running playbook, so I am wondering which strategy would be wise to avoid leaving hosts in inconsistent states:

Let's picture an infrastructure of ~300 database hosts and a weekly downtime of only a few hours. On Wednesday we want to update some of them by running a playbook (note the serial!):

- name: patchDB.yml
  hosts: dbservers
  serial: 10
  tasks:
    updatedbs [...]

This would take too much time, so at some point I would have to abort the play when the maintenance downtime runs out.

I picture 3 options:

1. include a pause with prompt after each completed batch (serial: 10) to either continue with the next batch (if there is time left) or abort; the playbook is no longer fully automated
2. a timed pause (fully automated playbook, but you'll have to sit through another batch if you miss the pause window)
3. a small custom inventory/group for each maintenance window so the total runtime becomes predictable, e.g. 3h available, estimated patch duration per batch = 20min, serial: 10 max. possible (load on controller) >> 180min / 30min = 6 batches * 10 = 60 hosts total per maintenance window, defined per custom inventory
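
The first two options (prompted pause and timed pause) could look roughly like the sketch below. It is untested against a real fleet: the debug task only stands in for the actual patch tasks, and the prompt text is made up. With serial set, pause prompts once per batch rather than once per host.

```yaml
- name: patch databases with a gate after each batch
  hosts: dbservers
  serial: 10
  gather_facts: false

  tasks:
    # Placeholder for the real patch tasks (assumption)
    - name: patch this host
      ansible.builtin.debug:
        msg: "patching {{ inventory_hostname }}"

    # Prompted pause: Enter continues with the next batch;
    # Ctrl+C followed by "a" aborts the playbook at this batch boundary.
    - name: gate before the next batch
      ansible.builtin.pause:
        prompt: "Batch done. Enter = next batch, Ctrl+C then 'a' = abort"

    # Timed pause: replace the prompt above with a timeout, e.g.
    #   ansible.builtin.pause:
    #     seconds: 60
    # Ctrl+C then "c" continues early, Ctrl+C then "a" aborts.
```

Either way the abort happens between batches, so no host is interrupted in the middle of its tasks. The prompt also appears after the last batch.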

Of course I am aware that Ansible offers plenty of custom failure conditions and meta action plugins, but simply having a limited maintenance time can't really be tied to any checks within the playbook. I do miss a built-in Ansible feature like "aborting via ctrl+c telling the play to finish the remaining batch, then exiting gracefully".

I am heavily leaning towards (3), since creating custom inventories isn't such a hassle (just grep whatever you want from the main all-inventory). Do you guys have any better practical ideas or experience in large-scale patch deployment practice? Cheers!
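
A per-window inventory for the third option might look like this. It is only a sketch: the file name and host names are hypothetical, and the sizing comment reuses the 30-minute figure from the division above.

```yaml
# inventory/window_wed.yml (hypothetical file name and host names)
# Sizing: 180 min window // 30 min per batch = 6 batches; 6 * 10 hosts = 60 hosts
all:
  children:
    dbservers:
      hosts:
        db001.example.com:
        db002.example.com:
        # ... up to 60 hosts picked from the main inventory
```

Keeping the group name dbservers means the playbook itself stays unchanged; only the inventory passed with -i differs per window.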

score 0 · Answer 1 · answered Mar 29 '23 at 15:53

However you batch the hosts up across plays and with serial, there is a tradeoff: larger batches run more in parallel and finish sooner, while smaller batches take longer but affect fewer hosts at a time.

Your option #3, estimating runtime, could be improved by checking whether the window has expired before starting each batch, including a buffer for the estimated time a batch takes.


---

- name: Very first play
  hosts: localhost
  gather_facts: false

  tasks:
    # Unfortunately cannot keep actual datetime objects
    # as Jinja converts them to strings
    # Using set_fact to not lazy evaluate; need the time now not later
    - name: playbook start!
      set_fact:
        playbook_start: "{{ now().timestamp() }}"

    # now is a function, so print the stored start time instead
    - name: show playbook start
      debug:
        var: playbook_start
        verbosity: 1

    # TODO Consider checking whether now is in a downtime window on some calendar

- name: Time window respecting play
  hosts: localhost,127.0.0.2
  gather_facts: false
  serial: 1

  vars:
    # Configuration, in seconds
    # Realistically would be much longer
    # Total duration for all hosts:
    downtime_planned_duration: 5
    # Estimate of time for each batch to take:
    downtime_buffer: 3

  pre_tasks:
    - name: host start!
      set_fact:
        host_start: "{{ now().timestamp() }}"

    # Default behavior of failed hosts is to stop and not proceed with play
    # Checking this before doing anything means
    # hosts are not interrupted in the middle of their work
    # Failed hosts can be reported on, and run again such as with retry files
    - name: check if time window expired
      assert:
        # Conditions in "that" are already Jinja expressions, no {{ }} needed
        that:
          - "(playbook_duration | int) + (downtime_buffer | int) < (downtime_planned_duration | int)"
        success_msg: "Still in time window, proceeding with host"
        fail_msg: "Insufficient buffer in time window, not starting host"
      vars:
        playbook_duration: "{{ (host_start | int) - (hostvars['localhost'].playbook_start | int) }}"

  tasks:
    # Do work here, run some roles
    - name: sleep a bit to simulate doing things
      pause:
        seconds: 3

Unfortunately, when implemented as a play, this is a few lines of time math nonsense that's not easy to reuse. In theory this could be written as something like a callback plugin, and automatically triggered on play events.
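
As a simpler stand-in for the calendar check mentioned in the TODO, the same gate could compare against a fixed wall-clock end of the window instead of a duration. This is a sketch under assumptions: the end time and buffer values are invented, the debug task replaces the real work, and the end time is read as the controller's local time.

```yaml
- name: play that stops starting batches near a fixed end time
  hosts: dbservers
  gather_facts: false
  serial: 10

  vars:
    # Assumed values for illustration; local time of the controller
    downtime_window_end: "2023-03-29 18:00:00"
    # Estimated seconds one batch needs
    downtime_buffer: 1800

  pre_tasks:
    - name: check that a full batch still fits before the window closes
      ansible.builtin.assert:
        that:
          - "((downtime_window_end | to_datetime) - now()).total_seconds() > (downtime_buffer | int)"
        fail_msg: "Not enough time left in the window, not starting this batch"

  tasks:
    - name: placeholder for the real work
      ansible.builtin.debug:
        msg: "patching {{ inventory_hostname }}"
```

This drops the separate first play and the start-time bookkeeping. When every host in a batch fails the assert, the play stops and the remaining batches are never started.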

That's an interesting Ansible-only approach. I was thinking of a similar REST API call, if(downtime left for host){do stuff}, but this obviously requires adequate endpoints. Your idea doesn't seem too complicated either, I am going to look into it, thanks for the inspiration. - Aguirre23, Apr 03 '23 at 13:06
