
We are starting to look at Ansible to replace an old cfengine2 installation. I have a simple playbook (sketched below) that:

  • copies a sudoers file
  • copies a templated resolv.conf (fed with group_vars and host_vars data)
  • checks a couple of services are running
  • checks for presence of a local user
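
In outline, the playbook looks something like this (module arguments, file names and the user name here are illustrative, not the real playbook):

- hosts: all
  tasks:
    # push a checked sudoers file
    - name: copy sudoers in place
      copy:
        src: sudoers
        dest: /etc/sudoers
        owner: root
        group: root
        mode: 0440
        validate: /usr/sbin/visudo -cf %s

    # resolv.conf built from a template, fed by group_vars/host_vars
    - name: build resolv.conf
      template:
        src: resolv.conf.j2
        dest: /etc/resolv.conf

    # make sure a couple of services are up
    - name: ensure sshd is running
      service:
        name: sshd
        state: started
        enabled: yes

    # check a local user exists
    - name: ensure local user is present
      user:
        name: localadmin
        state: present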

The playbook takes over 4 minutes of wallclock time to run against 97 machines (all connected over fast 1gig or 10gig networking, with sub-1ms LAN latency) and consumes over 50% of CPU on the 2-core 4G memory VM when I'm running it.

It takes about 11 seconds to run against a single machine, with about 4sec of user+sys CPU time consumed, which TBH still seems a bit excessive for the amount of work involved.

The obvious bits:

  • I have pipelining explicitly enabled in a playbook-dir local ansible.cfg (relevant settings sketched after this list)
  • I have fact caching to jsonfile enabled, same local ansible.cfg
  • I have forks set to 50, same (I have tried other values)
  • I am sure that Ansible is using SSH not Paramiko and it is using the persistent control socket - I can see the SSH processes being started and persisting during the run.
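
For reference, the relevant parts of that playbook-dir ansible.cfg look roughly like this (the cache path, cache timeout and ControlPersist value are illustrative):

[defaults]
forks = 50
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
fact_caching_timeout = 86400

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s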

Is this level of performance normal or is something wrong with my setup? How can I go about determining what, if so?

Edit: As of Aug 2017, we're still seeing this problem. The Ansible version is 2.2.1 and the playbook has grown since. Up-to-date numbers:

  • 98 hosts
  • `ansible -m ping all` takes 4.6s real, 3.2s user, 2.5s sys
  • a full playbook run takes 4 minutes, using 100% user and ~35% system CPU while doing it (on a 2-core VM deployment server, 100% being one full CPU)
  • target OS is largely CentOS 7, some CentOS 6
  • profiling does not reveal any specific task hotspots AFAICT

Although the playbook is now much bigger, I still don't think there is anything in there to justify that level of CPU load on the playbook server - wallclock time, perhaps, but as far as I can see the deployment server should be largely idle for most of the run; it's mostly file copies and some template expansions.

Note that we are making quite extensive use of host_vars/group_vars (layout sketched below).
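
(The variables live in the conventional layout next to the inventory, roughly like this - the group and host names are illustrative:)

group_vars/
  all.yml
  centos6.yml
  centos7.yml
host_vars/
  host01.example.com.yml
  host02.example.com.yml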

Several people have asked about profiling, tail of a run with profiling:

Tuesday 01 August 2017  16:02:24 +0100 (0:00:00.539)       0:06:22.991 ******** 
=============================================================================== 
yumrepo : centos repos -------------------------------------------------- 9.77s
sshd : copy CentOS 6 sshd config ---------------------------------------- 7.41s
sshd : copy CentOS 7 sshd config ---------------------------------------- 6.94s
core : ensure core packages are present --------------------------------- 6.28s
core : remove packages on VM guests ------------------------------------- 5.39s
resolv : stop NetworkManager changing resolv.conf ----------------------- 5.25s
yumrepo : epel6 gpg key ------------------------------------------------- 3.94s
yumrepo : epel7 gpg key ------------------------------------------------- 3.71s
yumrepo : nsg gpg key --------------------------------------------------- 3.57s
resolv : build resolv.conf ---------------------------------------------- 3.30s
yumrepo : nsg repo ------------------------------------------------------ 2.66s
resolv : check NetworkManager running ----------------------------------- 2.63s
yumrepo : psp repo ------------------------------------------------------ 2.62s
yumrepo : ucs repo ------------------------------------------------------ 2.44s
yumrepo : epel repo ----------------------------------------------------- 2.27s
resolv : check for nmcli ------------------------------------------------ 2.08s
core : remove various unwanted files ------------------------------------ 1.42s
telegraf : write telegraf.conf file ------------------------------------- 1.13s
core : copy sudoers in place -------------------------------------------- 0.94s
core : ensure sshd is running ------------------------------------------- 0.90s
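
(For reference, that output was produced by enabling the profile_tasks callback for the run, along these lines - the playbook name is a placeholder:)

ANSIBLE_CALLBACK_WHITELIST=profile_tasks ansible-playbook site.yml
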
user53814
  • Do some profiling with `ANSIBLE_CALLBACK_WHITELIST=profile_tasks` and, for more thorough debugging, with `ANSIBLE_DEBUG=1`. Also pay close attention to initial ssh-connection speed. – Konstantin Suvorov Dec 02 '16 at 20:42
  • Agree with @KonstantinSuvorov's comment - there could be a single host in the lot that is taking considerable time on a certain task. Once you isolate the task with profile_tasks, you can run just those tasks against your hosts and see which one takes the longest. You can also run a trivial task, like "command: w", against all of the hosts to check that it takes an expected amount of time. – andyhky Jan 19 '17 at 01:28
  • Check entropy starvation: `watch cat /proc/sys/kernel/random/entropy_avail` while the playbook is running. If it's less than, say, 1000 you have a potential issue; if it's less than, say, 64 and doesn't recover, then you have a definite entropy starvation issue (common in some VM environments). This applies to your management server and also the nodes you are managing. – Cameron Kerr Apr 26 '17 at 23:50
  • On my management VM with 4GB RAM, I have forks=20 and pipelining = True. `ansible -i all all -m ping` against over 300 hosts (mostly VMs) took rather less than 1 minute. Is your playbook doing anything to change user (become / sudo / etc.)? How does '-m ping' perform? Based on experience, I would say you want more memory for 50 forks. – Cameron Kerr Apr 26 '17 at 23:55
  • what is your target operating system? – xddsg Jul 26 '17 at 16:40
  • CentOS 6 & 7 are the target OSes. – user53814 Jul 27 '17 at 15:49
  • To answer earlier comments, there's no entropy starvation going on, the task profiling does not show one particular host or task eating the time, and an -m ping all takes about 6.5s wallclock, 4.7user, 2.9sys. – user53814 Jul 27 '17 at 16:00
  • @user53814 Based on your profiling, are you running hundreds and hundreds of other tasks that take fractions of a second? That profile adds up active tasks - what's adding up to the other 5minutes? – xddsg Aug 08 '17 at 15:28
  • @xddsg - I have no idea what takes up the other 5 minutes. That's surely the entire thrust of my question? All I know is that Ansible is driving the CPU to 135% while doing it. Is there some additional profiling output I'm missing? The full output (that I trimmed here) did not seem to contain anything useful. – user53814 Aug 10 '17 at 10:21
  • @user53814 I understand if you are unable to put it up somewhere, but might it be possible to see a video of the run, starting with the command and ending on the profile output? Or, failing that, copy the top 200 tasks rather than 20 (there is a variable in the profile_tasks callback file you can modify after you make a copy). Really want to understand what is happening. – xddsg Aug 14 '17 at 11:52
  • @xddsg - before I go any further, are you of the opinion that this level of performance is somehow unusual? The more I look around, the more I find people reporting very high deployment-host CPU usage on even medium-sized playbooks, and a couple of hours spent inside the Ansible source have me wondering if it isn't just expensively written from a core event-loop PoV. If my numbers are unusual I'll keep looking, but if the answer is "yes, Ansible uses a lot of CPU on the deployment host with medium-sized playbooks", there's no point continuing. – user53814 Aug 14 '17 at 16:19

1 Answer


In your ansible.cfg, set the following:

[defaults]

# profile each task
callback_whitelist = profile_tasks

# don't validate host keys (see http://docs.ansible.com/ansible/intro_configuration.html#host-key-checking)
host_key_checking = False

[ssh_connection]
pipelining = True

Also, in your playbook, set the strategy to 'free':

- hosts: all
  strategy: free
  tasks: [...]

Finally, disable fact gathering on your play with `gather_facts: false`.
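
Combined with the free strategy, the play header would look something like:

- hosts: all
  strategy: free
  gather_facts: false
  tasks: [...]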

If, after profiling, you are seeing a lot of this:

TASK [pip foo]
ok: [10.192.197.252] => (item=ansible)
ok: [10.192.197.252] => (item=boto)
ok: [10.192.197.252] => (item=boto3)
ok: [10.192.197.252] => (item=passlib)
ok: [10.192.197.252] => (item=cryptography)

squash those actions in ansible.cfg under [defaults]:

e.g. squash_actions = yum,pip,bar
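
To illustrate, a looped package task like the one below (package names are just examples) would normally run one yum transaction per item; with yum listed in squash_actions, Ansible collapses the loop into a single transaction:

- name: ensure core packages are present
  yum:
    name: "{{ item }}"
    state: present
  with_items:
    - vim-enhanced
    - wget
    - ntp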

xddsg
  • Thanks for the reply. We're already using strategy: free, and fact gathering is, I'm afraid, something the playbooks require, so that doesn't really work. As noted in my question, I'm already doing pipelining. – user53814 Jul 27 '17 at 15:48
  • @user53814 with the profiling turned on, what is taking the most time? can you update your question with the version of ansible you are using? – xddsg Aug 01 '17 at 10:03