There are many reasons why ansible may hang at fact gathering, but before going any further, here is the first test you should run in any such situation:
ansible -m ping <hostname>
This test just connects to the host and executes enough code to return:
<hostname> | SUCCESS => {
"changed": false,
"ping": "pong"
}
If this works, you can pretty much rule out any setup or connectivity issue, as it proves that you could resolve the target hostname, open a connection, authenticate, and execute an ansible module with the remote python interpreter.
Now, here is a (non-exhaustive) list of things that can go wrong at the beginning of a playbook:
The command executed by ansible is waiting for an interactive input
I can remember this happening on older ansible versions, where a command would wait for an interactive input that would never come, such as a sudo password (when you forgot a -K switch), or the acceptance of a new ssh host fingerprint (for a new target host).
Modern versions of ansible handle both of these cases gracefully and raise an error immediately in normal use cases, so unless you're doing things such as calling ssh or sudo yourself, you shouldn't run into this kind of issue. And even if you did, it would happen after fact gathering.
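If you do end up fighting interactive prompts anyway, here is a quick sketch of how to preempt them (my-playbook.yml is just a placeholder, and disabling host key checking is a security trade-off you should weigh):
# Ask for the privilege escalation (sudo) password up front
ansible-playbook -K my-playbook.yml
# Skip the interactive host fingerprint prompt for new hosts
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook my-playbook.yml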
Dead ssh master connection
There are some very interesting options passed to the ssh client, visible in the debug log:
ControlMaster=auto
ControlPersist=60s
ControlPath=/home/vagrant/.ansible/cp/ansible-ssh-%h-%p-%r
These options are documented in man ssh_config.
By default, ansible tries to be smart about its ssh connection usage. For a given host, instead of creating a new connection for each and every task in the play, it opens one and keeps it open for the whole playbook (and even across playbooks).
That's good, as establishing a new connection is far slower and more computation-intensive than reusing an existing one.
In practice, every ssh connection will check for the existence of a socket at ~/.ansible/cp/some-host-specific-path.
The first connection cannot find it, so it connects normally, and then creates it.
Every subsequent connection will then just use this socket to go through the already established connection.
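If you want to see this mechanism in action, here is a rough sketch (the exact socket names depend on your ansible version and configuration):
# List the control sockets currently present
ls ~/.ansible/cp/
# Ask the ssh client whether the master connection behind a socket is still alive
ssh -S ~/.ansible/cp/<replace-by-your-socket> -O check <hostname>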
If the established connection eventually times out and closes after being idle for long enough, the socket goes away too, and we're back to square one.
So far so good.
Sometimes, however, the connection actually dies, but the ssh client still considers it established. This typically happens when you execute the playbook from your laptop and you lose your WiFi connection (or switch from WiFi to Ethernet, etc.).
This last example is a terrible situation: you can ssh to the target machine with a default ssh config, but as long as your previous connection is still considered active, ansible won't even try to establish a new one.
At this point, we just want to get rid of this old socket, and the simplest way to do that is to remove it:
# Delete all the current sockets (may disrupt currently running playbooks)
rm -r ~/.ansible/cp
# Delete only the affected socket (requires knowing which one it is)
rm ~/.ansible/cp/<replace-by-your-socket>
This is perfect as a one-shot fix, but if it happens too often, you may need to look for a longer-term solution. Here are some pointers that might help towards this goal:
- Start playbooks from a server (with a network connection far more stable than your laptop's)
- Use the ansible configuration, or the ssh client configuration directly, to disable connection sharing (see the sketch after this list)
- Use those same settings to fine-tune timeouts instead, so that a dead master connection actually times out faster
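As a sketch of the last two points, assuming you go through ansible.cfg (the values are only examples to adapt to your environment):
# ansible.cfg
[ssh_connection]
# Disable connection sharing entirely...
ssh_args = -o ControlMaster=no -o ControlPath=none
# ...or keep it, but make a dead master connection time out faster:
# ssh_args = -o ControlMaster=auto -o ControlPersist=30s -o ServerAliveInterval=10 -o ServerAliveCountMax=3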
Please note that at the time of writing, a few options have changed (for example, my latest run gave me ControlPath=/home/toadjaune/.ansible/cp/871b533295), but the general idea is still valid.
Fact gathering actually taking too much time
At the beginning of every play, ansible collects a lot of information about the target system and stores it in facts. These are variables that you can then use in your playbook, and they are usually really handy, but sometimes gathering this information can take a very long time (bad mount points, disks with high i/o, high load…).
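As a minimal example of what facts look like in use (the fact names below are standard ones, but check them against your own setup output), a task could read:
- debug:
    msg: "Running on {{ ansible_distribution }} {{ ansible_distribution_version }}"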
That being said, you don't strictly need facts to run a playbook, and almost certainly not all of them, so let's try to disable what we don't need. There are several options for that:
- Completely disable the setup module (see the sketch after this list)
- Configure the setup module to gather only certain parts of the facts
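Here is a minimal sketch of the first option, a play that skips fact gathering entirely:
- hosts: all
  gather_facts: no
  tasks:
    - ping: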
For debugging purposes, it is really convenient to invoke the setup module directly from the command line:
ansible -m setup <hostname>
This command should hang just like your playbook does, and eventually time out (or succeed). Now, let's execute the module again, disabling everything we can:
ansible -m setup -a gather_subset='!all' <hostname>
If this still hangs, you can always try to disable the module entirely in your play, but it's really likely that your problem lies elsewhere.
If, however, it works fine (and quickly), then have a look at the module documentation. You have two options:
- Limit fact gathering to a subset, excluding what you don't need (see the possible values for gather_subset; a configuration sketch follows this list)
- gather_timeout can also help you fix your issue by allowing more time (although that would fix a timeout error, not a hang)
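As a configuration sketch for these two options ('!hardware' and '30' are only example values, hardware facts simply being a frequent culprit for slowness):
# ansible.cfg
[defaults]
gather_subset = !hardware
gather_timeout = 30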
Other issues
Obviously, other things can go wrong. Here are a few pointers to help with debugging:
- Use ansible's maximum verbosity level (-vvvv), as it will show you every command executed
- Use the ping and setup modules directly from the command line, as explained above
- Try to ssh manually if ansible -m ping doesn't work (a sketch of such a test follows this list)
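For the last point, here is a sketch of a manual ssh test that bypasses any shared master connection, so you exercise a brand new connection just like ansible would have to:
# Verbose ssh, ignoring any existing control socket
ssh -v -o ControlMaster=no -o ControlPath=none <hostname> 'echo ok'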