88

I'm having some odd issues with my Ansible box (Vagrant).

Everything worked yesterday and my playbook worked fine.

Today, Ansible hangs on "gathering facts".

Here is the verbose output:

<5.xxx.xxx.xxx> ESTABLISH CONNECTION FOR USER: deploy
<5.xxx.xxx.xxx> REMOTE_MODULE setup
<5.xxx.xxx.xxx> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/vagrant/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'Port=2221', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=deploy', '-o', 'ConnectTimeout=10', '5.xxx.xxx.xxx', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968 && echo $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968'"]
Bj Blazkowicz
  • It hangs for how much time? Did you try `vagrant ssh` and investigate during the hang to see if there is anything useful in `ps` and `netstat`? Also, one of the first suspects in hangs is DNS - check if DNS is resolving from inside the virtual machine. – Antonis Christofides Sep 22 '14 at 09:13
  • Thanks for your comment. The solution was simple: vagrant destroy and vagrant up... I still think it's weird that it just stopped working. – Bj Blazkowicz Sep 22 '14 at 09:39
  • I had an issue with Ansible stalling out when there were inaccessible (CIFS) mounts. – rektide May 13 '15 at 19:46
  • Just had it happen; it was caused by an outdated host key in the known_hosts file. Weird that the connection didn't fail as is usual in this case. – GnP Aug 03 '15 at 19:51
  • Can you check the sshd logs in the Vagrant box? You may need to set "LogLevel DEBUG" in /etc/ssh/sshd_config; that may provide more info about what's going on. – Pablo Martinez Dec 14 '15 at 14:25
  • I looked at the answers below and didn't find anything there. The Ansible tmp/setup module was running as a Python process on the target box (not Vagrant, but a VM), but was taking a very long time and doing something very I/O heavy. I had to kill -9 it and wait about 5 minutes for it to stop. – Danny Staple Jan 06 '16 at 15:30
  • sudo apt install -y ansible sshpass will fix this – Shawn Aug 13 '21 at 16:36

18 Answers

69

I was having a similar issue with Ansible ping on Vagrant: it suddenly got stuck for no reason, having previously worked absolutely fine. Unlike an ssh or connectivity issue, it just hung forever with no timeout.

One thing I did to resolve this was to clean out the ~/.ansible directory, and it just worked again. I couldn't find out why, but it did get resolved.

If you get the chance to hit it again, try cleaning the ~/.ansible folder before you refresh your Vagrant box.
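For what it's worth, a minimal sketch of that cleanup, assuming the default per-user layout shown in the question's output; clearing just the SSH control sockets is often enough:

# Remove only the stale SSH control sockets (usually enough):
rm -rf ~/.ansible/cp
# Or wipe the whole per-user Ansible state, as described above:
rm -rf ~/.ansible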

yikaus
29

Ansible can hang like this for a number of reasons, usually because of a connection problem or because the setup module hangs. Here's how to narrow the problem down so you can solve it.

Ansible cannot connect to the destination host

Host Key (known_hosts) Problems

1) On older versions of Ansible (2.1 or older), Ansible would not always tell you if the host key for the destination does not exist on the source, or if there is a mismatch.

Solution: try opening an SSH connection with the same parameters to that destination. You may find SSH errors you need to resolve, and then the command will work.
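For example, you can replay roughly the same connection using the options from the question's -vvv output (the host, port, and user below come from that output; substitute your own):

ssh -vvv -o ControlMaster=auto -o ControlPersist=60s \
    -o ControlPath=$HOME/.ansible/cp/ansible-ssh-%h-%p-%r \
    -o Port=2221 -o User=deploy -o ConnectTimeout=10 5.xxx.xxx.xxx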

2) Sometimes Ansible displays an SSH connection message to you in the midst of other statuses, causing Ansible to "freeze" on that task:

Warning: the ECDSA host key for 'myhost' differs from the key for the IP address '10.10.1.10'
Offending key for IP in /etc/ssh/ssh_known_hosts:246
Matching host key in /etc/ssh/ssh_known_hosts:477
Are you sure you want to continue connecting (yes/no)?

In this case, simply typing "yes" for as many SSH questions as you were asked will permit the play to continue. Afterwards you can fix the root known_hosts problems.
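To clean up afterwards, removing the offending known_hosts entries is usually enough; a sketch using the names from the warning above (replace them with your own host and IP):

ssh-keygen -R myhost
ssh-keygen -R 10.10.1.10
# if the stale key is in the system-wide file, as in the warning above:
sudo ssh-keygen -f /etc/ssh/ssh_known_hosts -R 10.10.1.10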

Private Key Authentication Problems

If using key-based authentication vs password, other problems include:

  • Private key may not be set up properly on the destination
  • Private key might have incorrect permissions locally (should be readable only by the user running the Ansible job)

Solution: try running ansible -m ping <destination> -k against the problem host - if that doesn't work, try the Host Key Problems solutions above.

Ansible cannot quickly gather facts

The setup module (when run automatically at the beginning of an ansible-playbook run, or when run manually as ansible -m setup <host>) can often hang when gathering hardware facts (e.g. if getting disk information from hosts with high i/o, bad mount entries, etc.).

Solution: try running ansible -m setup -a 'gather_subset=!all' <destination> (quote the ! so your shell doesn't expand it). If this works, you should consider setting this line in your ansible.cfg:

gather_subset=!hardware
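For context, that line lives under the [defaults] section; a minimal ansible.cfg sketch (the file location and the extra timeout line are my own additions, so check the docs for your version):

# ansible.cfg next to your playbook (or ~/.ansible.cfg)
[defaults]
gather_subset = !hardware
# optionally, also give fact gathering a hard deadline (seconds)
gather_timeout = 10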
Jordan
  • Passing 'gather_subset=!hardware' to setup worked for a particular VM that was not responding. – JamesP Apr 27 '17 at 09:28
  • Fixed it for me. Dodgy mount points, I think. I had a VM that I used for Ansible provisioning and it worked until I added a new NFS share. Then it didn't, until I added the above. – David Boshton Oct 06 '17 at 23:45
  • Turned out to be a host key problem in my case. The host was reimaged, so my first run failed and I ran the suggested `ssh-keygen -R` command to remove the offending key. I ran ssh once to get the key added, but the second run was hanging. When I ran ssh again, I got the key confirmation prompt, which was unexpected. I realized that there was still an offending key that needed to be removed, so after removing that and rerunning ssh, I got the `Warning: Permanently added the ECDSA host key ...` message, and only then did the fact gathering continue. – haridsv Oct 12 '18 at 11:08
  • I can confirm the observation from @DavidBoshton. Had this issue on a VM that had NFS directories mounted that weren't available (NFS server problem). After fixing the NFS server it worked. – tschale Nov 14 '18 at 15:14
  • it can also be that the private ssh key is protected by a password and that key was not added to ssh agent (check with `ssh-add -l`) – Thomasleveil May 11 '20 at 13:18
27

For me the setup module was stuck on a dead NFS mount.

If you run "df" on your machine and nothing happens, you may be in the same situation.

PS: if you can't unmount the NFS share/mountpoint, consider the (admittedly ugly) "umount -l".
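A quick way to confirm this is to wrap df in a timeout and then lazy-unmount whatever hangs; the mount point below is just a placeholder:

# If df itself hangs, one of the mounts is likely dead
timeout 5 df -h || echo "some filesystem is not responding"
# Lazy-unmount the suspect NFS share (replace the path with yours)
sudo umount -l /mnt/stale-nfs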

25

There are many reasons why ansible may hang at fact gathering, but before going any further, here is the first test you should run in any such situation:

ansible -m ping <hostname>

This test just connects to the host and executes enough code to return:

<hostname> | SUCCESS => {
    "changed": false, 
    "ping": "pong"
}

If this works, you can pretty much rule out any setup or connectivity issue, as it proves that you could resolve target hostname, open a connection, authenticate, and execute an ansible module with the remote python interpreter.

Now, here is a (non-exhaustive) list of things that can go wrong at the beginning of a playbook:

The command executed by ansible is waiting for an interactive input

I can remember this happening on older ansible versions, where a command would wait for an interactive input that would never come, such as a sudo password (when you forgot the -K switch), or acceptance of a new ssh host fingerprint (for a new target host).

Modern versions of ansible handle both these cases gracefully and raise an error immediately for normal use cases, so unless you're doing things such as calling ssh or sudo yourself, you shouldn't have this kind of issue. And even if you did, it would happen after fact gathering.

Dead ssh master connection

There are some very interesting options passed to the ssh client in the debug log given here:

  • ControlMaster=auto
  • ControlPersist=60s
  • ControlPath=/home/vagrant/.ansible/cp/ansible-ssh-%h-%p-%r

These options are documented in man ssh_config.

By default, ansible will try and be smart regarding its ssh connection use. For a given host, instead of creating a new connection for each and every task in the play, it will open it once, and keep it open for the whole playbook (and even across playbooks).

That's good, as establishing a new connection is far slower and more computation-intensive than using an already existing one.

In practice, every ssh connection will check for the existence of a socket at ~/.ansible/cp/some-host-specific-path. The first connection cannot find it, so it connects normally, and then creates it. Every subsequent connection will then just use this socket to go through the already established connection.

When the established connection eventually times out and closes after being idle for long enough, the socket is removed too, and we're back to square one.

So far so good.

Sometimes, however, the connection actually dies, but the ssh client still considers it established. This typically happens when you execute the playbook from your laptop and lose your WiFi connection (or switch from WiFi to Ethernet, etc.).

This last example is a terrible situation: you can ssh to the target machine with a default ssh config, but as long as your previous connection is still considered active, ansible won't even try establishing a new one.

At this point, we just want to get rid of this old socket, and the simplest way to do that is to remove it:

# Delete all the current sockets (may disrupt currently running playbooks)
rm -r ~/.ansible/cp
# Delete only the affected socket (requires knowing which one it is)
rm ~/.ansible/cp/<replace-by-your-socket>

This is perfect for a one-shot fix, but if it happens too often, you may need to look for a longer-term fix. Here are some pointers that might help towards this goal:

  • Start playbooks from a server (with a network connection far more stable than your laptop's)
  • Use ansible configuration, or the ssh client configuration directly, to disable connection sharing (see the sketch after this list)
  • Use the same settings instead to fine-tune timeouts, so that a crashed master connection actually times out faster
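As an illustration of the last two points, here is a sketch of the relevant ansible.cfg settings (option names as documented for the [ssh_connection] section; the values are examples to adjust):

[ssh_connection]
# disable connection sharing entirely...
ssh_args = -o ControlMaster=no
# ...or keep it, but make a dead master connection time out faster:
# ssh_args = -o ControlMaster=auto -o ControlPersist=30s -o ServerAliveInterval=5 -o ServerAliveCountMax=2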

Please note that at the time of writing, a few options have changed (for example, my latest run gave me ControlPath=/home/toadjaune/.ansible/cp/871b533295), but the general idea is still valid.

Fact gathering actually taking too much time

At the beginning of every play, ansible collects a lot of information about the target system and puts it into Facts. These are variables that you can then use in your playbook, and they're usually really handy, but sometimes gathering this info can take a very long time (bad mount points, disks with high i/o, high load…).

This being said, you don't strictly need facts to run a playbook, and almost certainly not all of them, so let's try to disable what we don't need. Several options for that:

For debugging purposes, it is really convenient to invoke the setup module directly from the command-line:

ansible -m setup <hostname>

This command should hang just like your playbook does, and eventually time out (or succeed). Now, let's execute the module again, disabling everything we can:

ansible -m setup -a gather_subset='!all' <hostname>

If this still hangs, you can always try disabling the module entirely in your play, but it's really likely that your problem is somewhere else.

If, however, it works fine (and quickly), then have a look at the module documentation. You have two options:

  • Limit the fact gathering to a subset, excluding what you don't need (see possible values for gather_subset, and the sketch after this list)
  • gather_timeout can also help, by allowing more time (although that would fix a timeout error, not a hang)
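As a sketch of the first option applied at the play level (keyword names as in recent ansible versions; double-check them against yours):

- hosts: all
  gather_facts: true
  gather_subset:
    - '!hardware'    # skip the slow hardware facts (disks, mounts, ...)
  gather_timeout: 30 # give the remaining facts a bit more time before erroring out
  tasks:
    - debug:
        var: ansible_distribution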

Other issues

Obviously, other things can go wrong. A few pointers to help with debugging:

  • Use ansible's maximum verbosity level (-vvvv), as it will show you every command executed
  • Use ping and setup modules directly from the command-line as explained above
  • Try to ssh manually if ansible -m ping doesn't work
toadjaune
10

I had a similar issue with Ansible hanging at Gathering Facts. I pared my playbook down to nothing but a prompt, with no tasks or roles, and it still hung.

I found 12 hung ansible processes in my process list that had accumulated over the day.

/usr/bin/python /tmp/ansible_Jfv4PA/ansible_module_setup.py
/usr/bin/python /tmp/ansible_M2T10L/ansible_module_setup.py

Once I killed those, it started working again.
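If it happens again, something along these lines should find and clean them up in one go (the pattern matches the process names shown above):

# list leftover setup-module processes, then kill them
pgrep -af ansible_module_setup.py
pkill -9 -f ansible_module_setup.py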

Tim Moses
  • Well, sometimes I start Ansible, then kill it at the beginning, but the SSH connection stays active/alive; this answer helped me a lot. – mik3fly-4steri5k Dec 27 '20 at 16:00
4

Dmytro is on to something!

Ansible uses the FQDN of the host. If your host is not DNS-resolvable and you don't have a mapping in /etc/hosts, Ansible will wait for the DNS lookup to time out.

By adding ::1 <fqdn> to the hosts file of the machines you are connecting to, Ansible will resolve the FQDN immediately without going through DNS.

Note that the host should look up names from /etc/hosts first; this is the default for most, if not all, Linux systems, but if you're editing /etc/nsswitch.conf as well, that might be an issue.
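A sketch of the /etc/hosts entry described above, with a placeholder FQDN:

# /etc/hosts on the target machine
127.0.0.1   localhost
::1         myhost.example.com   myhost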

lafka
3

I had the same issue. Got no useful information from running ansible in verbose mode.

The server was re-provisioned before running the playbook.

Removing the server from the known hosts list with the commands below fixed this.

$ ssh-keygen -f ~/.ssh/known_hosts -R <hostname>
$ ssh-keygen -f ~/.ssh/known_hosts -R <ip_address>

Note: you need to remove both the hostname and the IP address.

rleon
3

I've fixed the cause of this issue by following the advice from the "Why my ansible-playbook hangs in 'Gathering facts'?" blog post.

It can be simplified to:

  1. Set DEFAULT_KEEP_REMOTE_FILES=yes to preserve the commands and enable -vvvv

  2. Run the playbook again.

  3. When the play gets stuck, copy the last shell command printed (the part after /bin/sh -c).

  4. Log on to the server via ssh.

  5. Use strace to replay the last step of the play. The step command is copied from the -vvv output. For example: strace -f /bin/sh -c "echo BECOME-SUCCESS-ltxvshvezrnmumzdprccoiekhjheuwxt; /usr/bin/python /home/user/.ansible/tmp/ansible-tmp-1527099315.31-224479822965785/setup.py"

  6. Check which call the "straced" step is stuck on and fix it :)

In my case it was an inaccessible network drive...
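For steps 1 and 2 above, the quickest route I know is the environment-variable form of that setting (the playbook name is a placeholder):

# keep the generated module files on the target and run with maximum verbosity
ANSIBLE_KEEP_REMOTE_FILES=1 ansible-playbook -vvvv playbook.yml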

Yuri
2

Folks are very unlikely to run into the scenario that caused this issue for me but just in case... My playbook would run fine once but on subsequent runs it would get stuck at Gathering facts and then timeout. I eventually figured out that one of my tasks was configuring /opt/rh/devtoolset-8/enable to run via an /etc/profile.d link. Well, well, well all of a sudden sudo now refers to /opt/rh/devtoolset-8/root/usr/bin/sudo and it does not recognize the parameters that Ansible tries to use.

Keith Hill
1

I don't know if you are using a sudo playbook, but I was, and it was hanging on the sudo password.

From the documentation - you can kill that, and then use -K as well.

Good luck.

Rcynic
1

Maybe the fingerprint of your target system has changed, for example after you reinstalled the server OS. You have to delete the entries in known_hosts; Ansible will not tell you that an untrusted entry is the issue, it just gets stuck exactly as you describe.

Schroeffu
1

It sounds like Ansible is unable to authenticate, so use -k (SSH password) or -K (sudo/become password) to have Ansible prompt for it, as shown below:

ansible-playbook  -K -i hosts playbook.yml -vvvv
0x3bfc
1

In my case Ansible stopped working in the middle of a task. The reason was that my ssh-agent had stopped working (ssh-add -l was not returning anything). I restarted everything and it worked again. So check that your ssh-agent is working properly (ssh-add -l should not get stuck).
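A quick sanity check along those lines (the key path at the end is just an example):

# the agent should answer immediately and list your keys
ssh-add -l
# if it hangs or errors out, restart the agent and re-add your key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa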

Vasco
0

I had this problem on a new macOS install. I'd just installed /Applications/Xcode.app/ from XIP but hadn't accepted the license yet. (I think I had accepted the Xcode Command Line Tools license, but running xcode-select -p showed /Applications/Xcode.app/Contents/Developer so not sure if that's relevant).

I could SSH to the new machine, but when I ran /usr/bin/python3 it said I needed to accept the Xcode license. After I ran the command with sudo, pressed space many times, and typed agree, python3 worked as expected and Ansible started working.
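If I remember the tooling correctly, the non-interactive equivalent should be something like this (treat it as an assumption and verify on your machine):

# accept the Xcode license without paging through it interactively
sudo xcodebuild -license accept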

(It seems both Windows and macOS are guilty of shipping a python binary that just prompts you to install something: useful for GUI users but very unhelpful in automation...)

Carl Walsh
0

An FQDN and hostname mismatch can also cause Ansible to hang. I had used an FQDN whose domain differed from the hostname's domain. After making both equal, Ansible worked perfectly. Possibly Ansible compares the FQDN and hostname before executing tasks on the remote host. Hope it helps!

0

I solved this issue by resetting the Vagrant box:

vagrant destroy
vagrant up
Quanlong
0

The sudo password is the problem. Make sure (1) that you can run 'sudo anything' in a newly opened terminal (where the password is not cached) without providing one, and (2) that Puppet hasn't reverted your earlier manual 'sudoers' changes.

witkacy26
0

Deleting ~/.ansible alone didn't do it for me. To check what's in that directory, I did a Ctrl-Z (put the process to sleep), had a look, and then resumed the Ansible process via fg. I didn't delete anything in that case, but afterwards it just continued. So I then tried Ctrl-Z followed by fg on its own, and that also worked. Feels like a rain dance, but if someone else is stuck, please also try that.

erikbstack