
I have a Salt 2016.11.3 (Carbon) playground with a master in DigitalOcean and four minions in Azure (three Ubuntu and one Windows).

After a while the Ubuntu minions stop responding to `salt -t 30 '*' test.ping`, but they are online (I can SSH into them).

Restarting the master (`systemctl restart salt-master`) or the minions (`systemctl restart salt-minion`) seems to bring the minions back for a while.
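
For reference, this is roughly how I check reachability before restarting anything (a minimal sketch; `manage.status` and `manage.down` are the standard Salt runners, and the 60-second timeout is just an example):

# on the master: which minions answered / did not answer the last ping
salt-run manage.status
salt-run manage.down
# ping again with a longer timeout to rule out plain slowness
salt -t 60 '*' test.ping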

Things checked:

  • Azure machines are put to sleep and only woken up on external events
  • The network between the two clouds is very slow
  • the salt-master machine is too small
  • the salt minions do not ask the master for "work"
  • salt-master hangs for some reason
  • salt-minion communication error ✔

Also, after a restart I get a double response from re-added nodes, but I think this is a cache problem because it disappears after some time (cache invalidation).

  • No, machines are not put to sleep. `The network between the two clouds is very slow` - too vague – 4c74356b41 Apr 11 '17 at 10:28
  • `salt -t 30` means wait 30 seconds for an answer from the minions. If the response does not arrive within 30 seconds, something is very slow. – Alex Proca Apr 11 '17 at 10:53
  • Not sure where your guesses came from, as Azure VMs don't magically go to sleep. If you started them, they're running. Which you've already proven because you were able to `ssh` into them. Same with network speed - I can't imagine that would ever be your issue, esp. for a connectivity test. Have you looked at how the minions listen for traffic, and how they deal with ports that eventually close due to timeout, and then retry? – David Makogon Apr 11 '17 at 12:29
  • Struck through my wrong guesses (or else all Azure people on Stack Overflow will downvote my question :D). Looking at the SaltStack minions now. – Alex Proca Apr 11 '17 at 18:19
  • 1
    first, check both salt-master and minion running the same version. – mootmoot Apr 12 '17 at 15:34
  • Same version. I created identical servers. – Alex Proca Apr 12 '17 at 19:33
  • Have you read the logs? Usually, when a minion is not controllable anymore there is something written in the corresponding logfile - either the master or the minion encounters an error. On your (Ubuntu) minions check `/var/log/salt/minion` and on your master `/var/log/salt/master`. Please add error output that seems to be related to your question. – dahrens Apr 13 '17 at 13:30
  • It seems like it's a tornado ioloop error on all three Ubuntu minions: `2017-04-12 19:46:25,193 [tornado.application][ERROR ][7093] Exception in callback ` but the ping response is as non-deterministic as usual. – Alex Proca Apr 13 '17 at 20:03

1 Answer


It seems it was a communication error. There is an older (2013) bug report on the SaltStack GitHub repo, and someone states in its comments that AWS and Azure load balancers don't respect TCP keepalives.

Suggested solutions:

  1. add a cron job on the master to ping the minions each minute (see the sketch after this list)
  2. change the TCP keepalive settings in the Azure minions' config file
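
A minimal sketch of option 1, in case someone prefers it (the path and schedule are just examples; the idea is that a regular job on the master keeps the master-minion connections from sitting idle):

# /etc/cron.d/salt-ping (example path) - ping all minions every minute so the connections never go idle
* * * * * root salt '*' test.ping > /dev/null 2>&1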

So far, solution #2 works for me:

tcp_keepalive: True
tcp_keepalive_idle: 60
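
For completeness, here is the relevant section of /etc/salt/minion with all the related options spelled out. Only `tcp_keepalive` and `tcp_keepalive_idle` are what I actually changed; the interval and count values below are illustrative, and the claim that the cloud load balancer drops idle connections comes from the comments on the linked issue, not from my own measurement.

# /etc/salt/minion
tcp_keepalive: True        # enable TCP keepalives on the minion's connections to the master
tcp_keepalive_idle: 60     # start sending probes after 60s of idle, well under typical idle timeouts
tcp_keepalive_intvl: 10    # seconds between probes (illustrative value)
tcp_keepalive_cnt: 5       # failed probes before the connection is declared dead (illustrative value)

After editing the file, restart the minion with `systemctl restart salt-minion`.
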
Alex Proca