
I have a Salt 2016.11.3 (Carbon) playground with a master in DigitalOcean and four minions in Azure (three Ubuntu and one Windows).

After a while the Ubuntu minions stop responding to `salt -t 30 '*' test.ping`, but they are online (I can SSH into them).

Restarting the master (`systemctl restart salt-master`) or the minions (`systemctl restart salt-minion`) seems to bring the minions back for a while.
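
For reference, this is roughly how I check reachability before restarting anything (a minimal sketch; `manage.status` and `manage.down` are the standard Salt runners, and the 60-second timeout is just an example):

# on the master: which minions answered / did not answer the last ping
salt-run manage.status
salt-run manage.down
# ping again with a longer timeout to rule out plain slowness
salt -t 60 '*' test.ping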

Things checked:

  • Azure machines are put to sleep and only woken up on external events
  • The network between the two clouds is very slow
  • the salt-master machine is too small
  • the salt minions do not ask the master for "work"
  • salt-master hangs for some reason
  • salt-minion communication error ✔

Also, after a restart I get a double response from re-added nodes, but I think this is a cache problem because it disappears after some time (cache invalidation).

  • No, machines are not put to sleep. `The network between the two clouds is very slow` - too vague – 4c74356b41 Apr 11 '17 at 10:28
  • `salt -t 30` means wait 30 seconds for an answer from the minions. If the response does not arrive within 30 seconds, something is very slow. – Alex Proca Apr 11 '17 at 10:53
  • Not sure where your guesses came from, as Azure VMs don't magically go to sleep. If you started them, they're running. Which you've already proven because you were able to `ssh` into them. Same with network speed - I can't imagine that would ever be your issue, esp. for a connectivity test. Have you looked at how the minions listen for traffic, and how they deal with ports that eventually close due to timeout, and then retry? – David Makogon Apr 11 '17 at 12:29
  • Struck through my wrong guesses (or else all Azure people on Stack Overflow will downvote my question :D). Looking at the SaltStack minions now. – Alex Proca Apr 11 '17 at 18:19
  • 1
    first, check both salt-master and minion running the same version. – mootmoot Apr 12 '17 at 15:34
  • Same version. I created identical servers. – Alex Proca Apr 12 '17 at 19:33
  • Have you read the logs? Usually, when a minion is not controllable anymore there is something written in the corresponding logfile - either the master or the minion encounters an error. On your (Ubuntu) minions check `/var/log/salt/minion` and on your master `/var/log/salt/master`. Please add error output that seems to be related to your question. – dahrens Apr 13 '17 at 13:30
  • It seems like it's a tornado ioloop error on all three Ubuntu minions: `2017-04-12 19:46:25,193 [tornado.application][ERROR ][7093] Exception in callback ` but the ping response is as non-deterministic as usual. – Alex Proca Apr 13 '17 at 20:03

1 Answer


It seems it was a communication error. There is an older (2013) bug report on the SaltStack GitHub repo, and someone states in its comments that AWS and Azure load balancers don't respect TCP keepalives.

Suggested solutions:

  1. add a cron job on the master to ping the minions each minute (see the sketch after this list)
  2. change the TCP keepalive settings in the Azure minions' config file
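
A minimal sketch of option 1, in case someone prefers it (the path and schedule are just examples; the idea is that a regular job on the master keeps the master-minion connections from sitting idle):

# /etc/cron.d/salt-ping (example path) - ping all minions every minute so the connections never go idle
* * * * * root salt '*' test.ping > /dev/null 2>&1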

So far, solution #2 works for me:

tcp_keepalive: True
tcp_keepalive_idle: 60
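
For completeness, here is the relevant section of /etc/salt/minion with all the related options spelled out. Only `tcp_keepalive` and `tcp_keepalive_idle` are what I actually changed; the interval and count values below are illustrative, and the claim that the cloud load balancer drops idle connections comes from the comments on the linked issue, not from my own measurement.

# /etc/salt/minion
tcp_keepalive: True        # enable TCP keepalives on the minion's connections to the master
tcp_keepalive_idle: 60     # start sending probes after 60s of idle, well under typical idle timeouts
tcp_keepalive_intvl: 10    # seconds between probes (illustrative value)
tcp_keepalive_cnt: 5       # failed probes before the connection is declared dead (illustrative value)

After editing the file, restart the minion with `systemctl restart salt-minion`.
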
Alex Proca