When I run three mesos-master instances with QUORUM=2, each one fails about 1 minute after being elected leader, with errors like:

E1015 11:50:35.539562 19150 socket.hpp:174] Shutdown failed on fd=25: Transport endpoint is not connected [107]

E1015 11:50:35.539897 19150 socket.hpp:174] Shutdown failed on fd=24: Transport endpoint is not connected [107]

They keep electing one another in a loop, consistently failing and re-electing.

If I set QUORUM=1, everything works well. What could be the reason for this?

aladagemre
  • Are you the one who sent an email to the Mesos mailing list? Has it been solved already? – haosdent Oct 16 '15 at 11:28
  • Yes, that's me. One problem was that the firewall was blocking access to the servers' public IPs while ZooKeeper was broadcasting the public IP (set in advertise_ip), so the nodes couldn't connect to each other. Slaves also couldn't connect to the masters, failing with the same error. When I removed the firewall rule and set advertise_ip to the local IP, the slaves could connect. But I haven't tried QUORUM=2 yet. – aladagemre Oct 16 '15 at 16:13
  • Sounds great. If you finally solve the problem or run into a new one, please also send it to the mailing list so that others can learn from your case. Thank you. :-) – haosdent Oct 17 '15 at 17:04
  • That's nice to hear :) I'll post updates to the mailing list for sure. Hope I can find a solution. Thanks! – aladagemre Oct 17 '15 at 19:59
  • I have the same problem. This is not a good idea, but when I add the other nodes' IPs to `/etc/hosts`, everything works fine. – Majid Hajibaba Oct 26 '15 at 07:38

3 Answers

One problem was that the AWS firewall was blocking access to the servers' public IPs while ZooKeeper was broadcasting the public IP (set in advertise_ip), so the nodes couldn't connect to each other. Slaves also couldn't connect to the masters, failing with the same error.

When I set advertise_ip to the local IP (so that ZooKeeper broadcast local IPs), the masters could communicate and QUORUM=2 worked. When I removed the firewall rule, the slaves could connect to the master.
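
A quick way to confirm this kind of problem is to test, from each master, whether the address every other master advertises actually accepts TCP connections. Below is a minimal sketch, assuming the masters listen on the Mesos default port 5050; `can_connect` and `check_masters` are hypothetical helper names, not part of Mesos.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_masters(addresses, port=5050, timeout=3.0):
    """Map each advertised address to whether it is reachable on the given port.

    5050 is the default mesos-master port; change it if your setup differs.
    """
    return {host: can_connect(host, port, timeout) for host in addresses}
```

For example, `check_masters(["10.0.0.11", "10.0.0.12", "10.0.0.13"])` (placeholder addresses) run on each master should return `True` for every entry; a `False` for the advertised IP points to a firewall or advertise_ip mismatch like the one described here.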

aladagemre

We had a similar problem yesterday: Marathon was behaving oddly because some applications were not being deployed. The strange part was that the application came up but its health check never turned green, so nixy wasn't updating nginx.

After a lot of investigation we came to this very same error:

E0718 18:51:05.836688  5049 socket.hpp:107] Shutdown failed on fd=46: Transport endpoint is not connected [107]

In the end we discovered that the problem was in the election: even though we run QUORUM=1 (we have 2 masters), the cluster somehow lost track of itself and one master wasn't communicating with the other.

To solve this we triggered a new election using the DELETE method on the Marathon API endpoint /v2/leader, and everything worked fine after that.
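
The call can be scripted. This is a minimal sketch using only the Python standard library; the base URL is a placeholder, the helper names are hypothetical, and it assumes Marathon is reachable without authentication.

```python
import urllib.request


def build_abdicate_request(marathon_url):
    """Build a DELETE request for Marathon's /v2/leader endpoint."""
    url = marathon_url.rstrip("/") + "/v2/leader"
    return urllib.request.Request(url, method="DELETE")


def abdicate_leader(marathon_url, timeout=10.0):
    """Ask the current Marathon leader to step down, triggering a new election.

    Returns the HTTP status code and the response body as text.
    """
    req = build_abdicate_request(marathon_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Calling `abdicate_leader("http://marathon.example.com:8080")` (placeholder host) sends the request to whichever instance answers at that address.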

We had the same problem, with the mesos-master log flooded with messages like:

mesos-master[27499]: E0616 14:29:39.310302 27523 socket.hpp:174] Shutdown failed on fd=67: Transport endpoint is not connected [107]

It turned out to be the load balancer's health check to /stats.json.
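
For context, the `[107]` in the message is the errno, which on Linux is ENOTCONN ("Transport endpoint is not connected"). The sketch below only shows how that errno arises when `shutdown` is called on a socket with no connected peer. It does not reproduce the Mesos code path, and `shutdown_unconnected` is a hypothetical name.

```python
import errno
import socket


def shutdown_unconnected():
    """Call shutdown() on a TCP socket that was never connected.

    Returns the errno raised by the OS, or None if no error occurred.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.shutdown(socket.SHUT_RDWR)
    except OSError as exc:
        return exc.errno
    finally:
        s.close()
    return None


if __name__ == "__main__":
    code = shutdown_unconnected()
    print(code == errno.ENOTCONN)
```

So the log line by itself only says that the peer was already gone when the master tried to shut the socket down, which fits a health check that connects and disconnects quickly.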

Tarwin