
I have a setup with 3 Mesos masters and 3 Mesos slaves. After making all the required configurations, I can see that the 3 Mesos masters form a cluster, coordinated by ZooKeeper.

Now I have set up the 3 Mesos slaves, and when I start the mesos-slave service I expect the slaves to become visible on the Mesos masters' web UI. But I cannot see any of them in the Slaves tab.

SELinux, the firewall, and iptables are all disabled, and SSH works between the nodes.

[cloud-user@slave1 ~]$ sudo systemctl status mesos-slave -l
   mesos-slave.service - Mesos Slave
   Loaded: loaded (/usr/lib/systemd/system/mesos-slave.service; enabled)
   Active: active (running) since Sat 2016-01-16 16:11:55 UTC; 3s ago
   Main PID: 2483 (mesos-slave)
   CGroup: /system.slice/mesos-slave.service
           ├─2483 /usr/sbin/mesos-slave --master=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins
           ├─2493 logger -p user.info -t mesos-slave[2483]
           └─2494 logger -p user.err -t mesos-slave[2483]

Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628670  2497 detector.cpp:482] A new leading master (UPID=master@127.0.0.1:5050) is detected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628732  2497 slave.cpp:729] New master detected at master@127.0.0.1:5050
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628825  2497 slave.cpp:754] No credentials provided. Attempting to register without authentication
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628844  2497 slave.cpp:765] Detecting new master
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628872  2497 status_update_manager.cpp:176] Pausing sending status updates
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.628922  2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093  2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: W0116 16:11:55.629107  2502 slave.cpp:3218] Master disconnected! Waiting for a new master to be elected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.983531  2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:57 slave1.novalocal mesos-slave[2494]: E0116 16:11:57.465049  2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Sunil

1 Answer


So the problematic line is:

Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093  2502 slave.cpp:3215] master@127.0.0.1:5050 exited

Specifically, note that the master is detected as having the IP address 127.0.0.1. The Mesos agent[1] sees that address and tries to connect to it, which fails because the master isn't actually running on the same machine as the agent.

This happens because each master announces what it thinks its IP address is into ZooKeeper. In your case, the master thinks its IP is 127.0.0.1 and stores that into ZK. Mesos has several configuration flags to control this behavior, mainly --hostname, --no-hostname_lookup, --ip, and --ip_discovery_command, plus the environment variable LIBPROCESS_IP. See http://mesos.apache.org/documentation/latest/configuration/ for details about what each one does.
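
You can verify this directly in ZooKeeper. A quick check, assuming the stock zkCli.sh that ships with ZooKeeper and the /mesos znode path from your --master flag (exact znode names vary by Mesos version; recent versions write json.info_NNNNNNNNNN entries):

$ zkCli.sh -server 10.0.0.2:2181 ls /mesos
$ zkCli.sh -server 10.0.0.2:2181 get /mesos/json.info_0000000001

If the stored master info shows 127.0.0.1, that confirms the diagnosis: the master registered its loopback address, and every agent dutifully tries to connect to it there.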

The best thing you can do to make things work out of the box is to make sure the machines have resolvable hostnames. Mesos does a DNS lookup of the box's own hostname in order to figure out which IP others will contact it on.
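
A quick sanity check on each box:

$ hostname -f                    # should print the machine's FQDN
$ getent hosts "$(hostname -f)"  # must return the routable IP, not 127.0.0.1

A common culprit on cloud images is an /etc/hosts line mapping the machine's hostname to 127.0.0.1, which produces exactly the symptom above.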

If you can't get the hostnames set up properly, I would recommend setting --hostname and --ip manually, which will cause Mesos to announce exactly what you want.
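
For example, with the master addresses from your --master flag (master1.example.com is a placeholder; adjust per machine). If you installed the Mesosphere packages, which your systemd unit suggests, the init wrapper turns one-line files under /etc/mesos-master and /etc/mesos-slave into command-line flags:

$ echo 10.0.0.2 | sudo tee /etc/mesos-master/ip
$ echo master1.example.com | sudo tee /etc/mesos-master/hostname
$ sudo systemctl restart mesos-master

Do the same on each agent with /etc/mesos-slave/ip and /etc/mesos-slave/hostname, then restart mesos-slave. If you start Mesos by hand instead, passing --ip=10.0.0.2 --hostname=master1.example.com on the command line has the same effect.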

[1] The Mesos slave has been renamed to agent; see: https://issues.apache.org/jira/browse/MESOS-1478

  • That's exactly what I'd have written myself :-) – Tobi Jan 18 '16 at 09:14
  • Thanks Firebird, the issue is fixed. – Sunil Jan 21 '16 at 14:31
  • I tried several things. Only --ip works, and I have to parse the IP from ifconfig, which is a hassle. Is there a way to make hostname resolution work properly in Vagrant? – Gordon Sun Jan 29 '16 at 11:44
  • But using --ip causes the console to not load for some reason – Gordon Sun Jan 29 '16 at 12:01
  • You need every box in the Vagrant setup to have fully resolvable hostnames from every other box in order to make a multi-node Vagrant work without setting --ip and --hostname. In particular, a machine needs to be able to look up its own hostname using DNS and get back an A record. Every other machine must be able to look up that hostname and get back exactly the same A record (see the sketch after these comments). – Firebird347 Feb 01 '16 at 02:24
  • Actually, make sure that you stop all zookeeper instances, and restart them afterwards... – Tobi Apr 12 '16 at 18:58
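
Following up on Firebird347's comment: one way to get fully resolvable hostnames across a multi-node Vagrant setup without running a DNS server is to give every box an identical /etc/hosts (the master IPs below come from the question; the names and the agent address are illustrative). Mesos resolves through the system resolver, so /etc/hosts entries generally satisfy the requirement:

10.0.0.2   master1.example.com  master1
10.0.0.6   master2.example.com  master2
10.0.0.7   master3.example.com  master3
10.0.0.5   slave1.example.com   slave1

Also make sure no 127.0.0.1 or 127.0.1.1 line maps a machine's own hostname to loopback, or that machine will announce 127.0.0.1 again.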