I have a six-machine cluster. The machines are:
HOST             MEM (GB)  CPU
mesos-primary-1  8         2
mesos-primary-2  8         2
mesos-primary-3  8         2
mesos-worker-1   1         1
mesos-worker-2   1         1
mesos-worker-3   1         1
My quorum size is set to 2.
The master machines have IDs 1, 2 and 3, respectively.
In the web UI, I have visited the individual IPs of mesos-primary-1, mesos-primary-2 and mesos-primary-3 on port 5050, and none of them redirects me to another master's IP.
The lack of a redirect leads me to believe that each machine thinks it is holding its own quorum, which would explain why they fail to see one another and elect a leader.
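To compare what each master reports without relying on the web UI redirect, something like the following could be used. This is only a sketch: it assumes each master's /master/state.json (the endpoint that shows up in the log further down) carries a "leader" field once a leader is known, that curl is installed, and that the third master sits on 104.131.117.124 as the --zk flag suggests. The function names leader_of and check_leaders are made up for illustration.

```shell
# Print the "leader" value from a master's state.json read on stdin.
# Prints nothing when the field is absent (that master knows of no leader).
# This is a plain text match, not a real JSON parser.
leader_of() {
    sed -n 's/.*"leader"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Ask every master which leader it sees.
# The IPs are assumed from the logs and the --zk flag.
check_leaders() {
    for ip in 159.203.90.171 104.131.35.19 104.131.117.124; do
        leader=$(curl -s --max-time 5 "http://$ip:5050/master/state.json" | leader_of)
        echo "$ip -> ${leader:-<none>}"
    done
}
```

Calling check_leaders on any machine that can reach the masters prints one line per master; three lines ending in <none> would match what I see in the browser.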
Visiting port 8080 on any of the machines resolves, but brings up an error because there is no elected leader.
$ cat /etc/mesos-master/quorum
outputs 2 on each master machine.
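As a sanity check on that value: the quorum is supposed to be a strict majority of the masters, so 2 is what I would expect for three masters. A minimal sketch of the arithmetic (quorum_for is just an illustrative helper name, not a Mesos tool):

```shell
# Majority quorum for a given number of masters: floor(n / 2) + 1.
quorum_for() {
    masters=$1
    echo $(( masters / 2 + 1 ))
}

quorum_for 3   # prints 2
```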
I have also stopped/restarted everything. On the master nodes:
$ sudo service mesos-master stop
$ sudo service marathon stop
$ sudo service zookeeper stop
$ sudo service mesos-master start
$ sudo service marathon start
$ sudo service zookeeper start
And on each of the slave machines:
$ sudo service mesos-slave stop
$ sudo service mesos-slave start
Still, none of the slaves are detected and no leader is elected.
My logs are clean on all 3 IPs (I fetched each one individually since there are no redirects). You can view each one here:
Log file created at: 2015/10/02 11:00:01
Running on machine: mesos-primary-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:01.532337 13722 logging.cpp:172] INFO level logging started!
I1002 11:00:01.532865 13722 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:01.532894 13722 main.cpp:231] Version: 0.24.1
I1002 11:00:01.532903 13722 main.cpp:234] Git tag: 0.24.1
I1002 11:00:01.532909 13722 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:01.533020 13722 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:01.546877 13722 leveldb.cpp:176] Opened db in 13.691496ms
I1002 11:00:01.550370 13722 leveldb.cpp:183] Compacted db in 2.522303ms
I1002 11:00:01.550559 13722 leveldb.cpp:198] Created db iterator in 118591ns
I1002 11:00:01.550618 13722 leveldb.cpp:204] Seeked to beginning of db in 1151ns
I1002 11:00:01.550642 13722 leveldb.cpp:273] Iterated through 0 keys in the db in 767ns
I1002 11:00:01.551029 13722 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:01.553994 13743 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:01.556193 13740 recover.cpp:449] Starting replica recovery
I1002 11:00:01.561755 13722 main.cpp:465] Starting Mesos master
I1002 11:00:01.563489 13740 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:01.568989 13722 master.cpp:378] Master 20151002-110001-2874854303-5050-13722 (159.203.90.171) started on 159.203.90.171:5050
I1002 11:00:01.569059 13722 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="159.203.90.171" --initialize_driver_logging="true" --ip="159.203.90.171" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:01.569535 13722 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:01.569581 13722 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:01.569608 13722 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:01.569718 13722 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:01.570199 13722 authenticator.cpp:512] Initializing server SASL
I1002 11:00:01.582969 13722 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:01.584786 13743 contender.cpp:149] Joining the ZK group
I1002 11:00:11.573873 13747 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
I1002 11:01:06.547200 13743 http.cpp:321] HTTP GET for /master/state.json from 173.243.85.102:51963 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
Log file created at: 2015/10/02 11:00:12
Running on machine: mesos-primary-3
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:12.609675 17105 logging.cpp:172] INFO level logging started!
I1002 11:00:12.610414 17105 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:12.610452 17105 main.cpp:231] Version: 0.24.1
I1002 11:00:12.610468 17105 main.cpp:234] Git tag: 0.24.1
I1002 11:00:12.610483 17105 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:12.610576 17105 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:12.618232 17105 leveldb.cpp:176] Opened db in 7.382537ms
I1002 11:00:12.619810 17105 leveldb.cpp:183] Compacted db in 1.512691ms
I1002 11:00:12.619876 17105 leveldb.cpp:198] Created db iterator in 27030ns
I1002 11:00:12.619910 17105 leveldb.cpp:204] Seeked to beginning of db in 1254ns
I1002 11:00:12.619925 17105 leveldb.cpp:273] Iterated through 0 keys in the db in 339ns
I1002 11:00:12.620028 17105 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:12.620930 17125 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:12.621615 17128 recover.cpp:449] Starting replica recovery
I1002 11:00:12.626735 17105 main.cpp:465] Starting Mesos master
I1002 11:00:12.627024 17128 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:12.633635 17123 master.cpp:378] Master 20151002-110012-321094504-5050-17105 (104.131.35.19) started on 104.131.35.19:5050
I1002 11:00:12.633828 17123 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="104.131.35.19" --initialize_driver_logging="true" --ip="104.131.35.19" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:12.635736 17123 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:12.635771 17123 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:12.635802 17123 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:12.635835 17123 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:12.636078 17123 authenticator.cpp:512] Initializing server SASL
I1002 11:00:12.643378 17125 contender.cpp:149] Joining the ZK group
I1002 11:00:12.643826 17123 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:22.633390 17130 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
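The one line that stands out in each log, apart from the authenticator warning, is the recover.cpp message "Unable to finish the recover protocol in 10secs, retrying". To see how often a master hits it, a small sketch like this could be run on each master. The helper name recover_retries is made up; the log path in the example comes from the "Successfully attached file" line in the logs.

```shell
# Count how often a master log reports the recover-protocol timeout.
# Prints 0 when the message never appears.
recover_retries() {
    grep -c 'Unable to finish the recover protocol' "$1"
}

# Example:
# recover_retries /var/log/mesos/mesos-master.INFO
```

A count that keeps growing while the master is running would mean the retry never succeeds.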
I set up the machines following the guidelines in this DigitalOcean guide.
Running
$ MASTER=$(mesos-resolve `cat /etc/mesos/zk`)
$ mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"
outputs:
2015-10-02 12:30:26,137:14558(0x7f8dbb743700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@716: Client environment:host.name=mesos-primary-1
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-57-generic
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@725: Client environment:os.version=#95-Ubuntu SMP Fri Jun 19 09:28:15 UTC 2015
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@753: Client environment:user.dir=/root
2015-10-02 12:30:26,142:14558(0x7f8dbb743700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181 sessionTimeout=10000 watcher=0x7f8dc3625610 sessionId=0 sessionPasswd=<null> context=0x7f8da8003960 flags=0
2015-10-02 12:30:26,142:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.35.19:2181]
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.35.19:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.117.124:2181]
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.117.124:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:26,145:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [159.203.90.171:2181]
2015-10-02 12:30:26,147:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [159.203.90.171:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,484:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.35.19:2181]
2015-10-02 12:30:29,485:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.35.19:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,485:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.117.124:2181]
2015-10-02 12:30:29,486:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.117.124:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,487:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [159.203.90.171:2181]
2015-10-02 12:30:29,488:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [159.203.90.171:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Failed to detect master from 'zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos' within 5secs
root@mesos-primary-1:~# mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"
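Since every ZooKeeper connection attempt in that output fails, one thing I can check is whether the ZooKeeper servers answer at all. A rough sketch, assuming nc is installed and that ZooKeeper's four-letter "ruok" command is available (a healthy server replies "imok"). The helper names zk_hosts and check_zk are made up for illustration.

```shell
# Split a zk:// URL into one host:port per line.
zk_hosts() {
    echo "$1" | sed -e 's|^zk://||' -e 's|/.*$||' | tr ',' '\n'
}

# Send "ruok" to each ZooKeeper server listed in the URL and print the reply.
check_zk() {
    for hp in $(zk_hosts "$1"); do
        host=${hp%:*}
        port=${hp#*:}
        reply=$(echo ruok | nc -w 2 "$host" "$port")
        echo "$hp -> ${reply:-no reply}"
    done
}

# Example:
# check_zk "$(cat /etc/mesos/zk)"
```

A "no reply" for all three servers would point at ZooKeeper itself rather than at the Mesos masters.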
Does anyone have any ideas?