Env:

  ZooKeeper on computer A,
  Mesos master on computer B (current leader),
  Mesos master on computer C,
  Marathon on computer B (singleton).

Action:

Kill the Mesos master process on computer B to trigger a Mesos cluster leader change.

Result:

  The Mesos cluster leader changes to the Mesos master on computer C,
  but the Marathon process shuts down automatically with the following logs.

Question:

Can somebody help me understand why Marathon went down, and how to fix it?

Logs:

I1109 12:19:10.010197 11287 detector.cpp:152] Detected a new leader: (id='9')
I1109 12:19:10.010646 11291 group.cpp:699] Trying to get '/mesos/json.info_0000000009' in ZooKeeper
I1109 12:19:10.013425 11292 zookeeper.cpp:262] A new leading master (UPID=master@10.4.23.55:5050) is detected
[2017-11-09 12:19:10,015] WARN  Disconnected (mesosphere.marathon.MarathonScheduler:Thread-23)
I1109 12:19:10.018977 11292 sched.cpp:2021] Asked to stop the driver
I1109 12:19:10.019161 11292 sched.cpp:336] New master detected at master@10.4.23.55:5050
I1109 12:19:10.019892 11292 sched.cpp:1203] Stopping framework d52cbd8c-1015-4d94-8328-e418876ca5b2-0000
[2017-11-09 12:19:10,020] INFO  Driver future completed with result=Success(()). (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO  Abdicating leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO  Stopping the election service (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,029] INFO  backgroundOperationsLoop exiting (org.apache.curator.framework.imps.CuratorFrameworkImpl:Curator-Framework-0)
[2017-11-09 12:19:10,061] INFO  Session: 0x15f710ffb010058 closed (org.apache.zookeeper.ZooKeeper:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,061] INFO  EventThread shut down for session: 0x15f710ffb010058 (org.apache.zookeeper.ClientCnxn:pool-3-thread-1-EventThread)
[2017-11-09 12:19:10,063] INFO  Stopping MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,063] INFO  Lost leadership (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,066] INFO  All actors suspended:
* Actor[akka://marathon/user/offerMatcherStatistics#-1904211014]
* Actor[akka://marathon/user/reviveOffersWhenWanted#-238627718]
* Actor[akka://marathon/user/expungeOverdueLostTasks#608979053]
* Actor[akka://marathon/user/launchQueue#803590575]
* Actor[akka://marathon/user/offersWantedForReconciliation#598482724]
* Actor[akka://marathon/user/offerMatcherLaunchTokens#813230776]
* Actor[akka://marathon/user/offerMatcherManager#1205401692]
* Actor[akka://marathon/user/instanceTracker#1055980147]
* Actor[akka://marathon/user/killOverdueStagedTasks#-40058350]
* Actor[akka://marathon/user/taskKillServiceActor#-602552505]
* Actor[akka://marathon/user/rateLimiter#-911383474]
* Actor[akka://marathon/user/deploymentManager#2013376325] (mesosphere.marathon.core.leadership.impl.LeadershipCoordinatorActor:marathon-akka.actor.default-dispatcher-10)
I1109 12:19:10.069551 11272 sched.cpp:2021] Asked to stop the driver
[2017-11-09 12:19:10,068] INFO  Stopping driver (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,069] INFO  Stopped MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,070] INFO  Terminating due to leadership abdication or failure (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,071] INFO  Call postDriverRuns callbacks on  (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,074] INFO  Now standing by. Closing existing handles and rejecting new. (mesosphere.marathon.core.event.impl.stream.HttpEventStreamActor:marathon-akka.actor.default-dispatcher-12)
[2017-11-09 12:19:10,074] INFO  Suspending scheduler actor (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-2)
[2017-11-09 12:19:10,083] INFO  Finished postDriverRuns callbacks (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,084] INFO  ExpungeOverdueLostTasksActor has stopped (mesosphere.marathon.core.task.jobs.impl.ExpungeOverdueLostTasksActor:marathon-akka.actor.default-dispatcher-9)
[1]+  Exit 137

2 Answers

I think there is a misconfiguration of the ZooKeeper cluster. Use a 3-node ZooKeeper cluster and 2 Mesos masters with multiple slaves; a rough sketch is below. Ref: https://www.google.co.in/amp/s/beingasysadmin.wordpress.com/2014/08/16/managing-ha-docker-cluster-using-multiple-mesos-masters/amp/
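
For illustration, a quorum setup along those lines could look roughly like this; the hostnames zk1/zk2/zk3 and the directories are placeholders, not values from the question:

# zoo.cfg on each of the three ZooKeeper nodes
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

# every Mesos master points at the full ensemble (not a single ZooKeeper node),
# with --quorum set to a majority of the master count
mesos-master --zk=zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/mesos \
             --quorum=2 --work_dir=/var/lib/mesos

Marathon should be given the same zk:// ensemble URL as its --master value, so it can follow leader changes through ZooKeeper instead of losing its only known master.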

sanath meti
  • Thanks for your help, I will try it! – user8911259 Nov 23 '17 at 06:51
  • After doing as you said, it is still not working. I started a 3-node ZooKeeper cluster on 3 computers and 3 Mesos masters on 3 computers; when I change the Mesos leader by killing the current leader, Marathon still shuts down. – user8911259 Dec 04 '17 at 04:42

Did you set the masters reference in the Marathon conf? Can you run:

cat /etc/marathon/conf/master
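
(For reference: when Marathon is installed from the Mesosphere packages, each file under /etc/marathon/conf is typically read as the command-line flag of the same name, so the master file would contain just the zk:// URL of the ensemble on a single line. The hosts below are placeholders, not values from the question.)

zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/mesos
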
Dino L.
  • It prints nothing; I was using command-line flags to start Marathon: `nohup ${SHELL_FILE}/start ${mesos_authentication} ${mesos_authentication_principal} ${mesos_authentication_secret} ${local_port_min} ${mesos_role} ${mesos_user} ${offer_matching_timeout} ${save_tasks_to_launch_timeout} --http_port ${HTTP_PORT} --https_port ${HTTPS_PORT} ${MASTER} ${HA_MODE} ${zk_timeout} ${zk_session_timeout} --hostname ${IP} --http_credentials ${HTTP_CREDENTIALS} ${OPTIONS} > ${LOG_FILE} 2>&1 & echo $! > pid` (a sketch of what ${MASTER} and ${HA_MODE} would expand to is below these comments). Maybe configuring the flags via files under /etc/marathon will work; I will try it. – user8911259 Dec 04 '17 at 04:45
  • Check /var/log/syslog for something like `Initiating client connection, connectString=` after the `Client environment:...` lines. – Dino L. Dec 04 '17 at 09:36
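
As a rough sketch only (placeholder hostnames, not the asker's real values), in an HA setup the ${MASTER} and ${HA_MODE} variables from the start command above would typically expand to flags like these, pointing Marathon at the whole ZooKeeper ensemble rather than a single master host:

# hypothetical expansion of the script variables
MASTER="--master zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/mesos"
HA_MODE="--ha"
# Marathon's own state should use the same ensemble, e.g.
# --zk zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/marathon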