
I have been running Marathon, Mesos and Docker together without trouble, but I recently discovered a problem. When mesos-slave encounters an exception, the state of the task in Marathon changes to TASK_LOST, and the task cannot be killed until about 15 minutes have passed.

I tested this by manually rebooting the operating system that runs the mesos-slave service, Docker and the task. The task state shown in the Marathon UI then became "Unscheduled (100%)", and the task could not be killed, either automatically or manually, until about 15 minutes had passed. My question is: how can I reduce this time? I tried adding the following Marathon startup command-line arguments:

task_launch_confirm_timeout=30000
scale_apps_interval=30000
task_lost_expunge_initial_delay=30000
task_launch_timeout = 30000

and the following mesos-slave startup command-line argument:

recovery_timeout=1mins

but this did not reduce the delay.

Colwin
Jackie

1 Answer


To change how long executors wait before terminating themselves after the Mesos agent process has failed, configure --recovery_timeout on the agent:

Amount of time allotted for the agent to recover. If the agent takes longer than recovery_timeout to recover, any executors that are waiting to reconnect to the agent will self-terminate. (default: 15mins)
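As a rough illustration of how this flag might be combined with the `unreachableStrategy` setting discussed in the comments, here is a minimal sketch. The app id, command, resource sizes and the 60/120 second values are hypothetical placeholders, and `agent_flags` is an illustrative helper, not part of Mesos or Marathon. It assumes a Marathon version whose app definition accepts `unreachableStrategy` with `inactiveAfterSeconds` and `expungeAfterSeconds`.

```python
import json

# Hypothetical Marathon app definition. Only the "unreachableStrategy"
# block matters here; every other value is a placeholder.
app = {
    "id": "/example-app",
    "cmd": "sleep 3600",
    "instances": 1,
    "cpus": 0.1,
    "mem": 32,
    "unreachableStrategy": {
        # Seconds before an unreachable task is treated as inactive,
        # so that a replacement may be started (assumed values).
        "inactiveAfterSeconds": 60,
        # Seconds before the unreachable task is expunged (assumed values).
        "expungeAfterSeconds": 120,
    },
}


def agent_flags(recovery_timeout="1mins"):
    """Build the agent flag from the answer (illustrative helper)."""
    return ["--recovery_timeout=" + recovery_timeout]


# JSON body that could be submitted to Marathon as the app definition.
print(json.dumps(app, indent=2))

# Command line for the agent, using the value tried in the question.
print(" ".join(["mesos-slave"] + agent_flags()))
```

The two settings act on different sides: `recovery_timeout` limits how long executors wait for the agent to come back, while `unreachableStrategy` tells Marathon when to stop waiting for the unreachable task.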

janisz
  • It seems the unreachable strategy doesn't work for me. Additionally, I am sorry that I forgot to give my versions: Marathon 1.4.3, mesos-master & mesos-slave 1.1.0 – Jackie May 22 '17 at 03:59
  • Did I misunderstand the question? Are you asking how to reduce the time until the task is killed when mesos-agent fails? I changed the answer because the unreachable strategy tells Marathon how it should handle the situation, while `recovery_timeout` controls the time you are asking about. – janisz May 22 '17 at 08:35
  • My question is: when the machine goes down (such as a sudden reboot), I want to kill the task quickly and start a new task on another machine. But when this situation occurs, I found that the task cannot be killed and rescaled! – Jackie May 22 '17 at 10:08
  • So you need to combine `unreachableStrategy` and `recovery_timeout`. The task will terminate itself if it can't connect to the agent, and Marathon should start a new task when the task is lost. – janisz May 22 '17 at 10:21
  • Thank you, @janisz. I am sorry that my English is not so good, but I have re-edited my question so that you can understand it more clearly. I look forward to your help. The new question link is https://stackoverflow.com/questions/44113232/how-to-auto-launch-new-task-instance-when-mesos-slave-stopped – Jackie May 22 '17 at 12:47