How to reduce the task kill period when the task state is TASK_LOST?

Question

I have been running Marathon, Mesos, and Docker together without trouble, but I recently discovered a problem. When mesos-slave encounters an exception, the state of the task in Marathon changes to TASK_LOST, and the task cannot be killed until about 15 minutes have passed.

I tested this by manually rebooting the operating system that runs the mesos-slave service, Docker, and the task. The task state shown in the Marathon UI became "Unscheduled (100%)", and the task could not be killed, either automatically or manually, until about 15 minutes had passed. My question is: how can I reduce this time? I tried adding these Marathon startup command-line arguments:

task_launch_confirm_timeout=30000
scale_apps_interval = 30000
task_lost_expunge_initial_delay = 30000
task_launch_timeout = 30000

and adding this mesos-slave startup command-line argument:

recovery_timeout=1mins

but it doesn't work for me.

janisz · Answer 1 · 2017-05-22T08:33:27.040

2

To change how long executors wait before they commit suicide after the Mesos agent process fails, configure --recovery_timeout:

Amount of time allotted for the agent to recover. If the agent takes longer than recovery_timeout to recover, any executors that are waiting to reconnect to the agent will self-terminate. (default: 15mins)
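As a rough illustration of passing that flag, here is a minimal Python sketch that assembles an agent command line with an explicit recovery timeout. The master address and work directory are placeholders I made up, and the helper name `agent_command` is hypothetical; only `--recovery_timeout` comes from the text above, and `--master` and `--work_dir` are the usual agent flags.

```python
import shlex

# Placeholder values for illustration only; substitute your own.
MASTER = "zk://zk1.example.com:2181/mesos"
WORK_DIR = "/var/lib/mesos"


def agent_command(recovery_timeout="1mins"):
    """Build a mesos-slave argument list with an explicit recovery timeout.

    Flags are written as --name=value; the duration uses a Mesos-style
    unit suffix such as "secs" or "mins".
    """
    return [
        "mesos-slave",
        "--master=" + MASTER,
        "--work_dir=" + WORK_DIR,
        "--recovery_timeout=" + recovery_timeout,
    ]


if __name__ == "__main__":
    # Print the command so it can be pasted into a service definition.
    print(shlex.join(agent_command()))
```

Note that the flag is written with a leading `--`; if your packaging reads agent options from files or environment variables instead, the same value would go there.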

edited May 22 '17 at 08:33

answered May 19 '17 at 10:00


It seems the unreachable strategy doesn't work for me. Additionally, I am sorry that I forgot to give my versions: Marathon 1.4.3, mesos-master & mesos-slave 1.1.0 – Jackie May 22 '17 at 03:59
Did I misunderstand the question? Are you asking how to reduce the time before a task is killed when mesos-agent fails? I changed the answer because the unreachable strategy tells Marathon how it should handle the lost task, while `recovery_timeout` controls the time you are asking about. – janisz May 22 '17 at 08:35
My question is: when the machine goes down (such as a sudden reboot), I want to kill the task quickly and start a new task on another machine. But when this situation occurs, I found that the task cannot be killed and rescaled! – Jackie May 22 '17 at 10:08
So you need to combine `unreachableStrategy` and `recovery_timeout`. The task will commit suicide if it can't connect to the agent, and Marathon should start a new task when the task is lost. – janisz May 22 '17 at 10:21
Thank you, @janisz. I am sorry that my English is not so good, but I have re-edited my question so that you can understand it more clearly. I look forward to your help. The new question link is https://stackoverflow.com/questions/44113232/how-to-auto-launch-new-task-instance-when-mesos-slave-stopped – Jackie May 22 '17 at 12:47
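
For the Marathon half of the combination suggested in the comments, here is a minimal Python sketch that builds an app definition carrying an `unreachableStrategy` block. The field names `inactiveAfterSeconds` and `expungeAfterSeconds` are what I understand the Marathon app definition to use, so check them against the documentation for your Marathon version. The app id, command, resource numbers, and timeout values are made up for illustration.

```python
import json


def app_with_unreachable_strategy(app_id, inactive_after=60, expunge_after=120):
    """Return a Marathon app definition (as a dict) with short unreachable timeouts.

    inactive_after: seconds an instance may be unreachable before Marathon
                    treats it as inactive and may launch a replacement.
    expunge_after:  seconds an instance may be unreachable before Marathon
                    expunges it.
    Field names are an assumption; verify them for your Marathon version.
    """
    return {
        "id": app_id,
        "cmd": "sleep 3600",  # placeholder workload
        "cpus": 0.1,
        "mem": 32,
        "instances": 1,
        "unreachableStrategy": {
            "inactiveAfterSeconds": inactive_after,
            "expungeAfterSeconds": expunge_after,
        },
    }


if __name__ == "__main__":
    # The resulting JSON is what would be submitted as the app definition.
    print(json.dumps(app_with_unreachable_strategy("/demo/sleeper"), indent=2))
```

The agent-side `recovery_timeout` and these per-app values are set in different places, so both have to be lowered for the combination to take effect.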
