
From the documentation, it seems that when a Mesos node goes into maintenance mode, the master sends inverse offers to all of the frameworks. My interpretation is that frameworks such as Marathon should receive those inverse offers and work to migrate tasks off of the node scheduled for maintenance.

I schedule maintenance for 60 seconds from now using the API:

curl -X POST leader.mesos:5050/maintenance/schedule \
  --data '{"windows": [{"machine_ids":[{"hostname": "host43.local"}], "unavailability": {"start": {"nanoseconds": '$(($(date +%s) + 60))'000000000}, "duration": {"nanoseconds": 3600000000000}}}]}'

Then, I query the maintenance status and can confirm that it is draining:

$ curl leader.mesos:5050/maintenance/status | jq .
{
  "draining_machines": [
    {
      "id": {
        "hostname": "host43.local"
      }
    }
  ]
}

Finally, as the window approaches, I mark the machine as down:

curl -X POST leader.mesos:5050/machine/down --data '[{"hostname": "host43.local"}]'
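(Mesos also exposes the inverse call, /machine/up, which takes the same machine list and would bring the machine back once maintenance is complete:

curl -X POST leader.mesos:5050/machine/up --data '[{"hostname": "host43.local"}]'

)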

I confirm that it took effect:

$ curl leader.mesos:5050/maintenance/status | jq .
{
  "down_machines": [
    {
      "hostname": "hsot43.local"
    }
  ]
}

Then, I check Marathon (via the UI) and see that there are still tasks running on host43.local.

I see this error message in the Marathon logs, and I wonder if it is related:

May 12 11:46:02 host43.local start[126170]: [2016-05-12 11:46:02,581] ERROR not currently active (Actor[akka://marathon/user/taskTracker#-1732573467]) (akka.actor.OneForOneStrategy:marathon-akka.actor.default-dispatcher-17)
May 12 11:46:02 host43.local start[126170]: java.lang.IllegalStateException: not currently active (Actor[akka://marathon/user/taskTracker#-1732573467])
May 12 11:46:02 host43.local start[126170]: at mesosphere.marathon.core.leadership.impl.WhenLeaderActor$$anonfun$1.applyOrElse(WhenLeaderActor.scala:38) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at akka.actor.Actor$class.aroundReceive(Actor.scala:465) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at mesosphere.marathon.core.leadership.impl.WhenLeaderActor.aroundReceive(WhenLeaderActor.scala:20) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at akka.actor.ActorCell.invoke(ActorCell.scala:487) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at akka.dispatch.Mailbox.run(Mailbox.scala:221) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at akka.dispatch.Mailbox.exec(Mailbox.scala:231) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [marathon-assembly-1.1.1.jar:1.1.1]
May 12 11:46:02 host43.local start[126170]: [2016-05-12 11:46:02,581] INFO Killing 1 instances from 1 (mesosphere.marathon.upgrade.TaskKillActor:marathon-akka.actor.default-dispatcher-17)

If I manually kill the tasks with Marathon, the replacements don't appear to get allocated onto the node undergoing maintenance. It seems like the expected behavior is that tasks are automatically migrated off the node, so I don't know whether I'm doing something wrong, whether I've hit a bug, or whether I'm misinterpreting the documentation and the expected behavior.
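As a further check that Mesos itself is withholding offers from the machine, the agent's state can be inspected on the master. A quick sketch (the jq filter is my own, and it assumes the /master/slaves endpoint reports an active flag per agent, which it does in the versions I've looked at):

curl -s leader.mesos:5050/master/slaves | jq '.slaves[] | select(.hostname == "host43.local") | {hostname, active}'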

Running Marathon 1.1.1 and Mesos 0.28


1 Answer


I received an answer in the DC/OS Slack chat room and am posting it here for the benefit of others: Marathon does not yet support the Mesos maintenance primitives.

The following JIRA ticket tracks the feature:

https://jira.mesosphere.com/browse/MARATHON-3216
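Until that lands, a drain can be approximated by hand against Marathon's task-kill endpoint. A rough sketch, assuming Marathon is reachable at leader.mesos:8080: DELETE /v2/apps/{appId}/tasks accepts host and scale filters, and with scale=false Marathon relaunches the killed tasks, which end up elsewhere because the down machine receives no offers.

HOST=host43.local
for app in $(curl -s leader.mesos:8080/v2/apps | jq -r '.apps[].id'); do
  # App IDs begin with "/", so this expands to /v2/apps/<app>/tasks
  curl -s -X DELETE "leader.mesos:8080/v2/apps${app}/tasks?host=${HOST}&scale=false"
done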
