Software:
Marathon 1.1.1, Mesos 0.28.1
Issue:
On occasion we've noticed a very low resource offer coming from Mesos to Marathon, which results in applications getting stuck in a "WAITING" state. On the slave node, system resources are detected by Mesos itself and are not set via the 'resources' flag.
Recent example:
Slave with 16GB of memory, running 2 Docker containers with a total of 6GB of memory allocated to them. Actual usage was ~400MB. Logging on to the box and checking free memory, I saw ~9GB available on the machine. In the offers for that slave in Marathon, I saw a bit under 300MB available, and since the container I wanted to deploy required 2GB, the application deployment got stuck. Restarting the slave cleared up the issue.
I've looked at the code for determining available memory, and it does not contain any complex logic that might explain this behavior (master/src/slave/containerizer/containerizer.cpp).
Does anyone who has observed similar behavior have any suggestions on how I can improve the setup?
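To make the size of the discrepancy concrete, here is a minimal sketch of the arithmetic, assuming offers are derived from total memory minus allocated memory (not from actual usage or free memory on the box). The function name and the `held_back_mb` parameter are hypothetical, added only for illustration; the figures are the ones quoted above.

```python
MB_PER_GB = 1024

def expected_offer_mb(total_mb, allocated_mb, held_back_mb=0):
    """Memory one would expect to be offered, in MB.

    held_back_mb is a hypothetical knob for any memory the slave keeps
    for the system; it defaults to 0 because the exact amount is not
    stated in this post.
    """
    return max(total_mb - held_back_mb - allocated_mb, 0)

total_mb = 16 * MB_PER_GB        # slave with 16GB of memory
allocated_mb = 6 * MB_PER_GB     # 6GB allocated to the 2 containers
observed_upper_mb = 300          # "a bit under 300MB" seen in the offer

expected_mb = expected_offer_mb(total_mb, allocated_mb)
shortfall_mb = expected_mb - observed_upper_mb

print(f"expected about {expected_mb} MB, observed under {observed_upper_mb} MB")
print(f"at least {shortfall_mb} MB unaccounted for")
```

Under that assumption roughly 10GB should have been offered, so the ~400MB of actual usage and the ~9GB of free memory do not explain the gap either way.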
Log (ihlworkerslave1 is the node in question):
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7322]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7322" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S2" } hostname: "ihlworkerslave1" resources { name: "ports" type: RANGES ranges { range { begin: 2552 end: 2552 } range { begin: 8888 end: 8889 } range { begin: 31000 end: 31021 } range { begin: 31024 end: 31419 } range { begin: 31423 end: 31709 } range { begin: 31711 end: 31907 } range { begin: 31909 end: 32000 } } role: "*" } resources { name: "cpus" type: SCALAR scalar { value: 3.1 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 270.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } attributes { name: "layer" type: TEXT t
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7323]. Considering unreserved resources with roles {*}. Couldn't find host port 2552 (of 2552, 8888, 8889) in any offered range for app [/hc-manager] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-296)
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7323]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7323" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S3" } hostname: "ihlworkerslave2" resources { name: "cpus" type: SCALAR scalar { value: 3.9 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 12558.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 31678 } range { begin: 31682 end: 32000 } } role: "*" } attributes { name: "layer" type: TEXT text { value: "worker" } } url { scheme: "http" address { hostname: "ihlworkerslave2." ip: "10.184.245.125" port: 5051 } path: "/slave(
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7324]. Considering unreserved resources with roles {*}. Couldn't find host port 2552 (of 2552, 8888, 8889) in any offered range for app [/hc-manager] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-296)
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7324]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7324" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S6" } hostname: "ihlworkerslave4" resources { name: "cpus" type: SCALAR scalar { value: 4.0 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 14606.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 32000 } } role: "*" } attributes { name: "layer" type: TEXT text { value: "worker" } } url { scheme: "http" address { hostname: "ihlworkerslave4." ip: "10.184.245.145" port: 5051 } path: "/slave(1)" }] (mesosphere.mesos.TaskBuild
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7325]. Considering unreserved resources with roles {*}. Couldn't find host port 2552 (of 2552, 8888, 8889) in any offered range for app [/hc-manager] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-296)
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7325]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7325" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S4" } hostname: "ihlworkerslave3" resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 32000 } } role: "*" } resources { name: "cpus" type: SCALAR scalar { value: 3.8 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 10510.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } attributes { name: "layer" type: TEXT text { value: "worker" } } attributes { name: "akkaseeder" type: TEXT text { value: "true" } } url { scheme: "http" address { hostname: "ihlworkerslave3." ip: "10.184.24
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7326]. Considering unreserved resources with roles {*}. Couldn't find host port 2552 (of 2552, 8888, 8889) in any offered range for app [/hc-manager] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-296)
Jun 8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7326]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7326" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S5" } hostname: "ihlworkerslave5" resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 32000 } } role: "*" } resources { name: "cpus" type: SCALAR scalar { value: 3.8 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 10510.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } attributes { name: "layer" type: TEXT text { value: "worker" } } attributes { name: "akkaseeder" type: TEXT text { value: "true" } } url { scheme: "http" address { hostname: "ihlworkerslave5." ip: "10.184.24
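For comparing the slaves side by side, the following sketch pulls the hostname and offered "mem" value out of log lines in the format shown above. The sample strings are shortened copies of two of the lines; the function name and regexes are illustrative, not part of Marathon. The last step assumes that all workers have identical hardware and that ihlworkerslave4 is idle (it offers 4.0 cpus and the full 31000-32000 port range), neither of which is confirmed here.

```python
import re

# Patterns matching the protobuf text format seen in the Marathon log lines.
HOST_RE = re.compile(r'hostname: "([^"]+)"')
MEM_RE = re.compile(r'name: "mem" type: SCALAR scalar \{ value: ([0-9.]+) \}')

def offered_mem_by_host(lines):
    """Map each offer's hostname to the memory (MB) in that offer."""
    offers = {}
    for line in lines:
        host = HOST_RE.search(line)   # first match is the offer's hostname
        mem = MEM_RE.search(line)
        if host and mem:
            offers[host.group(1)] = float(mem.group(1))
    return offers

# Shortened excerpts of the "Insufficient resources" lines above.
sample_lines = [
    'hostname: "ihlworkerslave1" resources { name: "cpus" type: SCALAR '
    'scalar { value: 3.1 } role: "*" } resources { name: "mem" type: SCALAR '
    'scalar { value: 270.0 } role: "*" }',
    'hostname: "ihlworkerslave4" resources { name: "cpus" type: SCALAR '
    'scalar { value: 4.0 } role: "*" } resources { name: "mem" type: SCALAR '
    'scalar { value: 14606.0 } role: "*" }',
]

offers = offered_mem_by_host(sample_lines)

# Assumption: ihlworkerslave4 is idle and has the same hardware, so its
# offer approximates the total memory Mesos advertises per worker.
implied_allocated_mb = offers["ihlworkerslave4"] - offers["ihlworkerslave1"]
print(offers)
print(f"implied allocation on ihlworkerslave1: {implied_allocated_mb} MB")
```

If those assumptions hold, Mesos is accounting for 14336 MB (exactly 14 x 1024) as allocated on ihlworkerslave1, rather than the 6GB allocated to the two running containers.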
Thank you!