
Software:

Marathon 1.1.1, Mesos 0.28.1

Issue:

On occasion we've noticed Mesos sending Marathon a very low resource offer, which leaves applications stuck in a "WAITING" state. On the slave nodes, system resources are auto-detected by Mesos; they are not set via the '--resources' flag.
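(For reference, pinning the advertised resources explicitly on each agent would look roughly like the line below; the values are made up for illustration and other flags are omitted. Today we rely entirely on auto-detection.)

    mesos-slave --resources='cpus:4;mem:14000;disk:5000' ...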

Recent example:

A slave with 16 GB of memory, running 2 Docker containers with a total of 6 GB of memory allocated to them. Actual usage was ~400 MB. Getting on the box and checking free memory, I saw ~9 GB available on the machine. In the offers for that slave in Marathon, however, I saw a bit under 300 MB available, and since the container I wanted to deploy required 2 GB, the deployment got stuck. Restarting the slave cleared up the issue.

I've looked at the code that determines available memory (master/src/slave/containerizer/containerizer.cpp) and it does not contain any complex logic that would explain this behavior.
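For anyone else digging into this, below is a rough Python paraphrase of how I read that default memory probing (a sketch of my understanding, not the actual Mesos C++ code): when no mem is specified via --resources, the agent appears to advertise total memory minus roughly 1 GB (or half of total on very small machines), and it only does this once at startup.

    # Rough paraphrase (my reading, not the real Mesos code) of the default
    # "mem" resource the agent advertises when --resources does not set it.
    GB = 1024  # working in MB

    def default_advertised_mem(total_mb):
        # Leave ~1 GB for the OS/agent on normal boxes, otherwise advertise half.
        if total_mb >= 2 * GB:
            return total_mb - 1 * GB
        return total_mb // 2

    # For a 16 GB worker this comes out to ~15 GB (15360 MB) advertised at
    # startup, so the ~270 MB seen in the offer is presumably advertised minus
    # allocated as tracked by the master, not a fresh probe of free memory.
    print(default_advertised_mem(16 * GB))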

Has anyone observed similar behavior, and does anyone have suggestions on how I can improve the setup?

Log (ihlworkerslave1 is the node in question):

Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7322]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7322" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S2" } hostname: "ihlworkerslave1" resources { name: "ports" type: RANGES ranges { range { begin: 2552 end: 2552 } range { begin: 8888 end: 8889 } range { begin: 31000 end: 31021 } range { begin: 31024 end: 31419 } range { begin: 31423 end: 31709 } range { begin: 31711 end: 31907 } range { begin: 31909 end: 32000 } } role: "*" } resources { name: "cpus" type: SCALAR scalar { value: 3.1 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 270.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } attributes { name: "layer" type: TEXT t
Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7323]. Considering unreserved resources with roles {*}. Couldn't find host port 2552 (of 2552, 8888, 8889) in any offered range for app [/hc-manager] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-296)
Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7323]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7323" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S3" } hostname: "ihlworkerslave2" resources { name: "cpus" type: SCALAR scalar { value: 3.9 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 12558.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 31678 } range { begin: 31682 end: 32000 } } role: "*" } attributes { name: "layer" type: TEXT text { value: "worker" } } url { scheme: "http" address { hostname: "ihlworkerslave2." ip: "10.184.245.125" port: 5051 } path: "/slave(
Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7324]. Considering unreserved resources with roles {*}. Couldn't find host port 2552 (of 2552, 8888, 8889) in any offered range for app [/hc-manager] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-296)
Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7324]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7324" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S6" } hostname: "ihlworkerslave4" resources { name: "cpus" type: SCALAR scalar { value: 4.0 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 14606.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 32000 } } role: "*" } attributes { name: "layer" type: TEXT text { value: "worker" } } url { scheme: "http" address { hostname: "ihlworkerslave4." ip: "10.184.245.145" port: 5051 } path: "/slave(1)" }] (mesosphere.mesos.TaskBuild
Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7325]. Considering unreserved resources with roles {*}. Couldn't find host port 2552 (of 2552, 8888, 8889) in any offered range for app [/hc-manager] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-296)
Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7325]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7325" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S4" } hostname: "ihlworkerslave3" resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 32000 } } role: "*" } resources { name: "cpus" type: SCALAR scalar { value: 3.8 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 10510.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } attributes { name: "layer" type: TEXT text { value: "worker" } } attributes { name: "akkaseeder" type: TEXT text { value: "true" } } url { scheme: "http" address { hostname: "ihlworkerslave3." ip: "10.184.24
Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7326]. Considering unreserved resources with roles {*}. Couldn't find host port 2552 (of 2552, 8888, 8889) in any offered range for app [/hc-manager] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-296)
Jun  8 02:25:46 ihlmaster1 marathon[4805]: [2016-06-08 02:25:46,735] INFO Offer [a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7326]. Insufficient resources for [/hc-manager] (need cpus=0.1, mem=2048.0, disk=0.0, ports=([2552, 8888, 8889] required), available in offer: [id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-O7326" } framework_id { value: "d41742e6-1e26-4172-b687-9692c8d34732-0000" } slave_id { value: "a78f4eb9-382b-4a14-9d0f-43a032da5e57-S5" } hostname: "ihlworkerslave5" resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 32000 } } role: "*" } resources { name: "cpus" type: SCALAR scalar { value: 3.8 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 10510.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 5112.0 } role: "*" } attributes { name: "layer" type: TEXT text { value: "worker" } } attributes { name: "akkaseeder" type: TEXT text { value: "true" } } url { scheme: "http" address { hostname: "ihlworkerslave5." ip: "10.184.24

Thank you!

  • Could you post logs with the offers? They can be extracted from the Marathon and agent logs. – janisz Jun 09 '16 at 16:01
  • Absolutely, the first post is updated. Take a look at worker1. The other nodes are not suitable for placing this particular container for other (expected) reasons, but you can see that they advertise far more resources than the first node. (All are identical EC2 instances.) – dlyub Jun 09 '16 at 17:06
  • From your description worker1 should offer about 16 - (2*6) = 4 GB of RAM. I suspect the offer mechanism is working correctly but somehow Mesos doesn't see the memory. Could you check the memory reported by worker1? [metrics/snapshot](http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/), [monitor/statistics](http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/), and [slave/state](http://mesos.apache.org/documentation/latest/endpoints/slave/state/) could be useful. – janisz Jun 11 '16 at 22:27
  • I don't think your formula is accurate; I was expecting something more like 16 GB - 1 GB - 6 GB = 9 GB. 'free' indicated the amount of memory on the node was in line with my expectations. I've seen a couple of other posts with people reporting similar observations, but no solutions have been found yet. – dlyub Jun 13 '16 at 19:06
  • I'm sorry, I must have misunderstood you and assumed each container takes 6 GB rather than both of them together. Maybe you hit [MESOS-5380](https://issues.apache.org/jira/browse/MESOS-5380) and your executors were still alive and holding resources, but not running any application. – janisz Jun 13 '16 at 19:13
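Following up on the endpoints suggested in the comments, here is a small query sketch for pulling the agent's own view of memory. The hostname and the standard agent port 5051 are assumptions taken from the logs, and metric/field names may differ slightly between Mesos versions; it prints the agent's total/used memory and lists executors that are still registered and holding memory (relevant to the MESOS-5380 theory above).

    # Sketch only: dump the agent's reported memory and per-executor allocations.
    import json
    from urllib.request import urlopen

    AGENT = "http://ihlworkerslave1:5051"   # assumption: agent reachable by this name

    def fetch(path):
        with urlopen(AGENT + path) as resp:
            return json.loads(resp.read().decode())

    metrics = fetch("/metrics/snapshot")
    # Metric names may vary slightly between versions, hence .get().
    print("mem_total:", metrics.get("slave/mem_total"))
    print("mem_used: ", metrics.get("slave/mem_used"))

    state = fetch("/state")   # /state.json on older agents
    for framework in state.get("frameworks", []):
        for executor in framework.get("executors", []):
            # Executors that are still registered keep their resources allocated,
            # even if they are no longer running any task.
            print(executor.get("id"), executor.get("resources", {}).get("mem"))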

0 Answers