
I have one Mesos master and two Mesos slaves.

The 2 slaves are located:

  • 1 in my LAN (mesos-slave-on-LAN)
  • 1 in a public Cloud (mesos-slave-on-WAN)

The master can be reached:

  • On LAN at: 10.1.10.175
  • On WAN at: 94.141.153.57

I can see the 2 Mesos Slaves registered in Mesos.

I am trying to run two types of dummy tasks:

  • A simple Python command

    {
      "id": "my-first-app",
      "cmd": "python -m SimpleHTTPServer 8009",
      "cpus": 0.01,
      "mem": 20.0,
      "instances": 1,
      "acceptedResourceRoles": [
        "slave_public",
        "*"
      ],
      "ports": [8009],
      "requirePorts": true
    }
    
  • A Docker container deployment (with Nginx inside)

    {
      "id": "nginx-test",
      "container": {
        "docker": {
          "image": "nginx",
          "network": "BRIDGE",
          "portMappings": [{
            "containerPort": 80,
            "hostPort": 0,
            "servicePort": 80,
            "protocol": "tcp"
          }]
        },
        "type": "DOCKER",
        "volumes": []
      },
      "healthChecks": [{
        "protocol": "HTTP",
        "portIndex": 0,
        "path": "/",
        "gracePeriodSeconds": 5,
        "intervalSeconds": 20,
        "maxConsecutiveFailures": 3
      }],
      "cpus": 0.2,
      "mem": 32.0,
      "instances": 1
    }
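
The first app asks for the fixed host port 8009 with `requirePorts: true`, so Marathon can only place it on an agent whose offer includes that port. Below is a minimal sketch of that check. It assumes the agents advertise the Mesos default port range 31000-32000, which is an assumption because no `--resources` flag appears in the commands in this post. `DEFAULT_AGENT_PORTS` and `port_is_offered` are illustrative names, not part of Mesos or Marathon.

```python
from typing import Iterable, Tuple

# Assumption: agents started without --resources advertise this default range.
DEFAULT_AGENT_PORTS = [(31000, 32000)]


def port_is_offered(port: int, ranges: Iterable[Tuple[int, int]]) -> bool:
    """Return True if `port` lies inside one of the inclusive (low, high) ranges."""
    return any(low <= port <= high for low, high in ranges)


# Fixed host port requested by "my-first-app".
requested_port = 8009
fits = port_is_offered(requested_port, DEFAULT_AGENT_PORTS)
```

If that assumption holds, 8009 falls outside the offered range, which could be one reason the Python task stays in 'Waiting'.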

Case 1: Mesos-Master advertising the LAN IP:

Mesos Master:

/usr/sbin/mesos-master --work_dir=/var/lib/mesos --zk=zk://mesos_machine:2181/mesos --quorum=1 --log_dir=/var/log/mesos --external_log_file=/dev/stdout --advertise_ip=10.1.10.175

Mesos Slave on LAN:

/usr/sbin/mesos-slave --master=10.1.10.175:5050 --work_dir=/var/lib/mesos/agent  --containerizers=docker,mesos  --executor_registration_timeout=3mins --log_dir=/var/log/mesos

Mesos Slave on WAN:

/usr/sbin/mesos-slave --master=94.141.153.57:5050 --work_dir=/var/lib/mesos/agent --containerizers=docker,mesos --executor_registration_timeout=3mins --log_dir=/var/log/mesos

I get the following matrix when I run the configuration above and successfully stop one of the slaves:

|           | LAN Slave     | Cloud Slave   |
|--------   |-----------    |-------------  |
| Python    | 'Waiting'     | 'Waiting'     |
| Docker    | 'RUNNING'     | 'Waiting'     |

Here the Nginx app has been deployed by Marathon on mesos-slave-on-LAN.

Case 2: Mesos-Master advertising the WAN IP:

Mesos Master:

/usr/sbin/mesos-master --work_dir=/var/lib/mesos --zk=zk://mesos_machine:2181/mesos --quorum=1 --log_dir=/var/log/mesos --external_log_file=/dev/stdout --advertise_ip=94.141.153.57

Mesos Slave on LAN (using the master's LAN IP, otherwise it is not added to Mesos):

/usr/sbin/mesos-slave --master=10.1.10.175:5050 --work_dir=/var/lib/mesos/agent  --containerizers=docker,mesos  --executor_registration_timeout=3mins --log_dir=/var/log/mesos   --advertise_ip=10.1.10.20

Mesos Slave on WAN:

/usr/sbin/mesos-slave --master=94.141.153.57:5050 --work_dir=/var/lib/mesos/agent --containerizers=docker,mesos --executor_registration_timeout=3mins --log_dir=/var/log/mesos

I get the following matrix when I run the configuration above and successfully stop one of the slaves:

|           | LAN Slave     | Cloud Slave   |
|--------   |-----------    |-------------  |
| Python    | 'Waiting'     | 'Waiting'     |
| Docker    | 'Waiting'     | 'Waiting'     |

Here the Nginx app has not been deployed by Marathon on mesos-slave-on-LAN.

However, both slaves are visible as resources in the Mesos web UI.

How can I deploy both the 'Python' and the 'Docker' tasks on a LAN slave as well as on a WAN slave?

The Marathon logs are:

25596:[2016-12-13 15:26:52,393] INFO [/nginx-test-n2]: new app detected (mesosphere.marathon.upgrade.GroupVersioningUtil$:marathon-akka.actor.default-dispatcher-1064)
25600:  * Start(App(/nginx-test-n2, image="nginx")), instances=0)
25602:  * Scale(App(/nginx-test-n2, image="nginx")), instances=1)
25607:  * Start(App(/nginx-test-n2, image="nginx")), instances=0)
25609:  * Scale(App(/nginx-test-n2, image="nginx")), instances=1)
25611:[2016-12-13 15:26:52,400] INFO [/nginx-test-n2] storing new app version 2016-12-13T14:26:52.388Z (mesosphere.marathon.core.group.impl.GroupManagerActor:marathon-akka.actor.default-dispatcher-1028)
25612:[2016-12-13 15:26:52,417] INFO Adding health check for app [/nginx-test-n2] and version [2016-12-13T14:26:52.388Z]: [HealthCheck(Some(/),HTTP,Some(0),None,5 seconds,20 seconds,20 seconds,3,false,None)] (mesosphere.marathon.core.health.impl.MarathonHealthCheckManager:marathon-akka.actor.default-dispatcher-1073)
25613:[2016-12-13 15:26:52,417] INFO Starting app /nginx-test-n2 (mesosphere.marathon.SchedulerActions:marathon-akka.actor.default-dispatcher-1073)
25614:[2016-12-13 15:26:52,417] INFO Starting health check actor for app [/nginx-test-n2] version [2016-12-13T14:26:52.388Z] and healthCheck [HealthCheck(Some(/),HTTP,Some(0),None,5 seconds,20 seconds,20 seconds,3,false,None)] (mesosphere.marathon.core.health.impl.HealthCheckActor:marathon-akka.actor.default-dispatcher-1075)
25615:[2016-12-13 15:26:52,417] INFO Already running 0 instances of /nginx-test-n2. Not scaling. (mesosphere.marathon.SchedulerActions:marathon-akka.actor.default-dispatcher-1073)
25616:[2016-12-13 15:26:52,418] INFO Successfully started 0 instances of /nginx-test-n2 (mesosphere.marathon.upgrade.AppStartActor:marathon-akka.actor.default-dispatcher-1073)
25617:[2016-12-13 15:26:52,418] INFO Started taskLaunchActor for /nginx-test-n2 version 2016-12-13T14:26:52.388Z with initial count 1 (mesosphere.marathon.core.launchqueue.impl.TaskLauncherActor:marathon-akka.actor.default-dispatcher-1028)
25618:[2016-12-13 15:26:52,419] INFO activating matcher ActorOfferMatcher(Actor[akka://marathon/user/launchQueue/5/6-nginx-test-n2#-69306758]). (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-1071)
25627:[2016-12-13 15:26:52,425] INFO Offer [bd40f00f-ce24-4014-b1b1-82db64e68c10-O91]. Considering resources with roles {*} without resident reservation labels. Insufficient ports in offer for run spec [/nginx-test-n2] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-1073)
25628:[2016-12-13 15:26:52,425] INFO Offer [bd40f00f-ce24-4014-b1b1-82db64e68c10-O91]. Insufficient resources for [/nginx-test-n2] (need cpus=0.2, mem=32.0, disk=0.0, gpus=0, ports=(), available in offer: [id { value: "bd40f00f-ce24-4014-b1b1-82db64e68c10-O91" } framework_id { value: "40aadcc7-8e0f-4634-af46-29d9c33bc03e-0000" } slave_id { value: "bd40f00f-ce24-4014-b1b1-82db64e68c10-S1" } hostname: "myproj-slave-vm-1" resources { name: "disk" type: SCALAR scalar { value: 3985.0 } role: "*" } resources { name: "cpus" type: SCALAR scalar { value: 0.6 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 6478.0 } role: "*" } url { scheme: "http" address { hostname: "myproj-slave-vm-1" ip: "10.1.10.20" port: 5051 } path: "/slave(1)" }] (mesosphere.mesos.TaskBuilder$:marathon-akka.actor.default-dispatcher-1073)
25634:  * Start(App(/nginx-test-n2, image="nginx")), instances=0)
25636:  * Scale(App(/nginx-test-n2, image="nginx")), instances=1)
25646:[2016-12-13 15:26:57,440] INFO Offer [bd40f00f-ce24-4014-b1b1-82db64e68c10-O92]. Considering resources with roles {*} without resident reservation labels. Insufficient ports in offer for run spec [/nginx-test-n2] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-1073)
25647:[2016-12-13 15:26:57,440] INFO Offer [bd40f00f-ce24-4014-b1b1-82db64e68c10-O92]. Insufficient resources for [/nginx-test-n2] (need cpus=0.2, mem=32.0, disk=0.0, gpus=0, ports=(), available in offer: [id { value: "bd40f00f-ce24-4014-b1b1-82db64e68c10-O92" } framework_id { value: "40aadcc7-8e0f-4634-af46-29d9c33bc03e-0000" } slave_id { value: "bd40f00f-ce24-4014-b1b1-82db64e68c10-S1" } hostname: "myproj-slave-vm-1" resources { name: "disk" type: SCALAR scalar { value: 3985.0 } role: "*" } resources { name: "cpus" type: SCALAR scalar { value: 0.6 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 6478.0 } role: "*" } url { scheme: "http" address { hostname: "myproj-slave-vm-1" ip: "10.1.10.20" port: 5051 } path: "/slave(1)" }] (mesosphere.mesos.TaskBuilder$:marathon-akka.actor.default-dispatcher-1073)
25660:[2016-12-13 15:27:02,457] INFO Offer [bd40f00f-ce24-4014-b1b1-82db64e68c10-O93]. Considering resources with roles {*} without resident reservation labels. Insufficient ports in offer for run spec [/nginx-test-n2] (mesosphere.marathon.tasks.PortsMatcher:marathon-akka.actor.default-dispatcher-1028)
25661:[2016-12-13 15:27:02,457] INFO Offer [bd40f00f-ce2
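
The offers in the log above list `disk`, `cpus` and `mem`, but no `ports` resource, which may be what the "Insufficient ports in offer" message refers to. Below is a small sketch that pulls the resource names out of such a log line. `OFFER_SNIPPET` is the resources part copied from offer O91 above, and `resource_names` is an illustrative helper, not a Marathon API.

```python
import re

# Resources section copied from the offer in the Marathon log above.
OFFER_SNIPPET = (
    'resources { name: "disk" type: SCALAR scalar { value: 3985.0 } role: "*" } '
    'resources { name: "cpus" type: SCALAR scalar { value: 0.6 } role: "*" } '
    'resources { name: "mem" type: SCALAR scalar { value: 6478.0 } role: "*" }'
)


def resource_names(offer_text: str) -> list:
    """Return the resource names found in a text-format offer dump, in order."""
    return re.findall(r'resources \{ name: "([^"]+)"', offer_text)


names = resource_names(OFFER_SNIPPET)
has_ports = "ports" in names
```

For this offer, `names` contains only `disk`, `cpus` and `mem`, so `has_ports` is `False` even though CPU and memory exceed what the app needs.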

I also get this kind of log, with no deployment and no error:

31333:[2016-12-13 16:00:39,975] INFO [/nginx-test10]: new app detected (mesosphere.marathon.upgrade.GroupVersioningUtil$:marathon-akka.actor.default-dispatcher-1139)
31337:  * Start(App(/nginx-test10, image="nginx")), instances=0)
31339:  * Scale(App(/nginx-test10, image="nginx")), instances=1)
31344:  * Start(App(/nginx-test10, image="nginx")), instances=0)
31346:  * Scale(App(/nginx-test10, image="nginx")), instances=1)
31348:[2016-12-13 16:00:39,979] INFO [/nginx-test10] storing new app version 2016-12-13T15:00:39.974Z (mesosphere.marathon.core.group.impl.GroupManagerActor:marathon-akka.actor.default-dispatcher-1101)
31349:[2016-12-13 16:00:39,981] INFO Adding health check for app [/nginx-test10] and version [2016-12-13T15:00:39.974Z]: [HealthCheck(Some(/),HTTP,Some(0),None,5 seconds,20 seconds,20 seconds,3,false,None)] (mesosphere.marathon.core.health.impl.MarathonHealthCheckManager:marathon-akka.actor.default-dispatcher-1141)
31350:[2016-12-13 16:00:39,982] INFO Starting app /nginx-test10 (mesosphere.marathon.SchedulerActions:marathon-akka.actor.default-dispatcher-1141)
31351:[2016-12-13 16:00:39,982] INFO Starting health check actor for app [/nginx-test10] version [2016-12-13T15:00:39.974Z] and healthCheck [HealthCheck(Some(/),HTTP,Some(0),None,5 seconds,20 seconds,20 seconds,3,false,None)] (mesosphere.marathon.core.health.impl.HealthCheckActor:marathon-akka.actor.default-dispatcher-1139)
31352:[2016-12-13 16:00:39,982] INFO Already running 0 instances of /nginx-test10. Not scaling. (mesosphere.marathon.SchedulerActions:marathon-akka.actor.default-dispatcher-1141)
31353:[2016-12-13 16:00:39,982] INFO Successfully started 0 instances of /nginx-test10 (mesosphere.marathon.upgrade.AppStartActor:marathon-akka.actor.default-dispatcher-1141)
31354:[2016-12-13 16:00:39,983] INFO Started taskLaunchActor for /nginx-test10 version 2016-12-13T15:00:39.974Z with initial count 1 (mesosphere.marathon.core.launchqueue.impl.TaskLauncherActor:marathon-akka.actor.default-dispatcher-1110)
31355:[2016-12-13 16:00:39,983] INFO activating matcher ActorOfferMatcher(Actor[akka://marathon/user/launchQueue/6/2-nginx-test10#1135134700]). (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-1140)
31364:  * Start(App(/nginx-test10, image="nginx")), instances=0)
31366:  * Scale(App(/nginx-test10, image="nginx")), instances=1)

The resources in Mesos appear like this (only the WAN slave appears, not the LAN slave, and I don't understand why):

|         | CPUs | GPUs | Mem    | Disk    |
|---------|------|------|--------|---------|
| Total   | 1    | 0    | 244 MB | 43.6 GB |
| Used    | 0    | 0    | 0 B    | 0 B     |
| Offered | 0    | 0    | 0 B    | 0 B     |
| Idle    | 1    | 0    | 244 MB | 43.6 GB |
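
Since the LAN slave is missing here, it may help to check TCP reachability in both directions: the slaves contact the master on port 5050, and the master connects back to each slave's advertised address on port 5051 (the ports shown in the commands and the offer log in this post). A minimal sketch using only the Python standard library, where `can_connect` is an illustrative helper name:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `can_connect("94.141.153.57", 5050)` run on the LAN slave and `can_connect("10.1.10.20", 5051)` run on the master would show whether each side can reach the other's advertised address. A successful connection only proves the port is open, not that registration will succeed.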
matt
  • What's in the logs? – janisz Dec 08 '16 at 22:15
  • Hi janisz, I have added some Marathon logs to the question. It seems it could be an 'Insufficient resources' problem, but I can't figure out why the offer is not accepted, since it contains more than what is needed... – matt Dec 13 '16 at 14:36
  • How do your resources look from the Mesos side? – janisz Dec 13 '16 at 14:58
  • I added the resources to the question. Indeed, my LAN slave isn't picked up by Mesos when I use the public IP of the Mesos master (not the LAN IP)... I think this part is related to a firewall problem... – matt Dec 13 '16 at 15:03
  • I can also see this line in the logs of the Mesos slave (on the WAN): **No credentials provided. Attempting to register without authentication**, and then **Re-registered with master master@94.141.153.57:5050**. Is it mandatory to register with authentication when slaves are public? – matt Dec 13 '16 at 15:34
