
I have a three-node setup running Marathon, mesos-master, mesos-slave, and ZooKeeper with HA enabled. I tested a deployment of a simple hello app using mesos-execute, and it worked as expected.

Since everything looked fine, I connected to Marathon and deployed a simple app to test it: (echo "hello" >> /tmp/output.txt). However, the application gets stuck in "waiting" status.

What could be preventing Marathon from using Mesos resources for the deployment?
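For reference, a minimal app definition along these lines (the id and the trailing sleep are illustrative assumptions, not my exact payload) can be POSTed to Marathon's /v2/apps endpoint:

```json
{
  "id": "/hello-test",
  "cmd": "echo hello >> /tmp/output.txt && sleep 3600",
  "cpus": 0.1,
  "mem": 32,
  "instances": 1
}
```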

Logs from mesos-master:

I0904 11:23:27.064332 19769 master.cpp:2813] Received SUBSCRIBE call for framework 'marathon' at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324
I0904 11:23:27.064623 19769 master.cpp:2890] Subscribing framework marathon with checkpointing enabled and capabilities [ PARTITION_AWARE ]
I0904 11:23:27.064669 19769 master.cpp:6272] Updating info for framework cb16118a-2257-4020-a907-63aa6294e11b-0000
I0904 11:23:27.064697 19769 master.cpp:2994] Framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324 failed over
I0904 11:23:27.065032 19770 hierarchical.cpp:342] Activated framework cb16118a-2257-4020-a907-63aa6294e11b-0000
I0904 11:23:27.065465 19770 master.cpp:7305] Sending 3 offers to framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324
I0904 11:23:27.907865 19769 http.cpp:1115] HTTP GET for /files/read?_=1504517007920&jsonp=jQuery17109098185077823333_1504516979864&length=50000&offset=352538&path=%2Fmaster%2Flog from 192.168.40.1:53525 with User-Agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
I0904 11:23:28.916651 19768 http.cpp:1115] HTTP GET for /files/read?_=1504517008930&jsonp=jQuery17109098185077823333_1504516979865&length=50000&offset=353797&path=%2Fmaster%2Flog from 192.168.40.1:53525 with User-Agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
E0904 11:23:30.071293 19775 process.cpp:2450] Failed to shutdown socket with fd 39, address 192.168.40.159:58072: Transport endpoint is not connected
I0904 11:23:30.073277 19768 master.cpp:1430] Framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324 disconnected
I0904 11:23:30.073307 19768 master.cpp:3160] Deactivating framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324
I0904 11:23:30.073485 19768 master.cpp:3137] Disconnecting framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324
I0904 11:23:30.073496 19768 master.cpp:1445] Giving framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324 1weeks to failover
I0904 11:23:30.073519 19768 hierarchical.cpp:374] Deactivated framework cb16118a-2257-4020-a907-63aa6294e11b-0000

curl -XGET 'http://mesosphere2:8098/v2/queue?pretty' | jq

{
  "queue": [
    {
      "count": 1,
      "delay": {
        "timeLeftSeconds": 0,
        "overdue": true
      },
      "since": "2017-09-04T13:12:42.024Z",
      "processedOffersSummary": {
        "processedOffersCount": 12,
        "unusedOffersCount": 12,
        "lastUnusedOfferAt": "2017-09-04T13:14:52.554Z",
        "rejectSummaryLastOffers": [
          {
            "reason": "UnfulfilledRole",
            "declined": 3,
            "processed": 3
          },
          {
            "reason": "UnfulfilledConstraint",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "NoCorrespondingReservationFound",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientCpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientMemory",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientDisk",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientGpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientPorts",
            "declined": 0,
            "processed": 0
          }
        ],
        "rejectSummaryLaunchAttempt": [
          {
            "reason": "UnfulfilledRole",
            "declined": 12,
            "processed": 12
          },
          {
            "reason": "UnfulfilledConstraint",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "NoCorrespondingReservationFound",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientCpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientMemory",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientDisk",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientGpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientPorts",
            "declined": 0,
            "processed": 0
          }
        ]
      },
      "app": {
        "id": "/test03",
        "acceptedResourceRoles": [
          "slave_public"
        ],
        "backoffFactor": 1.15,
        "backoffSeconds": 1,
        "container": {
          "type": "DOCKER",
          "docker": {
            "forcePullImage": false,
            "image": "laghao/hello-marathon",
            "network": "BRIDGE",
            "parameters": [],
            "portMappings": [
              {
                "containerPort": 80,
                "hostPort": 80,
                "labels": {},
                "protocol": "tcp",
                "servicePort": 10003
              }
            ],
            "privileged": false
          },
          "volumes": []
        },
        "cpus": 0.1,
        "disk": 0,
        "executor": "",
        "instances": 1,
        "labels": {},
        "maxLaunchDelaySeconds": 3600,
        "mem": 64,
        "gpus": 0,
        "portDefinitions": [
          {
            "port": 10003,
            "name": "default",
            "protocol": "tcp"
          }
        ],
        "requirePorts": false,
        "upgradeStrategy": {
          "maximumOverCapacity": 1,
          "minimumHealthCapacity": 1
        },
        "version": "2017-09-04T13:12:41.993Z",
        "versionInfo": {
          "lastScalingAt": "2017-09-04T13:12:41.993Z",
          "lastConfigChangeAt": "2017-09-04T13:12:41.993Z"
        },
        "killSelection": "YOUNGEST_FIRST",
        "unreachableStrategy": {
          "inactiveAfterSeconds": 300,
          "expungeAfterSeconds": 600
        }
      }
    }
  ]
}
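To see at a glance which rejection reasons are non-zero, the queue output above can be reduced with a jq filter (a sketch; the field names follow the payload shown):

```shell
# jq filter that keeps only the rejection reasons with a non-zero
# "declined" count; on a live cluster you would feed it the output of
#   curl -s 'http://mesosphere2:8098/v2/queue'
filter='[.queue[].processedOffersSummary.rejectSummaryLastOffers[]
         | select(.declined > 0) | .reason] | join(",")'

# Demo on a trimmed sample of the payload above:
echo '{"queue":[{"processedOffersSummary":{"rejectSummaryLastOffers":[
  {"reason":"UnfulfilledRole","declined":3,"processed":3},
  {"reason":"InsufficientCpus","declined":0,"processed":0}]}}]}' |
  jq -r "$filter"
# prints: UnfulfilledRole
```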
Oussema Benlagha
  • Can you show Marathon's logs? _waiting_ means there are no resources available to meet the application's constraints. In the latest Marathon (1.4+) you can debug what resources are missing for a given deployment with the [/v2/queue endpoint](https://mesosphere.github.io/marathon/docs/generated/api.html#v2_queue_get). – janisz Sep 04 '17 at 12:31

1 Answer


From the documentation:

An app stays in "Waiting" forever: this means that Marathon does not receive "resource offers" from Mesos that would allow it to start tasks of this application. The simplest cause is that there are not sufficient resources available in the cluster, or that another framework hoards all of them. You can check the Mesos UI for available resources. Note that the required resources (such as CPU, memory, and disk) all have to be available on a single host.

If you do not find the solution yourself and you create a GitHub issue, please append the output of the Mesos /state endpoint to the bug report so that we can inspect the available cluster resources.

In your case there is a mismatch between the application's role requirement (acceptedResourceRoles) and the agents' roles. You can deduce this from the UnfulfilledRole rejections.

Marathon 1.4 introduced information about stuck deployments: you can query /v2/queue and get statistics on why offers were declined.
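As a sketch of the two usual fixes (the ZooKeeper address below is a placeholder, not from your cluster): either drop the role restriction from the app definition, or register the agents under the role the app requests.

```shell
# Option 1: remove the role restriction from the app definition,
# i.e. delete this line from the app JSON:
#     "acceptedResourceRoles": ["slave_public"],

# Option 2: start each agent with the role the app asks for
# (placeholder ZooKeeper address; adjust to your cluster):
mesos-slave --master=zk://zk1:2181/mesos --default_role=slave_public
```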

janisz
  • Well, I read that thread about the "waiting" status, but resources are available, since I can deploy through Mesos directly, so the problem is somewhere in the Mesos-Marathon communication. A thread is open in the Marathon group as well, and the /v2/queue output is posted there: https://groups.google.com/forum/#!topic/marathon-framework/r1aKkRXIXAE – Oussema Benlagha Sep 04 '17 at 13:23
  • It looks like the problem is with roles. Can you show your application JSON and agent configuration? – janisz Sep 04 '17 at 14:44
  • You are right. I changed the deployment scripts again, and you can check them in the group. Can you deploy it and give me feedback? – Oussema Benlagha Sep 04 '17 at 15:06
  • What is the question? Could you rephrase it? – janisz Sep 04 '17 at 20:53
  • I fixed the roles problem; it was `"acceptedResourceRoles": ["slave_public"],` and I erased that line, but the application is still in "waiting" status. – Oussema Benlagha Sep 05 '17 at 06:28
  • What does `/v2/queue` return now? – janisz Sep 05 '17 at 07:01
  • Can you delete the constraints? It looks like there is a problem with them: `UnfulfilledConstraint: 1` – janisz Sep 05 '17 at 11:05