
I'm having trouble running an app in Marathon using persistent local volumes. I've followed the instructions, starting Marathon with a role and principal and creating a simple app with a persistent volume, but the deployment just hangs in a suspended state. It seems that the slave has responded with a valid offer but can't actually start the app. The slave doesn't log anything about the task, even when I build with the debug option and turn logging right up with GLOG_v=2.
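For context, here's roughly how Marathon is being started; the master address, role name, and secret file path below are placeholders rather than my exact values:

# Sketch of the Marathon 1.1.1 invocation (placeholder values)
marathon \
  --master zk://zk-host:2181/mesos \
  --mesos_role some_role \
  --mesos_authentication_principal marathon \
  --mesos_authentication_secret_file /etc/marathon/secret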

It also seems that Marathon is constantly rolling the task ID as the task fails to start, but I can't see why anywhere.

Oddly, when I run the app without a persistent volume but with a disk reservation, it starts running fine.

The debug logging on Marathon doesn't appear to show anything useful, though I could be missing something. Could anyone give me any pointers as to what the problem might be, or where to look for additional debug output? Many thanks in advance.
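In case it helps, these are the sort of endpoints I've been checking for extra debug output (hostnames are placeholders; output is piped through python -m json.tool purely for readability):

# Marathon launch queue -- shows whether the app is sitting in the queue waiting on offers
curl -s http://marathon-host:8080/v2/queue | python -m json.tool

# Current deployments -- the stuck deployment for /test should show up here
curl -s http://marathon-host:8080/v2/deployments | python -m json.tool

# Mesos master state -- agents and their reserved resources should be listed under "slaves"
curl -s http://mesos-master:5050/state.json | python -m json.tool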

Here's some information about my environment along with the debug output:

Slave: Ubuntu 14.04 running the prebuilt Mesos 0.28 (also tested with 0.29 built from source)

Master: Mesos 0.28 running inside a Docker Ubuntu 14.04 image on CoreOS

Marathon: 1.1.1 running inside a Docker Ubuntu 14.04 image on CoreOS

App with persistent storage

App info from v2/apps/test/tasks on Marathon

{
  "app": {
    "id": "/test",
    "cmd": "while true; do sleep 10; done",
    "args": null,
    "user": null,
    "env": {},
    "instances": 1,
    "cpus": 1,
    "mem": 128,
    "disk": 0,
    "executor": "",
    "constraints": [
      [
        "role",
        "CLUSTER",
        "persistent"
      ]
    ],
    "uris": [],
    "fetch": [],
    "storeUrls": [],
    "ports": [
      10002
    ],
    "portDefinitions": [
      {
        "port": 10002,
        "protocol": "tcp",
        "labels": {}
      }
    ],
    "requirePorts": false,
    "backoffSeconds": 1,
    "backoffFactor": 1.15,
    "maxLaunchDelaySeconds": 3600,
    "container": {
      "type": "MESOS",
      "volumes": [
        {
          "containerPath": "test",
          "mode": "RW",
          "persistent": {
            "size": 100
          }
        }
      ]
    },
    "healthChecks": [],
    "readinessChecks": [],
    "dependencies": [],
    "upgradeStrategy": {
      "minimumHealthCapacity": 0.5,
      "maximumOverCapacity": 0
    },
    "labels": {},
    "acceptedResourceRoles": null,
    "ipAddress": null,
    "version": "2016-05-19T11:31:54.861Z",
    "residency": {
      "relaunchEscalationTimeoutSeconds": 3600,
      "taskLostBehavior": "WAIT_FOREVER"
    },
    "versionInfo": {
      "lastScalingAt": "2016-05-19T11:31:54.861Z",
      "lastConfigChangeAt": "2016-05-18T16:46:59.684Z"
    },
    "tasksStaged": 0,
    "tasksRunning": 0,
    "tasksHealthy": 0,
    "tasksUnhealthy": 0,
    "deployments": [
      {
        "id": "4f3779e5-a805-4b95-9065-f3cf9c90c8fe"
      }
    ],
    "tasks": [
      {
        "id": "test.4b7d4303-1dc2-11e6-a179-a2bd870b1e9c",
        "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S17",
        "host": "ip-10-0-90-61.eu-west-1.compute.internal",
        "localVolumes": [
          {
            "containerPath": "test",
            "persistenceId": "test#test#4b7d4302-1dc2-11e6-a179-a2bd870b1e9c"
          }
        ],
        "appId": "/test"
      }
    ]
  }
}
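If it's relevant, my understanding is that the dynamically created volume should appear on the agent under its work_dir, roughly at <work_dir>/volumes/roles/<role>/<persistence_id>. The command below uses placeholder values for the work_dir and role:

# Check whether Mesos has actually created the persistent volume on the agent
# (work_dir and role are placeholders -- adjust to your agent's configuration)
ls -lR /var/lib/mesos/volumes/roles/some_role/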

App status in the Marathon UI (the deployment appears to be stuck spinning):

Stuck at "Waiting" in the instance info (screenshot)


App without persistent storage

App info from v2/apps/test2/tasks on Marathon

{
  "app": {
    "id": "/test2",
    "cmd": "while true; do sleep 10; done",
    "args": null,
    "user": null,
    "env": {},
    "instances": 1,
    "cpus": 1,
    "mem": 128,
    "disk": 100,
    "executor": "",
    "constraints": [
      [
        "role",
        "CLUSTER",
        "persistent"
      ]
    ],
    "uris": [],
    "fetch": [],
    "storeUrls": [],
    "ports": [
      10002
    ],
    "portDefinitions": [
      {
        "port": 10002,
        "protocol": "tcp",
        "labels": {}
      }
    ],
    "requirePorts": false,
    "backoffSeconds": 1,
    "backoffFactor": 1.15,
    "maxLaunchDelaySeconds": 3600,
    "container": null,
    "healthChecks": [],
    "readinessChecks": [],
    "dependencies": [],
    "upgradeStrategy": {
      "minimumHealthCapacity": 0.5,
      "maximumOverCapacity": 0
    },
    "labels": {},
    "acceptedResourceRoles": null,
    "ipAddress": null,
    "version": "2016-05-19T13:44:01.831Z",
    "residency": null,
    "versionInfo": {
      "lastScalingAt": "2016-05-19T13:44:01.831Z",
      "lastConfigChangeAt": "2016-05-19T13:09:20.106Z"
    },
    "tasksStaged": 0,
    "tasksRunning": 1,
    "tasksHealthy": 0,
    "tasksUnhealthy": 0,
    "deployments": [],
    "tasks": [
      {
        "id": "test2.bee624f1-1dc7-11e6-b98e-568f3f9dead8",
        "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S18",
        "host": "ip-10-0-90-61.eu-west-1.compute.internal",
        "startedAt": "2016-05-19T13:44:02.190Z",
        "stagedAt": "2016-05-19T13:44:02.023Z",
        "ports": [
          31926
        ],
        "version": "2016-05-19T13:44:01.831Z",
        "ipAddresses": [
          {
            "ipAddress": "10.0.90.61",
            "protocol": "IPv4"
          }
        ],
        "appId": "/test2"
      }
    ],
    "lastTaskFailure": {
      "appId": "/test2",
      "host": "ip-10-0-90-61.eu-west-1.compute.internal",
      "message": "Slave ip-10-0-90-61.eu-west-1.compute.internal removed: health check timed out",
      "state": "TASK_LOST",
      "taskId": "test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c",
      "timestamp": "2016-05-19T13:15:24.155Z",
      "version": "2016-05-19T13:09:20.106Z",
      "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S17"
    }
  }
}

Slave log when running the app without a persistent volume (test2):

I0519 13:09:22.471876 12459 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.471906 12459 status_update_manager.cpp:497] Creating StatusUpdate stream for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.472262 12459 status_update_manager.cpp:824] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.477686 12459 status_update_manager.cpp:374] Forwarding update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to the agent
I0519 13:09:22.477830 12453 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.477814016+00:00
I0519 13:09:22.477967 12453 slave.cpp:3638] Forwarding the update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to master@10.0.82.230:5050
I0519 13:09:22.478185 12453 slave.cpp:3532] Status update manager successfully handled status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.478229 12453 slave.cpp:3548] Sending acknowledgement for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to executor(1)@10.0.90.61:34262
I0519 13:09:22.488315 12460 pid.cpp:95] Attempting to parse 'master@10.0.82.230:5050' into a PID
I0519 13:09:22.488370 12460 process.cpp:646] Parsed message name 'mesos.internal.StatusUpdateAcknowledgementMessage' for slave(1)@10.0.90.61:5051 from master@10.0.82.230:5050
I0519 13:09:22.488452 12452 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.488441856+00:00
I0519 13:09:22.488600 12458 process.cpp:2605] Resuming (14)@10.0.90.61:5051 at 2016-05-19 13:09:22.488590080+00:00
I0519 13:09:22.488632 12458 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.488726 12458 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.492985 12452 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.492974080+00:00
I0519 13:09:22.493021 12452 slave.cpp:2629] Status update manager successfully handled status update acknowledgement (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000

1 Answer


It may be due to low disk space or RAM. The minimum idle configuration is specified in the link below.
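As a quick first check on the agent (the work_dir path here is just an example):

# Verify free disk space and memory on the agent
df -h /var/lib/mesos   # or wherever the agent's work_dir lives
free -m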