2

I've been putting together a POC mesos/marathon system that I am using to launch and control docker images.

I have a Vagrant virtual machine running in VirtualBox on which I run docker, marathon, zookeeper, mesos-master and mesos-slave processes, with everything working as expected.

I decided to add Chronos into the mix and initially I started with it running as a service on the vagrant VM, but then opted to switch to running it in a docker container using the mesosphere/chronos image.

I have found that I can get container image to start and run successfully when I specify HOST network mode for the container, but when I change to BRIDGE mode then I run into problems.

In BRIDGE mode, the chronos framework registers successfully with mesos (I can see the entry on the frameworks page of the mesos UI), but it looks as though the framework itself doesn't know that the registration was successful. The mesos master log if full of messages like:

strong textI1009 09:47:35.876454  3131 master.cpp:2094] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-16d21dac-b6d6-49f9-90a3-bf1ba76b4b0d@172.17.0.59:37318
I1009 09:47:35.876832  3131 master.cpp:2164] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [  ]
I1009 09:47:35.876924  3131 master.cpp:2174] Framework 20151009-094632-16842879-5050-3113-0001 (chronos-2.4.0) at scheduler-16d21dac-b6d6-49f9-90a3-bf1ba76b4b0d@172.17.0.59:37318 already subscribed, resending acknowledgement

This implies some sort of configuration/communication issue but I have not been able to work out exactly what the root of the problem is. I'm not sure if there is any way to confirm if the acknowledgement from mesos is making it back to chronos or to check the status of the communication channels between the components.

I've done a lot of searching and I can find posts by folk who have encountered the same issue but I haven't found an detailed explanation of what needs to be done to correct it.

For example, I found the following post which mentions a problem that was resolved and which implies the user successfully ran their chronos container in bridge mode, but their description of the resolution was vague. There was also this post but the change suggested did resolve the issue that I am seeing.

Finally there was a post by someone at ILM who had what sound like exactly my problem and the resolution appeared to involve a fix to Mesos to introduce two new environment variables LIBPROCESS_ADVERTISE_IP and LIBPROCESS_ADVERTISE_PORT (on top of LIBPROCESS_IP and LIBPROCESS_PORT) but I can't find a decent explanation of what values should be assigned to any of these variables, so have yet to work out whether the change will resolve the issue I am having.

It's probably worth mentioning that I've also posted a couple of questions on the chronos-scheduler group, but I haven't had any responses to these.

If it's of any help the versions of software I'm running are as follows (the volume mount allows me to provide values of other parameters [e.g. master, zk_hosts] as files, without having to keep changing the JSON):

Vagrant:    1.7.4
VirtualBox: 5.0.2
Docker:     1.8.1
Marathon:   0.10.1
Mesos:      0.24.1
Zookeeper:  3.4.5

The JSON that I am using to launch the chronos container is as follows:

{
  "id": "chronos",
  "cpus": 1,
  "mem": 1024,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "mesosphere/chronos",
      "network": "BRIDGE",
      "portMappings": [
        {
          "containerPort": 4400,
          "hostPort": 0,
          "servicePort": 4400,
          "protocol": "tcp"
        }
      ]
    },
    "volumes": [
      {
        "containerPath": "/etc/chronos/conf",
        "hostPath": "/vagrant/vagrantShared/chronos",
        "mode": "RO"
      }
    ]
  },
  "cmd": "/usr/bin/chronos --http_port 4400",
  "ports": [
    4400
  ]
}

If anyone has any experience of using chronos in a configuration like this then I'd appreciate any help that you might be able to provide in resolving this issue.

Regards,

Paul Mateer

Community
  • 1
  • 1

2 Answers2

2

I managed to work out the answer to my problem (with a little help from the sample framework here), so I thought I should post a solution to help anyone else the runs into the same issue.

The chronos service (and also the sample framework) were configured to communicate with zookeeper on the IP associated with the docker0 interface on the host (vagrant) VM (in this case 172.17.42.1).

Zookeeper would report the master as being available on 127.0.1.1 which was the IP address of the host VM that the mesos-master process started on, but although this IP address could be pinged from the container any attempt to connect to specific ports would be refused.

The solution was to start the mesos-master with the --advertise_ip parameter and specify the IP of the docker0 interface. This meant that although the service started on the host machine it would appear as though it had been started on the docker0 ionterface.

Once this was done communications between mesos and the chronos framework started completeing and the tasks scheduled in chronos ran successfully.

0

Running Mesos 1.1.0 and Chronos 3.0.1, I was able to successfully configure Chronos in BRIDGE mode by explicitly setting LIBPROCESS_ADVERTISE_IP, LIBPROCESS_ADVERTISE_PORT and pinning its second port to a hostPort which isn't ideal but the only way I could find to make it advertise its port to Mesos properly:

{
  "id": "/core/chronos",
  "cmd": "LIBPROCESS_ADVERTISE_IP=$(getent hosts $HOST | awk '{ print $1 }') LIBPROCESS_ADVERTISE_PORT=$PORT1 /chronos/bin/start.sh --hostname $HOST --zk_hosts master-1:2181,master-2:2181,master-3:2181 --master zk://master-1:2181,master-2:2181,master-3:2181/mesos --http_credentials ${CHRONOS_USER}:${CHRONOS_PASS}",
  "cpus": 0.1,
  "mem": 1024,
  "disk": 100,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "mesosphere/chronos:v3.0.1",
      "network": "BRIDGE",
      "portMappings": [
        {
          "containerPort": 9900,
          "hostPort": 0,
          "servicePort": 0,
          "protocol": "tcp",
          "labels": {}
        },
        {
          "containerPort": 9901,
          "hostPort": 9901,
          "servicePort": 0,
          "protocol": "tcp",
          "labels": {}
        }
      ],
      "privileged": true,
      "parameters": [],
      "forcePullImage": true
    }
  },
  "env": {
    "CHRONOS_USER": "admin",
    "CHRONOS_PASS": "XXX",
    "PORT1": "9901",
    "PORT0": "9900"
  }
}
moertel
  • 1,523
  • 15
  • 18