
I have 3 CentOS VMs. I installed ZooKeeper, Marathon, and Mesos on the master node, and only Mesos on the other 2 VMs; the master node has no mesos-slave running on it. I am trying to run Docker containers, so I specified "docker,mesos" in the containerizers file. One of the Mesos agents starts fine with this configuration and I have been able to deploy a container to that slave. However, the second Mesos agent simply fails with this configuration (it works if I remove the containerizers file, but then it doesn't run containers). Here are some of the logs and information that has come up:
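For reference, the "containerizers file" above suggests the Mesosphere file-per-flag packaging convention, where each file under /etc/mesos-slave/ is turned into an agent flag named after the file. A minimal sketch of that setup (paths assumed; adjust for your installation):

    # Assumed Mesosphere file-per-flag convention: the file name becomes the
    # flag (--containerizers) and the file contents become its value.
    echo 'docker,mesos' | sudo tee /etc/mesos-slave/containerizers

    # Restart the agent so the new flag is picked up.
    sudo systemctl restart mesos-slave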

Here are some "messages" in the log directory:

Apr 26 16:09:12 centos-minion-3 systemd: Started Mesos Slave.
Apr 26 16:09:12 centos-minion-3 systemd: Starting Mesos Slave...
WARNING: Logging before InitGoogleLogging() is written to STDERR
[main.cpp:243] Build: 2017-04-12 16:39:09 by centos
[main.cpp:244] Version: 1.2.0
[main.cpp:247] Git tag: 1.2.0
[main.cpp:251] Git SHA: de306b5786de3c221bae1457c6f2ccaeb38eef9f
[logging.cpp:194] INFO level logging started!
[systemd.cpp:238] systemd version `219` detected
[main.cpp:342] Inializing systemd state
[systemd.cpp:326] Started systemd slice `mesos_executors.slice`
[containerizer.cpp:220] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
[linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
[provisioner.cpp:249] Using default backend 'copy'
[slave.cpp:211] Mesos agent started on (1)@172.22.150.87:5051
[slave.cpp:212] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="linux" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
[slave.cpp:541] Agent resources: cpus(*):1; mem(*):919; disk(*):2043; ports(*):[31000-32000]
[slave.cpp:549] Agent attributes: [  ]
[slave.cpp:554] Agent hostname: node3
[status_update_manager.cpp:177] Pausing sending status updates
[state.cpp:62] Recovering state from '/var/lib/mesos/meta'
[state.cpp:706] No committed checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
[status_update_manager.cpp:203] Recovering status update manager
[docker.cpp:868] Recovering Docker containers
[containerizer.cpp:599] Recovering containerizer
[provisioner.cpp:410] Provisioner recovery complete
[group.cpp:340] Group process (zookeeper-group(1)@172.22.150.87:5051) connected to ZooKeeper
[group.cpp:830] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
[group.cpp:418] Trying to create path '/mesos' in ZooKeeper
[detector.cpp:152] Detected a new leader: (id='15')
[group.cpp:699] Trying to get '/mesos/json.info_0000000015' in ZooKeeper
[zookeeper.cpp:259] A new leading master (UPID=master@172.22.150.88:5050) is detected
Failed to perform recovery: Collect failed: Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1; stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
       This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
Apr 26 16:09:13 centos-minion-3 systemd: Unit mesos-slave.service entered failed state.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service failed.
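
The remedy printed in the log above translates to roughly the following commands (the work directory is taken from the --work_dir=/var/lib/mesos flag shown in the startup line); note that this only clears the agent's checkpointed recovery state and does not help while the Docker daemon itself is down:

    # Step 1: drop the 'latest' symlink so the agent does not try to recover
    # executors from the old checkpointed state.
    sudo rm -f /var/lib/mesos/meta/slaves/latest

    # Step 2: restart the agent.
    sudo systemctl restart mesos-slave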

Docker service status:

$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/docker.service.d
           └─flannel.conf
   Active: inactive (dead) since Tue 2017-04-25 18:00:03 CDT; 24h ago
     Docs: docs.docker.com
 Main PID: 872 (code=exited, status=0/SUCCESS)

Apr 26 18:25:25 centos-minion-3 systemd[1]: Dependency failed for Docker Application Container Engine.
Apr 26 18:25:25 centos-minion-3 systemd[1]: Job docker.service/start failed with result 'dependency'.
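
The Drop-In and "Dependency failed" lines point at the flannel.conf drop-in blocking docker.service. Standard systemd commands (nothing Mesos-specific) can show exactly which dependency is failing:

    # Print docker.service together with its drop-ins (including flannel.conf).
    systemctl cat docker

    # List the units docker.service depends on, to see which one is failing.
    systemctl list-dependencies docker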

Logs from flannel:

[flanneld-start: network.go:102] failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
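
Since flanneld reports that it cannot reach etcd, a quick way to narrow this down, assuming the stock CentOS flannel/etcd packaging (the unit names and paths below are the usual defaults, not taken from the question), is:

    # Where flanneld is told to find etcd (FLANNEL_ETCD / FLANNEL_ETCD_ENDPOINTS).
    cat /etc/sysconfig/flanneld

    # Check the etcd cluster itself (etcd v2 tooling), ideally from the node running etcd.
    etcdctl cluster-health

    # Check the flanneld service on the failing minion.
    sudo systemctl status flanneld
    sudo journalctl -u flanneld -e
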
  • Can you attach more logs, especially including the reason why the agent is failing? Do not attach them as images; use code-formatted text instead. – janisz Apr 26 '17 at 14:32
  • Hi janisz- I added all the logs I could find that are relevant to this problem. If there are any more files that would be helpful please let me know. Would love to hear any tips or tricks you might have. Thanks so much for the help! – kmahesh3 Apr 26 '17 at 18:40
  • Can you post the error log too? I think something is missing here. – janisz Apr 26 '17 at 18:52
  • Sorry, not sure which file you are referencing. I am currently looking at the /var/log/mesos/ folder in the failed node (I posted a list of the files in this dir in the original post). – kmahesh3 Apr 26 '17 at 20:01
  • `mesos-slave.centos-minion-3.invalid-user.log.ERROR` – janisz Apr 26 '17 at 20:21
  • The only logs in that directory are .INFO logs. Do you know where I can find .ERROR logs on the slave node? I did find some messages that I posted above. – kmahesh3 Apr 26 '17 at 21:08
  • It should be in the same directory as info log. Alternatively you can try to use journalctl to get all logs. – janisz Apr 26 '17 at 21:13
  • My mesos log directory has no subfolders with .ERROR logs, and journalctl said "No journal files were found". Is there a command I can run to start getting these error logs? – kmahesh3 Apr 26 '17 at 21:20
  • Have you tried `journalctl -u mesos-slave.service`? – janisz Apr 26 '17 at 21:26
  • I just tried that but again got no files. Any other ideas? Thanks for all the help by the way! – kmahesh3 Apr 26 '17 at 21:31
  • It looks like you already posted the interesting log. I'm sorry, I'm on mobile and it's hard to spot it. `[21625]: Failed to perform recovery:` – simply `rm -f /var/lib/mesos/meta/slaves/`. Mesos keeps its state on local disk. If you change the configuration and restart the agent, the configuration change might not be backward compatible, and this requires wiping out the previous state and starting as a new, clean agent. – janisz Apr 26 '17 at 21:46
  • Yeah, I tried that and restarted the master and both slaves, but I'm still running into the same issue. – kmahesh3 Apr 26 '17 at 21:55
  • Is the docker daemon running on the host? – janisz Apr 26 '17 at 21:58
  • I have noticed that running "sudo docker version" on the failed node gives me a message saying "Cannot connect to the Docker daemon. Is the docker daemon running on this host?". The alive slave does not give me this message. I tried following this https://github.com/docker/kitematic/issues/1010 but I got "bash: docker-machine: command not found". – kmahesh3 Apr 26 '17 at 22:02
  • What is the result of `sudo systemctl status docker`? – janisz Apr 26 '17 at 22:22
  • ● docker.service - Docker Application Container Engine Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled) Drop-In: /usr/lib/systemd/system/docker.service.d └─flannel.conf Active: inactive (dead) since Tue 2017-04-25 18:00:03 CDT; 24h ago Docs: http://docs.docker.com Main PID: 872 (code=exited, status=0/SUCCESS) Apr 26 18:25:25 centos-minion-3 systemd[1]: Dependency failed for Docker Application Container Engine. Apr 26 18:25:25 centos-minion-3 systemd[1]: Job docker.service/start failed with result 'dependency'. – kmahesh3 Apr 26 '17 at 23:35
  • Yeah, something with Docker is going wrong. I have been running through online forums and have tried reinstalling, but nothing has fixed it yet. The command above works on the alive node. – kmahesh3 Apr 26 '17 at 23:36
  • It looks like [Flannel failure prevents docker from starting](https://github.com/coreos/bugs/issues/1393). Can you check what is going on? Is communication with etcd working? – janisz Apr 27 '17 at 08:22

2 Answers


You have the answer in your logs:

Failed to perform recovery: Collect failed: 
Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1; 
stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
       This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.

Mesos keeps its state/metadata on local disk. When it is restarted, it tries to load this state. If the configuration changed and is not compatible with the previous state, the agent won't start.

Just bring Docker back to life by fixing the problems with flannel and etcd, and everything will be fine.
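
In practice that means getting flanneld healthy first, then Docker, then the agent. A rough sketch, assuming the usual CentOS unit names:

    # Once etcd is reachable again, bring the network and Docker back up
    # in dependency order, then restart the Mesos agent.
    sudo systemctl restart flanneld
    sudo systemctl restart docker
    sudo docker ps                     # should now reach the daemon

    sudo systemctl restart mesos-slave
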

– janisz

Add the following flag when starting the agent:

--reconfiguration_policy=additive

More details here: http://mesos.apache.org/documentation/latest/agent-recovery/
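
For example, with the Mesosphere file-per-flag packaging (assumed here) the flag could be supplied as shown below. Note that --reconfiguration_policy only covers recovery failures caused by a changed agent configuration, not a Docker daemon that is down, and it may not exist in older agents such as the 1.2.0 build shown in the question (check `mesos-slave --help` for your version):

    # File-per-flag convention (assumed Mesosphere packaging):
    echo 'additive' | sudo tee /etc/mesos-slave/reconfiguration_policy
    sudo systemctl restart mesos-slave

    # Or directly on the command line:
    mesos-slave --reconfiguration_policy=additive --work_dir=/var/lib/mesos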

– KETAN PATIL