
I have a Docker swarm running our business stack, defined in a docker-compose.yml, on two servers (nodes). The compose file starts cAdvisor on each of the two nodes like this:

  cadvisor:
    image: gcr.io/google-containers/cadvisor:latest
    command: "--logtostderr --housekeeping_interval=30s"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /:/rootfs:ro
      - /var/run:/var/run
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk:/dev/disk/:ro
    ports:
      - "9338:8080"
    deploy:
      mode: global
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 64M

On a third server I run a separate Docker instance, outside the swarm on nodes 1 and 2, which hosts Prometheus and Grafana. Prometheus is configured to scrape only node1:9338 to get the cAdvisor metrics.

Occasionally, when scraping node1:9338, not all containers running on nodes 1 and 2 show up in the cAdvisor statistics.

I assumed that cAdvisor syncs its information across the swarm, so that I could configure Prometheus to use only node1:9338 as the entry point into the swarm for scraping.

Or do I also have to add node2:9338 to my Prometheus configuration to always get the information for all nodes? If so, how does this scale, given that I would need to add each new node to the Prometheus config?

Running Prometheus together with the business stack in one swarm is not an option.

Edit: Today, when opening the cAdvisor metrics URLs http://node1:9338/metrics and http://node2:9338/metrics, I noticed some strange behaviour: both URLs showed the same information, covering all containers running on node1. The information for the containers running on node2 was missing when requesting http://node2:9338/metrics.

Could it be that Docker's internal load balancing routes the request for http://node2:9338/metrics to the cAdvisor on node1, so that node1's metrics are shown even though node2 was requested?

mr.simonski

2 Answers

1

cAdvisor looks at the container information provided by Linux on that machine; it knows nothing of Swarm. You'll want to have Prometheus scrape all of your machines.
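
As a rough illustration of scraping every machine, a Prometheus configuration on the separate monitoring server might look like the sketch below. The job name and the scrape interval are assumptions and not taken from the question; the targets are the two nodes and the published port 9338 mentioned there.

```yaml
# prometheus.yml (sketch): one static target per swarm node
scrape_configs:
  - job_name: cadvisor        # assumed name, pick any
    scrape_interval: 30s      # assumed; chosen to match --housekeeping_interval=30s
    static_configs:
      - targets:
          - node1:9338        # cAdvisor published on node 1
          - node2:9338        # cAdvisor published on node 2
```

This only yields per-node data if a request to nodeN:9338 is actually answered by the cAdvisor instance running on that node.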

brian-brazil
  • So cAdvisor isn't syncing between the nodes by itself, making the information of the whole swarm available on each node? How does this scale then? Do I have to update the Prometheus config each time a new node joins the swarm? – mr.simonski Jun 30 '20 at 11:42
  • Today I manually opened the cAdvisor /metrics URL on node1 and node2, and I was very surprised to see the metrics of node1's containers on both nodes. So the cAdvisor running on node2 appeared to be aware of the containers running on node1. Somehow there must be a connection between them. – mr.simonski Jul 13 '20 at 15:01
0

Indeed, the problem was Docker's internal load balancing in swarm mode.

As I wrote in my initial post, we added cAdvisor to our docker-compose file and deployed the stack to the swarm via

docker stack deploy --prune --with-registry-auth -c docker-compose.yml MY_STACK

The configuration of cAdvisor with

    deploy:
      mode: global

leads to one instance per node, but requesting a certain node via http://node2:9338/metrics doesn't mean you get the result of the cAdvisor running on that node. With the default port publishing, swarm's ingress routing mesh load-balances the request across all tasks of the service, so it might be answered by the cAdvisor on node1 and you won't be able to reliably scrape the real cAdvisor results from node2.

The solution that worked for me was to explicitly tell Docker to use mode: host in the ports section of cAdvisor in my docker-compose file. My final config looks like this:

  cadvisor:
    image: gcr.io/google-containers/cadvisor:latest
    command: "--logtostderr --housekeeping_interval=30s"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /:/rootfs:ro
      - /var/run:/var/run
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk:/dev/disk/:ro
    ports:
      - target: 8080
        published: 9338
        protocol: tcp
        mode: host
    deploy:
      mode: global
      resources:
        limits:
          cpus: "1"
          memory: 128M
        reservations:
          memory: 64M

Please note the changed ports section.
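
With host-mode ports each node has to be scraped individually, which brings back the scaling question from the original post. One possible way to avoid editing the main Prometheus config for every new node is Prometheus' file-based service discovery. The following is only a sketch: the job name, the file path /etc/prometheus/targets/cadvisor.yml and the refresh interval are assumptions.

```yaml
# prometheus.yml (sketch): read cAdvisor targets from a separate file
scrape_configs:
  - job_name: cadvisor                               # assumed name
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/cadvisor.yml     # hypothetical path
        refresh_interval: 1m                         # assumed; re-read the file periodically
```

```yaml
# /etc/prometheus/targets/cadvisor.yml (hypothetical path): one entry per swarm node
- targets:
    - node1:9338
    - node2:9338
```

Adding a node then means appending one line to the targets file, which Prometheus picks up without a restart; keeping that file up to date (by hand or by a provisioning script) is still up to you.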

mr.simonski