
I have a strange problem in Docker Swarm: even though replicas is set to 1, after deploying a new version I sometimes end up with too many containers running (as if the previous container isn't killed after the new one is created). To get back to the correct state I need to rerun stack deploy. I don't know how to fix this yet, so I want to create a Prometheus alert for when it happens. I tried using an expression taken straight from my Grafana config, and I don't know why it fails with this error:

rule 4, "too_many_containers_per_service": could not parse expression: parse error at char 72: unexpected character inside braces: '\\\\'"

Edit: I made some progress and was able to run the Prometheus container without any error, but I don't get any alerts when there is more than 1 container of a service. Not sure what is wrong.

The config:

  - alert: too_many_containers_per_service
    expr: sum(rate(container_last_seen{container_label_com_docker_swarm_node_id=~"node_id"}[5m])) by (container_label_com_docker_swarm_service_name) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      description: Too many containers of {{ $labels.service_name }} are running simultaneously!
      summary: Containers duplicate alert for service '{{ $labels.service_name }}'

UPDATE:

I was able to make it run by removing the node filter (didn't need one since I run single node swarm). My config now looks like this:

  - alert: too_many_containers_per_service
    expr: count(container_last_seen) by (container_label_com_docker_swarm_service_name) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      description: Too many containers of '{{ $labels.container_label_com_docker_swarm_service_name }}' are running simultaneously!
      summary: Containers duplicate alert for service '{{ $labels.container_label_com_docker_swarm_service_name }}'

The problem I have now is that I keep getting one alert for a service with an empty name (as if it were "null"):

Too many containers of '' are running simultaneously!

What is wrong with that? It never goes away.

Pepsko

1 Answer


You don't need to escape quotes in YAML. Also, there is no variable interpolation for ${node_id}, if that's what you were trying to do.

    expr: sum(rate(container_last_seen{container_label_com_docker_swarm_node_id=~"node_id"}[5m])) by (container_label_com_docker_swarm_service_name) > 1

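
To illustrate the quoting point, here is a minimal sketch of how the same expression can be written in a rule file. The group name is hypothetical, "node_id" is the placeholder from the question rather than a real node ID, and the commented-out variants abbreviate the expression with "..." for readability. The parse error in the question is consistent with a backslash from an unquoted scalar reaching PromQL, though the original failing line is not shown, so that is an assumption.

```yaml
groups:
  - name: swarm_containers   # hypothetical group name
    rules:
      - alert: too_many_containers_per_service
        # Plain (unquoted) scalar: inner double quotes are literal, no backslashes needed.
        expr: sum(rate(container_last_seen{container_label_com_docker_swarm_node_id=~"node_id"}[5m])) by (container_label_com_docker_swarm_service_name) > 1
        # Equivalent form, wrapping the whole expression in single quotes:
        #   expr: 'sum(...{...=~"node_id"}[5m])) by (...) > 1'
        # Backslashes belong only inside a double-quoted scalar, where YAML unescapes them:
        #   expr: "sum(...{...=~\"node_id\"}[5m])) by (...) > 1"
        # Writing \" in an unquoted scalar passes the backslash through to PromQL.
        for: 2m
        labels:
          severity: warning
```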
Levi Harrison
  • I made some progress and was able to run the Prometheus container without any error, but I don't get any alerts when there is more than 1 container of a service. Not sure what is wrong. – Pepsko Nov 17 '21 at 18:49
  • Did you set the `node_id` or just keep it as "node_id"? You might just want to leave out the filter so it will alert for all nodes. – Levi Harrison Nov 20 '21 at 23:10
  • Yeah, I removed the filter as you said and it finally works! But one problem I have is that there is always an alert on the '' service (like null). I don't know how to get rid of it. I updated my question to reflect my new configuration. – Pepsko Nov 23 '21 at 20:29
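
A possible explanation for the alert on '': count(...) by (label) groups every container_last_seen series, including those that have no container_label_com_docker_swarm_service_name label at all. Assuming the metric comes from cAdvisor, that would include containers started outside the swarm, and all of them fall into one group with an empty label value. A minimal sketch, under that assumption, that excludes such series with a `!=""` matcher:

```yaml
  - alert: too_many_containers_per_service
    # The !="" matcher drops series whose swarm service label is missing or empty,
    # so only containers that belong to a swarm service are counted.
    expr: count(container_last_seen{container_label_com_docker_swarm_service_name!=""}) by (container_label_com_docker_swarm_service_name) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      description: Too many containers of '{{ $labels.container_label_com_docker_swarm_service_name }}' are running simultaneously!
      summary: Containers duplicate alert for service '{{ $labels.container_label_com_docker_swarm_service_name }}'
```

In PromQL a missing label behaves like an empty string in matchers, which is why a single `!=""` covers both cases. Running the selector without the matcher in the Prometheus expression browser should show which series carry the empty label.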