
I have some pods running on Kubernetes and have created some alerts in Prometheus based on the metric

kube_pod_container_status_running

After adding this to the scrape config, the metric worked perfectly fine for a few days, until it just disappeared. I deployed it on 3rd March and it worked fine until 6th March, when the data stopped entirely.

I have been trying to figure out why.
I restarted the pods and increased the scrape_timeout from 10s (the default) to 40s, but none of this helped.

Here is my Prometheus scrape config for this metric:

      - job_name: kubernetes-service-endpoints
        metric_relabel_configs:
          - source_labels: [__name__]
            regex: kube_pod_container_status_running
            action: keep
          - source_labels: [ pod ]
            target_label: kubernetes_pod_name
            action: replace
        honor_timestamps: true
        scrape_interval: 1m
        scrape_timeout: 10s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [ __meta_kubernetes_service_annotation_prometheus_io_scrape ]
            separator: ;
            regex: "true"
            replacement: $1
            action: keep
          - source_labels: [ __meta_kubernetes_service_annotation_prometheus_io_scheme ]
            separator: ;
            regex: (https?)
            target_label: __scheme__
            action: replace
          - source_labels: [ __meta_kubernetes_service_annotation_prometheus_io_path ]
            separator: ;
            regex: (.+)
            target_label: __metrics_path__
            action: replace
          - source_labels: [ __address__, __meta_kubernetes_service_annotation_prometheus_io_port ]
            separator: ;
            regex: ([^:]+)(?::\d+)?;(\d+)
            target_label: __address__
            action: replace
          - separator: ;
            regex: __meta_kubernetes_service_label_(.+)
            action: labelmap
          - source_labels: [ __meta_kubernetes_namespace ]
            separator: ;
            regex: (.*)
            target_label: kubernetes_namespace
            action: replace
          - source_labels: [ __meta_kubernetes_service_name ]
            separator: ;
            regex: (.*)
            target_label: kubernetes_name
            action: replace
          - source_labels: [ __meta_kubernetes_pod_node_name ]
            separator: ;
            regex: (.*)
            target_label: kubernetes_node
            action: replace
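
If I read the relabel rules correctly, the first `keep` rule drops any target whose Service does not carry the `prometheus.io/scrape: "true"` annotation, and as far as I know the metric itself comes from kube-state-metrics. So my assumption is that the Service in front of kube-state-metrics has to look roughly like this (names and port are from memory, not my exact manifest):

    apiVersion: v1
    kind: Service
    metadata:
      name: kube-state-metrics
      namespace: monitoring
      annotations:
        prometheus.io/scrape: "true"   # required by the first keep rule above
        prometheus.io/port: "8080"     # __address__ gets rewritten to this port
        prometheus.io/path: "/metrics" # optional, /metrics is the default anyway
    spec:
      selector:
        app.kubernetes.io/name: kube-state-metrics
      ports:
        - name: http-metrics
          port: 8080
          targetPort: 8080

If that annotation or the endpoints behind the Service disappear, the whole target is dropped by relabeling and the metric would vanish exactly like it did here.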

Now the weird thing is that I deployed the same metric, with the same scrape config, in another environment, and there it works completely fine.

I am afraid it might stop working there too. Any ideas why this is happening? My pods are running fine and have not shown any errors.

It is possible I missed something.

I restarted the pods to see if the metric would come back, but it did not. I also tried increasing the scrape_timeout, without any luck. There are no errors in the logs of the Prometheus pods.
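
In the meantime, so that I at least notice the next time this happens, I am thinking of adding an absence alert for the metric. A rough sketch of the rule (the group and alert names are placeholders I made up):

    groups:
      - name: metric-presence
        rules:
          - alert: KubePodContainerStatusRunningAbsent
            # absent() returns 1 when no series with this metric name exists at all
            expr: absent(kube_pod_container_status_running)
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: kube_pod_container_status_running has disappeared from Prometheus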

  • Attaching similar [git issue](https://github.com/prometheus/blackbox_exporter/issues/84) for reference. – Sai Chandra Gadde Mar 14 '23 at 11:40
  • I checked that issue out, and from my understanding I need to disable keep-alives. However, I do not have a blackbox exporter like the one in that issue, so I am unsure where to set disable_keep_alives: true. I only have .yaml files as configs. Can I add it there somehow? Would appreciate tips! – Anza Mar 14 '23 at 14:21
