I have some pods running on Kubernetes and have created some Prometheus alerts based on the metric
kube_pod_container_status_running
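The alert rules are along these lines (simplified; the alert name, threshold, and rule file layout here are only illustrative, not my exact rules):

groups:
  - name: pod-status
    rules:
      # Illustrative only: fires when a container reports it is not running for 5 minutes.
      - alert: PodContainerNotRunning
        expr: kube_pod_container_status_running == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is not running"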
After adding this metric to the scrape config, it worked perfectly fine and then just disappeared (see the screenshot). I deployed it on 3rd March and it worked fine until 6th March, when all the data simply vanished.
I have been trying to figure out why.
I restarted the pods and increased the scrape_timeout from the default 10s to 40s, but none of this helped.
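For reference, the timeout change was nothing more than this on the job shown below (the pasted config still shows the original 10s):

- job_name: kubernetes-service-endpoints
  scrape_timeout: 40s   # increased from the default 10s; made no difference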
Here is my Prometheus scrape config for this metric:
- job_name: kubernetes-service-endpoints
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: kube_pod_container_status_running
      action: keep
    - source_labels: [pod]
      target_label: kubernetes_pod_name
      action: replace
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      separator: ;
      regex: "true"
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      separator: ;
      regex: (https?)
      target_label: __scheme__
      action: replace
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      action: replace
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      separator: ;
      regex: ([^:]+)(?::\d+)?;(\d+)
      target_label: __address__
      action: replace
    - separator: ;
      regex: __meta_kubernetes_service_label_(.+)
      action: labelmap
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: kubernetes_namespace
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: kubernetes_name
      action: replace
    - source_labels: [__meta_kubernetes_pod_node_name]
      separator: ;
      regex: (.*)
      target_label: kubernetes_node
      action: replace
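For context, the first keep rule in relabel_configs means this job only scrapes endpoints whose Service is annotated with prometheus.io/scrape: "true". The metric itself is exposed by kube-state-metrics; its Service in my setup looks roughly like this (name, namespace, and port are illustrative, not copied from the cluster):

apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule above
    prometheus.io/port: "8080"     # read by the __address__ rule above
spec:
  selector:
    app.kubernetes.io/name: kube-state-metrics
  ports:
    - name: http-metrics
      port: 8080
      targetPort: http-metrics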
Now the weird thing is that I deployed the same metric, with the same scrape config, in another environment, and there it works completely fine (see the second screenshot).
I am afraid it might stop working there too. Any ideas why this is happening? My pods are running fine and have not shown any errors.
It is possible I have missed something.
To summarise what I have tried: I restarted the pods to see if the metric would come back, but it did not help. I also tried increasing the scrape_timeout, without any luck. And there are no errors in the logs of the Prometheus pods.