
I would like to ask for help: how can I prevent Prometheus from being killed with an Out Of Memory error when enabling Istio metrics monitoring? I use the Prometheus Operator, and metrics monitoring works fine until I create the ServiceMonitors for Istio taken from this article by Prune on Medium. From the article they are as follows:

ServiceMonitor for Data Plane:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-oper-istio-dataplane
  labels:
    monitoring: istio-dataplane
    release: prometheus
spec:
  selector:
    matchExpressions:
      - {key: istio-prometheus-ignore, operator: DoesNotExist}
  namespaceSelector:
    any: true
  jobLabel: envoy-stats
  endpoints:
  - path: /stats/prometheus
    targetPort: http-envoy-prom
    interval: 15s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_container_port_name]
      action: keep
      regex: '.*-envoy-prom'
    - action: labelmap
      regex: "__meta_kubernetes_pod_label_(.+)"
    - sourceLabels: [__meta_kubernetes_namespace]
      action: replace
      targetLabel: namespace
    - sourceLabels: [__meta_kubernetes_pod_name]
      action: replace
      targetLabel: pod_name

ServiceMonitor for Control Plane:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-oper-istio-controlplane
  labels:
    release: prometheus
spec:
  jobLabel: istio
  selector:
    matchExpressions:
      - {key: istio, operator: In, values: [mixer,pilot,galley,citadel,sidecar-injector]}
  namespaceSelector:
    any: true
  endpoints:
  - port: http-monitoring
    interval: 15s
  - port: http-policy-monitoring
    interval: 15s

After the ServiceMonitor for the Istio Data Plane is created, memory usage climbs within a minute from around 10GB to 30GB and the Prometheus replicas are killed by Kubernetes. CPU usage stays normal. How can I prevent such a huge increase in resource usage? Is there something wrong with the relabelings? Prometheus is supposed to scrape metrics from around 500 endpoints.
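
Would adding metricRelabelings to the data-plane endpoint help to cut the cardinality before ingestion? Below is only a sketch of what I mean; the regex of metrics to drop is just an example and not something I have verified:

  endpoints:
  - path: /stats/prometheus
    targetPort: http-envoy-prom
    interval: 15s
    # metricRelabelings are applied after the scrape and before ingestion,
    # so dropped series never reach the TSDB head
    metricRelabelings:
    # example only: drop the raw envoy_* internals and keep the istio_* metrics
    - sourceLabels: [__name__]
      action: drop
      regex: 'envoy_.*'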


[EDIT]

From my investigation it seems that it is the relabelings that have the greatest impact on resource usage. For example, if I change the targetLabel to pod instead of pod_name, resource usage grows immediately.

Anyway, I have not found a solution to this issue. I have also tried the semi-official ServiceMonitor and PodMonitor provided by Istio on GitHub, but that only made Prometheus run longer before being killed with Out Of Memory. Now it takes around an hour to go from ~10GB to 32GB of memory usage.

What I can see is that after enabling the Istio metrics, the number of time series grows quite fast and never stops, which in my opinion looks like a memory leak. Before enabling Istio monitoring this number was quite stable.
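
For reference, I am checking the number of series with queries roughly like the ones below (standard PromQL; the second query, which lists the largest Istio/Envoy metric names, is just an illustrative example):

  # total number of series currently in the TSDB head
  prometheus_tsdb_head_series

  # top 10 Istio/Envoy metric names by series count (example query)
  topk(10, count by (__name__) ({__name__=~"istio_.*|envoy_.*"}))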

Do you have any other suggestions?

  • what does this calculator say: https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion – Jens Baitinger Mar 12 '21 at 07:38
  • Hi! Sorry for the late response. It shows 32GB because I have set a very high interval between scrapes. Anyway, I can see that the Istio metrics have a lot of labels (around 27). Do you know if that is expected? I did not change anything in Istio or in the ServiceMonitors from the article above. – Joe Mar 16 '21 at 20:06
  • in that case you might want to give it 32 GB or try to split the load across several Prometheus instances (sharding by namespace, special tags, ...) – Jens Baitinger Mar 17 '21 at 21:52
  • @JensBaitinger it does not seem to work as the memory usage is growing and growing, together with the number of time series. I have edited the question with my latest findings. – Joe Mar 24 '21 at 10:38

0 Answers