
We have created a GKE cluster and we are getting errors from gke-metrics-agent. The errors show up approximately every 30 minutes, and it's always the same 62 errors.

All the errors have the label k8s-pod/k8s-app: "gke-metrics-agent".

The first error is:

error   exporterhelper/queued_retry.go:245  Exporting failed. Try enabling retry_on_failure config option.  {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."}

This error is followed by these errors, in this order:

  • "go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send"
  • "/go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:245"
  • go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
  • /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/metrics.go:120

There are approximately 40 errors like this. Two errors that stand out are:

- error exporterhelper/queued_retry.go:175  Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures.  {"kind": "exporter", "name": "googlecloud", "dropped_items": 19}

- warn  batchprocessor/batch_processor.go:184   Sender failed   {"kind": "processor", "name": "batch", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."}

I tried searching for those errors on Google but I could not find anything. I can't even find any documentation for gke-metrics-agent.

Things I tried:

  • check quotas
  • update GKE to a newer version (current version is 1.21.3-gke.2001)
  • update nodes
  • disable all firewall rules
  • give all permissions to k8s nodes

I can provide more information about our Kubernetes cluster, but I don't know what information may be important to solve this issue.

Melchy
  • **“Deadline exceeded”** is a [known issue](https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/releases/tag/v0.13.6). Starting from Kubernetes 1.16, metrics are sent to Cloud Monitoring via the GKE metrics agent, which is built on top of [OpenTelemetry](https://opentelemetry.io/). Can you share the version you are using for the OpenCensus exporter, try updating to the exporter version which increases the timeout, and let me know whether it works? – Srividya Oct 17 '21 at 06:49
  • Thanks for the response. It seems that I don't know how to update the OpenCensus exporter. I found the gke-metrics-agent pod in Kubernetes and tried to change the annotation components.gke.io/component-version: 0.6.0 to 0.13.6. This restarted the pods but the error is still present. I also tried to change monitoring to OpenTelemetry but I don't know how. Is it possible to set this using Terraform? I found only the monitoring_service setting, which is set to monitoring.googleapis.com/kubernetes by default. – Melchy Oct 17 '21 at 09:33
  • Can you check this link for the [OpenCensus](https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/releases/tag/v0.13.6) exporter update and for [OpenTelemetry](https://github.com/GoogleCloudPlatform/opentelemetry-operations-java) operations on Google Cloud? – Srividya Oct 18 '21 at 10:58
  • How did it end? I observe the same behaviour with 1.20.10-gke.301. – Maciek Leks Oct 25 '21 at 04:48
  • I still have no idea what to do. I checked the link to OpenCensus and I can see that there is a new version, but I still have no idea how to update it. Maybe I should delete the default exporter and create a custom exporter with the new version? – Melchy Oct 28 '21 at 16:33

2 Answers


“Deadline exceeded” is a known issue: metrics are sent to Cloud Monitoring via the GKE metrics agent, which is built on top of OpenTelemetry. Currently there are two workarounds to resolve the issue:

1. Update the timeout.

The new release includes a change that increases the default timeout from 5 to 12 seconds, so you might need to rebuild and redeploy the workload with the new exporter version to fix this RPC error (a sketch follows this list).

2. Use a higher GKE version.

This issue is fixed in GKE versions 1.18.6-gke.6400+, 1.19.3-gke.600+, and 1.20.0-gke.600+, which ship a fixed gke-metrics-agent.
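For the first workaround, if your own workloads embed the OpenCensus Stackdriver exporter, a minimal sketch of pinning a longer timeout (instead of relying on the library default) might look like this; the project ID is a placeholder, and the 12-second value mirrors the new default from v0.13.6:

```go
package main

import (
	"log"
	"time"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats/view"
)

func main() {
	// Sketch: create the Stackdriver (Cloud Monitoring) exporter with an
	// explicit timeout so behavior does not depend on which exporter
	// version happens to be vendored. v0.13.6 raised the default 5s -> 12s.
	exporter, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID: "my-project-id", // placeholder: your GCP project
		Timeout:   12 * time.Second,
	})
	if err != nil {
		log.Fatalf("failed to create exporter: %v", err)
	}
	defer exporter.Flush()

	// Push registered OpenCensus views to Cloud Monitoring.
	view.RegisterExporter(exporter)
}
```

Note this only helps exporters you build and deploy yourself; it does not change the exporter vendored inside gke-metrics-agent.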

Srividya

If you are still seeing those errors, please have a look at your metrics, mainly the kubernetes.io/container/... metrics for containers running on the same node as the gke-metrics-agent pod logging the errors. Do you see gaps in the metrics that shouldn't be there?
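One rough way to look for such gaps, sketched here assuming the Cloud Monitoring Go client and a placeholder project ID, is to list the points of one kubernetes.io/container metric over the last hour and eyeball the timestamps (these metrics are written every 60 seconds, so larger gaps suggest dropped exports):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	monitoringpb "cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/api/iterator"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	ctx := context.Background()
	client, err := monitoring.NewMetricClient(ctx)
	if err != nil {
		log.Fatalf("NewMetricClient: %v", err)
	}
	defer client.Close()

	now := time.Now()
	it := client.ListTimeSeries(ctx, &monitoringpb.ListTimeSeriesRequest{
		Name:   "projects/my-project-id", // placeholder: your GCP project
		Filter: `metric.type = "kubernetes.io/container/cpu/core_usage_time"`,
		Interval: &monitoringpb.TimeInterval{
			StartTime: timestamppb.New(now.Add(-1 * time.Hour)),
			EndTime:   timestamppb.New(now),
		},
	})
	for {
		ts, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("ListTimeSeries: %v", err)
		}
		// Print point timestamps; missing 60s intervals are the gaps
		// to look for.
		for _, p := range ts.GetPoints() {
			fmt.Println(p.GetInterval().GetEndTime().AsTime())
		}
	}
}
```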

The deadline exceeded errors can happen once in a while, but should not appear in huge amounts. They may be caused by networking issues or just occasional blips. Do you have any network policies or firewall rules that may prevent gke-metrics-agent from talking to Cloud Monitoring?
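If you suspect that, one crude check (a throwaway sketch, not an official diagnostic) is to run a pod on the affected node and verify it can complete a TLS handshake with the Cloud Monitoring endpoint:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	// Crude probe: can this pod open a TLS connection to the endpoint
	// gke-metrics-agent exports to? A timeout here points at network
	// policy or firewall problems rather than agent bugs.
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", "monitoring.googleapis.com:443", nil)
	if err != nil {
		fmt.Println("blocked or unreachable:", err)
		return
	}
	defer conn.Close()
	fmt.Println("monitoring.googleapis.com:443 is reachable")
}
```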

Sadly, you can't update the OpenTelemetry collector inside gke-metrics-agent yourself. A newer cluster version helps here because it also updates the agent, so try upgrading your cluster if possible. If the issue impacts your metrics, reach out to support.

  • Hi, thanks for the response. I don't see the errors anymore. After updating the k8s cluster and waiting for about a week, the errors suddenly disappeared. I have no idea why. – Melchy Mar 07 '22 at 07:44
  • Then you might have received a new version of gke-metrics-agent with a fix. – kwiesmueller Mar 08 '22 at 13:43