
I have a K8s cluster in GCP (version 1.20.8-gke.900, from the regular update channel). All pods in the cluster write logs to STDOUT or STDERR from Docker containers.

A couple of weeks ago we found that some log entries are missing from the GCP logging console. I can see them via kubectl, but it looks like they don't reach the logging bucket. For example, I can hit the API in the pod with an invalid payload to trigger an error in the logs; sometimes this error reaches the logging bucket, sometimes it doesn't. Super weird to me...

Traffic and resource utilization in the cluster are very low.

As I understand it, the Fluent Bit DaemonSet is responsible for fetching logs from pods and passing them to the logging bucket. Current Fluent Bit versions: gke.gcr.io/fluent-bit:v1.5.7-gke.1 and gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0.

I don't see any errors in the Fluent Bit logs...

Could you please suggest what can be done to trace/debug/troubleshoot such a case?

Thanks!

1 Answer

It appears the issue is with the log volume. The managed GKE logging agent is guaranteed at least 100KiB/s throughput and performance can be higher depending on other node factors.

If your workloads on a GKE node are generating significantly more than 100KiB/s, then it's possible that the logs are not being collected due to the log volume.
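To check whether a node is anywhere near that threshold, you could measure the size of the log lines captured over a known time window. The following is only a rough sketch: the helper names are hypothetical, the lines are assumed to come from something like a timed `kubectl logs` capture, and the threshold is simply the 100 KiB/s figure quoted above.

```python
# Guaranteed throughput quoted in this answer: 100 KiB/s.
GUARANTEED_BYTES_PER_SEC = 100 * 1024


def log_rate_bytes_per_sec(lines, window_seconds):
    """Average bytes per second for log lines captured over window_seconds."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Count each line's UTF-8 size plus one byte for its trailing newline.
    total_bytes = sum(len(line.encode("utf-8")) + 1 for line in lines)
    return total_bytes / window_seconds


def exceeds_guarantee(lines, window_seconds):
    """True if the measured rate is above the guaranteed 100 KiB/s."""
    return log_rate_bytes_per_sec(lines, window_seconds) > GUARANTEED_BYTES_PER_SEC
```

Remember that the figure applies to everything written on the node, so the lines from all pods scheduled there need to be added together before comparing.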

If you're generating more than 100 KiB/s, there are a few workarounds:

  1. Generate fewer logs.
  2. Leave the node in question partially idle. This will allow Fluent Bit to pick up extra CPU cycles and process more logs.
  3. Run your own instance of Fluent Bit with a higher resource allocation.

The root cause of the 100 KiB/s limitation is that we give Fluent Bit only a small resource allocation, so as to leave more resources available for your workloads.

Refer to link for additional information.

Fariya Rahmat
  • Fariya, thank you a lot for the response! I tried this guide to set up a custom fluent-bit agent: https://cloud.google.com/community/tutorials/kubernetes-engine-customize-fluentbit but unfortunately, it doesn't work... In the logs I see a Stackdriver HTTP 400 response... Maybe you can suggest some other guide for a custom setup? – Viktor Kurchenko Aug 05 '21 at 16:34
  • Try setting it up using this guide: https://github.com/fluent/fluent-bit-kubernetes-logging and for the HTTP 400 response check this out: https://stackoverflow.com/questions/64524414/stackdriver-error-logging-api-error-code-400-message-missing-group-id – Fariya Rahmat Aug 06 '21 at 04:46
  • Hey Fariya, looks like I found the issue's root cause. It was caused by invalid log JSON (there were two fields with the same name). Anyway, thank you for the help! – Viktor Kurchenko Aug 16 '21 at 09:02
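
For anyone hitting the same root cause, a log line with repeated field names can be spotted before it is shipped. This is a minimal sketch, assuming each log entry is a single JSON object per line; `find_duplicate_keys` is a hypothetical helper name, and it relies only on the standard library's `object_pairs_hook` parameter of `json.loads`.

```python
import json


def find_duplicate_keys(raw_line):
    """Return the keys that appear more than once within any one JSON object."""
    duplicates = []

    def check_pairs(pairs):
        # Called once per JSON object with its (key, value) pairs in order,
        # before Python collapses repeated keys into a dict.
        seen = set()
        for key, _value in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(raw_line, object_pairs_hook=check_pairs)
    return duplicates
```

For example, `find_duplicate_keys('{"message": "a", "level": "error", "message": "b"}')` reports `message`, while the same key used in a nested object is not flagged because each object is checked on its own.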