
We just found that some logs are missing from Stackdriver. We can see the log messages with kubectl logs, but for some reason some of them are never sent to Stackdriver Logging.
An example of a log entry that is missing:

{"severity":"info","time":"2021-06-07T08:19:17.598Z","caller":"zap/options.go:212","msg":"finished unary call with code OK","grpc.start_time":"2021-06-07T08:19:17Z","system":"grpc","span.kind":"server","grpc.service":"manabie.tom.ChatService","grpc.method":"SendMessage","peer.address":"127.0.0.1:32806","userID":"xxxx","x-request-id":"xxxx","grpc.code":"OK","grpc.time_ms":48.04899978637695}

Checking fluentbit daemon:

kubectl logs fluentbit-gke-xxxx -c fluentbit-gke -f --tail=1 

I see some error logs like:

W0607 08:16:55.066861       1 server.go:77] Received empty or invalid msgpack for tag kube_xxxxxxxx
W0607 08:16:59.072151       1 server.go:77] Received empty or invalid msgpack for tag kube_xxxxxxxx
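
To get a feel for how widespread these warnings are, they can be counted across every fluentbit-gke pod (a rough sketch; it assumes the DaemonSet runs in kube-system, where GKE deploys it, and uses the k8s-app=fluentbit-gke label shown in the DaemonSet below):

for pod in $(kubectl -n kube-system get pods -l k8s-app=fluentbit-gke -o name); do
  echo "$pod: $(kubectl -n kube-system logs "$pod" -c fluentbit-gke --since=1h | grep -c 'invalid msgpack') warnings in the last hour"
done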

Describe daemon set:

kubectl describe daemonset fluentbit-gke
Name:           fluentbit-gke
Selector:       component=fluentbit-gke,k8s-app=fluentbit-gke
Node-Selector:  kubernetes.io/os=linux
Labels:         addonmanager.kubernetes.io/mode=Reconcile
                k8s-app=fluentbit-gke
                kubernetes.io/cluster-service=true
Annotations:    deprecated.daemonset.template.generation: 9
Desired Number of Nodes Scheduled: 4
Current Number of Nodes Scheduled: 4
Number of Nodes Scheduled with Up-to-date Pods: 4
Number of Nodes Scheduled with Available Pods: 4
Number of Nodes Misscheduled: 0
Pods Status:  4 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           component=fluentbit-gke
                    k8s-app=fluentbit-gke
                    kubernetes.io/cluster-service=true
  Annotations:      EnableNodeJournal: false
                    EnablePodSecurityPolicy: false
                    SystemOnlyLogging: false
                    components.gke.io/component-name: fluentbit
                    components.gke.io/component-version: 1.4.4
                    monitoring.gke.io/path: /api/v1/metrics/prometheus
  Service Account:  fluentbit-gke
  Containers:
   fluentbit:
    Image:      gke.gcr.io/fluent-bit:v1.5.7-gke.1
    Port:       2020/TCP
    Host Port:  2020/TCP
    Limits:
      memory:  250Mi
    Requests:
      cpu:        50m
      memory:     100Mi
    Liveness:     http-get http://:2020/ delay=120s timeout=1s period=60s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /fluent-bit/etc/ from config-volume (rw)
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/lib/kubelet/pods from varlibkubeletpods (rw)
      /var/log from varlog (rw)
      /var/run/google-fluentbit/pos-files from varrun (rw)
   fluentbit-gke:
    Image:      gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0
    Port:       2021/TCP
    Host Port:  2021/TCP
    Command:
      /fluent-bit-gke-exporter
      --kubernetes-separator=_
      --stackdriver-resource-model=k8s
      --enable-pod-label-discovery
      --pod-label-dot-replacement=_
      --split-stdout-stderr
      --logtostderr
    Limits:
      memory:  250Mi
    Requests:
      cpu:        50m
      memory:     100Mi
    Liveness:     http-get http://:2021/healthz delay=120s timeout=1s period=60s #success=1 #failure=3
    Environment:  <none>
    Mounts:       <none>
  Volumes:
   varrun:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/google-fluentbit/pos-files
    HostPathType:  
   varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
   varlibkubeletpods:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pods
    HostPathType:  
   varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:  
   config-volume:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               fluentbit-gke-config-v1.0.6
    Optional:           false
  Priority Class Name:  system-node-critical
Events:                 <none>

1 Answer


Some of your logs may exceed the Cloud Logging API size limit.

Fluentbit-gke stores its own logs in /var/log/fluentbit.log on each node, and these logs are not exported to Cloud Logging. This directory is a hostPath volume that mounts /var/log from the host node's file system into the Pod, so the log file can also be read from the host itself. If you need these logs, fetch them from the node and keep a copy:

$ kubectl get nodes
$ gcloud compute scp <node_name>:/var/log/fluentbit.log* ./

Unlike Fluentd, Fluentbit on GKE 1.17 currently has a maximum single-log-entry size of 32 KB. User logs larger than 32 KB are dropped by fluentbit and not exported to Cloud Logging. The maximum single-entry size has been increased to 1 MB on GKE 1.18 clusters. That is the size fluentbit will ingest; however, fluentbit truncates the entry to 200 KB to leave room for the additional metadata that is added before it is written to Cloud Logging, because the Cloud Logging API has a 256 KB limit on the size of a log entry.
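
A quick way to check whether any of a pod's log lines actually come close to those limits is to measure their length as seen by kubectl (an approximation and only a sketch; the pod name is a placeholder, and the entry written to Cloud Logging will be somewhat larger because of the metadata mentioned above):

$ kubectl logs <pod_name> --tail=10000 | \
    awk 'length($0) > 32768 { over++ } length($0) > max { max = length($0) } END { print "lines over 32 KB:", over+0, "| longest line:", max+0, "bytes" }'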

  • sadly I don't have permission to check the GKE node logs (managed cluster), but I don't think this is a sizing issue, since the missing messages are fairly short (more or less like my example above), and I can find much longer messages (error logs with stack traces, for example). – nvcnvn Jun 09 '21 at 09:52