I'm using fluent-bit to collect logs and pass them to fluentd for processing, in a Kubernetes environment. The fluent-bit instances are managed by a DaemonSet and read the logs of the Docker containers. The tail input is configured as follows:
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
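The output side uses the forward protocol towards fluentd. I don't have the exact snippet at hand, but judging by the error further down it looks roughly like this (a sketch; the Match pattern is an assumption, and the Host/Port are taken from the error message):

[OUTPUT]
    Name    forward
    Match   kube.*
    # host and port as they appear in the timeout error below (assumption)
    Host    monitoring-fluent-bit-dips
    Port    24224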
There is also a fluent-bit service running:
Name:              monitoring-fluent-bit-dips
Namespace:         dips
Labels:            app.kubernetes.io/instance=monitoring
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=fluent-bit-dips
                   app.kubernetes.io/version=1.8.10
                   helm.sh/chart=fluent-bit-0.19.6
Annotations:       meta.helm.sh/release-name: monitoring
                   meta.helm.sh/release-namespace: dips
Selector:          app.kubernetes.io/instance=monitoring,app.kubernetes.io/name=fluent-bit-dips
Type:              ClusterIP
IP Families:       <none>
IP:                10.43.72.32
IPs:               <none>
Port:              http  2020/TCP
TargetPort:        http/TCP
Endpoints:         10.42.0.144:2020,10.42.1.155:2020,10.42.2.186:2020 + 1 more...
Session Affinity:  None
Events:            <none>
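Since port 2020 is fluent-bit's built-in HTTP server, one way I can dig deeper is to pull its runtime metrics and look for retries/errors on the forward output. This assumes HTTP_Server is enabled (the service existing suggests it is); the helper pod name and curl image are just for illustration:

kubectl run -n dips curl-debug --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://monitoring-fluent-bit-dips.dips.svc:2020/api/v1/metrics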
The fluentd service description is as below:
Name:              monitoring-logservice
Namespace:         dips
Labels:            app.kubernetes.io/instance=monitoring
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=logservice
                   app.kubernetes.io/version=1.9
                   helm.sh/chart=logservice-0.1.2
Annotations:       meta.helm.sh/release-name: monitoring
                   meta.helm.sh/release-namespace: dips
Selector:          app.kubernetes.io/instance=monitoring,app.kubernetes.io/name=logservice
Type:              ClusterIP
IP Families:       <none>
IP:                10.43.44.254
IPs:               <none>
Port:              http  24224/TCP
TargetPort:        http/TCP
Endpoints:         10.42.0.143:24224
Session Affinity:  None
Events:            <none>
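To rule out basic networking, I can also test DNS and the TCP path to the fluentd service from inside the cluster with a throwaway debug pod (nicolaka/netshoot is just one convenient image that ships nslookup and nc; the -vz/-w flags assume OpenBSD-style netcat):

# does the fluentd service name resolve?
kubectl run -n dips net-debug --rm -it --restart=Never --image=nicolaka/netshoot -- \
  nslookup monitoring-logservice.dips.svc.cluster.local

# can we open a TCP connection to the forward port?
kubectl run -n dips net-debug --rm -it --restart=Never --image=nicolaka/netshoot -- \
  nc -vz -w 5 monitoring-logservice.dips.svc 24224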
But the fluent-bit logs never reach fluentd, and I get the following error:
[error] [upstream] connection #81 to monitoring-fluent-bit-dips:24224 timed out after 10 seconds
I have tried several things:
- re-deploying the fluent-bit pods
- re-deploying the fluentd pod
- upgrading fluent-bit from version 1.7.3 to 1.8.10
This is a Kubernetes environment in which fluent-bit was able to communicate with fluentd at a very early stage of the deployment. Apart from that, the same fluent-bit/fluentd versions work when I deploy locally in a docker-desktop environment.
My guesses are:
- fluent-bit cannot keep up with the volume of logs it has to process (in that case, filesystem buffering, sketched below, might help)
- the fluent services are unable to communicate with each other once they have been restarted
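If the first guess is right, fluent-bit can spill buffered chunks to disk instead of holding everything in memory. A minimal sketch of that change, assuming the rest of my config stays as above (the storage.path value is arbitrary):

[SERVICE]
    # persist chunks to disk so memory pressure doesn't stall the pipeline
    storage.path    /var/log/flb-storage/

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    storage.type    filesystem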
Does anyone have experience with this, or any ideas on how to debug the issue more deeply?
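One thing I plan to try next is raising fluent-bit's log level so the upstream errors carry more detail. In the [SERVICE] section (a sketch; the HTTP_* keys reflect the monitoring service on port 2020 above):

[SERVICE]
    Log_Level    debug
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020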
Update: below is the description of the running fluentd pod.
Name:         monitoring-logservice-5b8864ffd8-gfpzc
Namespace:    dips
Priority:     0
Node:         sl-sy-k3s-01/10.16.1.99
Start Time:   Mon, 29 Nov 2021 13:09:13 +0530
Labels:       app.kubernetes.io/instance=monitoring
              app.kubernetes.io/name=logservice
              pod-template-hash=5b8864ffd8
Annotations:  kubectl.kubernetes.io/restartedAt: 2021-11-29T12:37:23+05:30
Status:       Running
IP:           10.42.0.143
IPs:
  IP:  10.42.0.143
Controlled By:  ReplicaSet/monitoring-logservice-5b8864ffd8
Containers:
  logservice:
    Container ID:   containerd://102483a7647fd2f10bead187eddf69aa4fad72051d6602dd171e1a373d4209d7
    Image:          our.private.repo/dips/logservice/splunk:1.9
    Image ID:       our.private.repo/dips/logservice/splunk@sha256:531f15f523a251b93dc8a25056f05c0c7bb428241531485a22b94896974e17e8
    Ports:          24231/TCP, 24224/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Mon, 29 Nov 2021 13:09:14 +0530
    Ready:          True
    Restart Count:  0
    Liveness:       exec [/bin/healthcheck.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec [/bin/healthcheck.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      SOME_ENV_VARS
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from monitoring-logservice-token-g9kwt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  monitoring-logservice-token-g9kwt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  monitoring-logservice-token-g9kwt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
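For completeness, the fluentd side listens for forward traffic on 24224 via the in_forward plugin; its source is configured roughly like this (a sketch from memory, port as in the service above):

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>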