
I'm running into a situation I can't make sense of.

The first time around, everything works fine with the installation below.

## elasticsearch, filebeat
# kubectl apply -f pv.yaml
# helm install -f values.yaml --name elasticsearch elastic/elasticsearch
# helm install --name filebeat --version 7.9.3 elastic/filebeat

curl <elasticsearch-ip>:9200 and curl <elasticsearch-ip>:9200/_cat/indices return the expected output.

But after rebooting a worker node, the pods just stay at READY 0/1 and stop working.

NAME                      READY   STATUS    RESTARTS   AGE
elasticsearch-master-0    0/1     Running   10         71m
filebeat-filebeat-67qm2   0/1     Running   4          40m

In this situation, if I remove /mnt/data/nodes and reboot again, everything works fine.
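The manual workaround, spelled out (a rough sketch; it assumes /mnt/data is the hostPath backing the PV below and that losing the stored indices is acceptable):

# on the affected worker node -- this wipes the existing Elasticsearch data
sudo rm -rf /mnt/data/nodes
sudo reboot
# after the node comes back, the pods eventually report READY 1/1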

The elasticsearch pod itself shows nothing special, as far as I can tell.

#logs
{"type": "server", "timestamp": "2020-10-26T07:49:49,708Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[filebeat-7.9.3-2020.10.26-000001][0]]]).", "cluster.uuid": "sWUAXJG9QaKyZDe0BLqwSw", "node.id": "ztb35hToRf-2Ahr7olympw"  }

#describe
  Normal   SandboxChanged          4m4s (x3 over 4m9s)   kubelet          Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  4m3s                  kubelet          Container image "docker.elastic.co/elasticsearch/elasticsearch:7.9.3" already present on machine
  Normal   Created                 4m1s                  kubelet          Created container configure-sysctl
  Normal   Started                 4m1s                  kubelet          Started container configure-sysctl
  Normal   Pulled                  3m58s                 kubelet          Container image "docker.elastic.co/elasticsearch/elasticsearch:7.9.3" already present on machine
  Normal   Created                 3m58s                 kubelet          Created container elasticsearch
  Normal   Started                 3m57s                 kubelet          Started container elasticsearch
  Warning  Unhealthy               91s (x14 over 3m42s)  kubelet          Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )

#events
6m1s        Normal    Pulled                    pod/elasticsearch-master-0                     Container image "docker.elastic.co/elasticsearch/elasticsearch:7.9.3" already present on machine
6m1s        Normal    Pulled                    pod/filebeat-filebeat-67qm2                    Container image "docker.elastic.co/beats/filebeat:7.9.3" already present on machine
5m59s       Normal    Started                   pod/elasticsearch-master-0                     Started container configure-sysctl
5m59s       Normal    Created                   pod/elasticsearch-master-0                     Created container configure-sysctl
5m59s       Normal    Created                   pod/filebeat-filebeat-67qm2                    Created container filebeat
5m58s       Normal    Started                   pod/filebeat-filebeat-67qm2                    Started container filebeat
5m56s       Normal    Created                   pod/elasticsearch-master-0                     Created container elasticsearch
5m56s       Normal    Pulled                    pod/elasticsearch-master-0                     Container image "docker.elastic.co/elasticsearch/elasticsearch:7.9.3" already present on machine
5m55s       Normal    Started                   pod/elasticsearch-master-0                     Started container elasticsearch
61s         Warning   Unhealthy                 pod/filebeat-filebeat-67qm2                    Readiness probe failed: elasticsearch: http://elasticsearch-master:9200...
  parse url... OK
  connection...
    parse host... OK
    dns lookup... OK
    addresses: 10.97.133.135
    dial up... ERROR dial tcp 10.97.133.135:9200: connect: connection refused
59s         Warning   Unhealthy                 pod/elasticsearch-master-0                     Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )
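The readiness probe is just polling the cluster health API, so the same check can be run by hand against the service shown in the filebeat probe output above (elasticsearch-master:9200; adjust if your service name differs):

# run from any pod or node that can reach the service
curl "http://elasticsearch-master:9200/_cluster/health?wait_for_status=green&timeout=1s&pretty"
# the "status" field (green/yellow/red) is what the probe is waiting on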

The /mnt/data path is chowned to 1000:1000.

And when only elasticsearch is installed, without filebeat, rebooting causes no problem.

I can't figure this out at all. :(

What am I missing?


  1. pv.yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: elastic-pv
  labels:
    type: local
    app: elastic
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  claimRef: 
    namespace: default
    name: elasticsearch-master-elasticsearch-master-0
  hostPath:
    path: "/mnt/data"
  2. values.yaml
---
clusterName: "elasticsearch"
nodeGroup: "master"

# The service that non master groups will try to connect to when joining the cluster
# This should be set to clusterName + "-" + nodeGroup for your master group
masterService: ""

# Elasticsearch roles that will be applied to this nodeGroup
# These will be set as environment variables. E.g. node.master=true
roles:
  master: "true"
  ingest: "true"
  data: "true"

replicas: 1
minimumMasterNodes: 1

esMajorVersion: ""

# Allows you to add any config files in /usr/share/elasticsearch/config/
# such as elasticsearch.yml and log4j2.properties
esConfig: {}
#  elasticsearch.yml: |
#    key:
#      nestedkey: value
#  log4j2.properties: |
#    key = value

# Extra environment variables to append to this nodeGroup
# This will be appended to the current 'env:' key. You can use any of the kubernetes env
# syntax here
extraEnvs: []
#  - name: MY_ENVIRONMENT_VAR
#    value: the_value_goes_here

# Allows you to load environment variables from kubernetes secret or config map
envFrom: []
# - secretRef:
#     name: env-secret
# - configMapRef:
#     name: config-map

# A list of secrets and their paths to mount inside the pod
# This is useful for mounting certificates for security and for mounting
# the X-Pack license
secretMounts: []
#  - name: elastic-certificates
#    secretName: elastic-certificates
#    path: /usr/share/elasticsearch/config/certs
#    defaultMode: 0755

image: "docker.elastic.co/elasticsearch/elasticsearch"
imageTag: "7.9.3"
imagePullPolicy: "IfNotPresent"

podAnnotations: {}
  # iam.amazonaws.com/role: es-cluster

# additional labels
labels: {}
esJavaOpts: "-Xmx1g -Xms1g"

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1000m"
    memory: "2Gi"

initResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

sidecarResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

networkHost: "0.0.0.0"

volumeClaimTemplate:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: local-storage
  resources:
    requests:
      storage: 5Gi

rbac:
  create: false
  serviceAccountAnnotations: {}
  serviceAccountName: ""

podSecurityPolicy:
  create: false
  name: ""
  spec:
    privileged: true
    fsGroup:
      rule: RunAsAny
    runAsUser:
      rule: RunAsAny
    seLinux:
      rule: RunAsAny
    supplementalGroups:
      rule: RunAsAny
    volumes:
      - secret
      - configMap
      - persistentVolumeClaim

persistence:
  enabled: true
  name: elastic-vc
  labels:
    # Add default labels for the volumeClaimTemplate of the StatefulSet
    app: elastic
  annotations: {}

extraVolumes: []
  # - name: extras
  #   emptyDir: {}

extraVolumeMounts: []
  # - name: extras
  #   mountPath: /usr/share/extras
  #   readOnly: true

extraContainers: []
  # - name: do-something
  #   image: busybox
  #   command: ['do', 'something']

extraInitContainers: []
  # - name: do-something
  #   image: busybox
  #   command: ['do', 'something']

# This is the PriorityClass settings as defined in
# https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass
priorityClassName: ""

# By default this will make sure two pods don't end up on the same node
# Changing this to a region would allow you to spread pods across regions
antiAffinityTopologyKey: "kubernetes.io/hostname"

# Hard means that by default pods will only be scheduled if there are enough nodes for them
# and that they will never end up on the same node. Setting this to soft will do this "best effort"
antiAffinity: "hard"

# This is the node affinity settings as defined in
# https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#node-affinity-beta-feature
nodeAffinity: {}

# The default is to deploy all pods serially. By setting this to parallel all pods are started at
# the same time when bootstrapping the cluster
podManagementPolicy: "Parallel"

# The environment variables injected by service links are not used, but can lead to slow Elasticsearch boot times when
# there are many services in the current namespace.
# If you experience slow pod startups you probably want to set this to `false`.
enableServiceLinks: true

protocol: http
httpPort: 9200
transportPort: 9300

service:
  labels: {}
  labelsHeadless: {}
  type: ClusterIP
  nodePort: ""
  annotations: {}
  httpPortName: http
  transportPortName: transport
  loadBalancerIP: ""
  loadBalancerSourceRanges: []
  externalTrafficPolicy: ""

updateStrategy: RollingUpdate

# This is the max unavailable setting for the pod disruption budget
# The default value of 1 will make sure that kubernetes won't allow more than 1
# of your pods to be unavailable during maintenance
maxUnavailable: 1

podSecurityContext:
  fsGroup: 1000
  runAsUser: 1000

securityContext:
  capabilities:
    drop:
    - ALL
  #readOnlyRootFilesystem: false
  runAsNonRoot: true
  runAsUser: 1000

# How long to wait for elasticsearch to stop gracefully
terminationGracePeriod: 120

sysctlVmMaxMapCount: 262144

readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5

# https://www.elastic.co/guide/en/elasticsearch/reference/7.9/cluster-health.html#request-params wait_for_status
clusterHealthCheckParams: "wait_for_status=green&timeout=1s"

## Use an alternate scheduler.
## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
##
schedulerName: ""

imagePullSecrets: []
nodeSelector: {}
tolerations: []
  # - effect: NoSchedule
  #   key: node-role.kubernetes.io/master

# Enabling this will publicly expose your Elasticsearch instance.
# Only enable this if you have security enabled on your cluster
ingress:
  enabled: false
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  path: /
  hosts:
    - chart-example.local
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nameOverride: ""
fullnameOverride: ""

# https://github.com/elastic/helm-charts/issues/63
masterTerminationFix: false

lifecycle: {}
  # preStop:
  #   exec:
  #     command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
  # postStart:
  #   exec:
  #     command:
  #       - bash
  #       - -c
  #       - |
  #         #!/bin/bash
  #         # Add a template to adjust number of shards/replicas
  #         TEMPLATE_NAME=my_template
  #         INDEX_PATTERN="logstash-*"
  #         SHARD_COUNT=8
  #         REPLICA_COUNT=1
  #         ES_URL=http://localhost:9200
  #         while [[ "$(curl -s -o /dev/null -w '%{http_code}\n' $ES_URL)" != "200" ]]; do sleep 1; done
  #         curl -XPUT "$ES_URL/_template/$TEMPLATE_NAME" -H 'Content-Type: application/json' -d'{"index_patterns":['\""$INDEX_PATTERN"\"'],"settings":{"number_of_shards":'$SHARD_COUNT',"number_of_replicas":'$REPLICA_COUNT'}}'

sysctlInitContainer:
  enabled: true

keystore: []

# Deprecated
# please use the above podSecurityContext.fsGroup instead
fsGroup: ""
  • Could you check if there is anything in filebeat and elasticsearch pods with `kubectl logs`? Additionally could you please add output from `kubectl describe ` of your filebeat pod? Could you check if it's gonna work if you change the [reclaim policy](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming) from `persistentVolumeReclaimPolicy: Retain` to `persistentVolumeReclaimPolicy: Recycle`? – Jakub Oct 26 '20 at 11:56
  • @Jakub Hi, thanks for your reply. The Recycle value gives the same result :( I've attached the output of kubectl describe, kubectl logs, and the events; they are from the 'Retain' PV. Files with a ready0 suffix show READY 0/1, STATUS Running after rebooting, and the others show the working state (READY 1/1, STATUS Running). https://drive.google.com/file/d/1nvopi66fXHBh3HMjokyarh-EsveK9pK2/view?usp=sharing – Klaud Yu Oct 27 '20 at 00:48
  • In the filebeat logs there is an issue with a flannel CNI, `networkPlugin cni failed to set up pod xxx network: open /run/flannel/subnet.env: no such file or directory`. Could you tell me if your flannel pod is up and running? Additionally could you please check if there is anything in the kubelet logs with `journalctl -u kubelet`? Second thing is the readiness probe, there is a workaround for that on this [github issue](https://github.com/elastic/helm-charts/issues/783#issuecomment-701037663). Could you try it and check if it's gonna work? – Jakub Oct 27 '20 at 13:03
  • The networkPlugin errors you mentioned come from the reboot itself. Following your GitHub issue link I tried clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s" and it works!! This symptom can occur with replicas: 1 and minimumMasterNodes: 1. Thank you very much @Jakub :) – Klaud Yu Oct 28 '20 at 00:31
  • Happy to help. I have posted an answer with this information. If this answer or any other one solved your issue, please mark it as accepted or upvote it as per [stackoverflow rules](https://stackoverflow.com/help/someone-answers). – Jakub Oct 28 '20 at 09:18

1 Answer


Issue

There is an issue with the elasticsearch readiness probe when running a single-replica cluster.

Warning  Unhealthy               91s (x14 over 3m42s)  kubelet          Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )

Solution

As mentioned by @adinhodovic in the elastic/helm-charts issue linked in the comments (https://github.com/elastic/helm-charts/issues/783#issuecomment-701037663):

If you're running a single-replica cluster, add the following helm value:

clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"

Your status will never go green with a single replica cluster.

The following values should work:

replicas: 1
minimumMasterNodes: 1
clusterHealthCheckParams: 'wait_for_status=yellow&timeout=1s'
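
To apply this to the release from the question, it should be enough to add the value to values.yaml and upgrade (a sketch, assuming the release is still named elasticsearch as in the original install commands):

# after adding clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s" to values.yaml
helm upgrade elasticsearch elastic/elasticsearch -f values.yaml
# the readiness probe now only waits for yellow, so the single-node cluster can become Ready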