Elasticsearch cluster on Kubernetes - nodes are not communicating

Question

I have an Elasticsearch cluster (6.3) running on Kubernetes (GKE) with the following manifest file:

---
# Source: elasticsearch/templates/manifests.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-configmap
  labels:
    app.kubernetes.io/name: "elasticsearch"
    app.kubernetes.io/component: elasticsearch-server
data:
  elasticsearch.yml: |
    cluster.name: "${CLUSTER_NAME}"
    node.name: "${NODE_NAME}"

    path.data: /usr/share/elasticsearch/data
    path.repo: ["${BACKUP_REPO_PATH}"]

    network.host: 0.0.0.0

    discovery.zen.minimum_master_nodes: 1
    discovery.zen.ping.unicast.hosts: ${DISCOVERY_SERVICE}
  log4j2.properties: |
    status = error

    appender.console.type = Console
    appender.console.name = console
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] %marker%m%n

    rootLogger.level = info
    rootLogger.appenderRef.console.ref = console
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  labels: &ElasticsearchDeploymentLabels
    app.kubernetes.io/name: "elasticsearch"
    app.kubernetes.io/component: elasticsearch-server
spec:
  selector:
    matchLabels: *ElasticsearchDeploymentLabels
  serviceName: elasticsearch-svc
  replicas: 2
  updateStrategy:
    # The procedure for updating the Elasticsearch cluster is described at
    # https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html
    type: OnDelete
  template:
    metadata:
      labels: *ElasticsearchDeploymentLabels
    spec:
      terminationGracePeriodSeconds: 180
      initContainers:
        # This init container sets the appropriate limits for mmap counts on the hosting node.
        # https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
        - name: set-max-map-count
          image: marketplace.gcr.io/google/elasticsearch/ubuntu16_04@...
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          command:
            - /bin/bash
            - -c
            - 'if [[ "$(sysctl vm.max_map_count --values)" -lt 262144 ]]; then sysctl -w vm.max_map_count=262144; fi'
      containers:
        - name: elasticsearch
          image: eu.gcr.io/projectId/elasticsearch6.3@sha256:...
          imagePullPolicy: Always
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: CLUSTER_NAME
              value: "elasticsearch-cluster"
            - name: DISCOVERY_SERVICE
              value: "elasticsearch-svc"
            - name: BACKUP_REPO_PATH
              value: ""
          ports:
            - name: prometheus
              containerPort: 9114
              protocol: TCP
            - name: http
              containerPort: 9200
            - name: tcp-transport
              containerPort: 9300
          volumeMounts:
            - name: configmap
              mountPath: /etc/elasticsearch/elasticsearch.yml
              subPath: elasticsearch.yml
            - name: configmap
              mountPath: /etc/elasticsearch/log4j2.properties
              subPath: log4j2.properties
            - name: elasticsearch-pvc
              mountPath: /usr/share/elasticsearch/data
          readinessProbe:
            httpGet:
              path: /_cluster/health?local=true
              port: 9200
            initialDelaySeconds: 5
          livenessProbe:
            exec:
              command:
                - /usr/bin/pgrep
                - -x
                - "java"
            initialDelaySeconds: 5
          resources:
            requests:
              memory: "2Gi"

        - name: prometheus-to-sd
          image: marketplace.gcr.io/google/elasticsearch/prometheus-to-sd@sha256:8e3679a6e059d1806daae335ab08b304fd1d8d35cdff457baded7306b5af9ba5
          ports:
            - name: profiler
              containerPort: 6060
          command:
            - /monitor
            - --stackdriver-prefix=custom.googleapis.com
            - --source=elasticsearch:http://localhost:9114/metrics
            - --pod-id=$(POD_NAME)
            - --namespace-id=$(POD_NAMESPACE)
            - --monitored-resource-types=k8s
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

      volumes:
        - name: configmap
          configMap:
            name: "elasticsearch-configmap"
  volumeClaimTemplates:
    - metadata:
        name: elasticsearch-pvc
        labels:
          app.kubernetes.io/name: "elasticsearch"
          app.kubernetes.io/component: elasticsearch-server
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard
        resources:
          requests:
            storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-prometheus-svc
  labels:
    app.kubernetes.io/name: elasticsearch
    app.kubernetes.io/component: elasticsearch-server
spec:
  clusterIP: None
  ports:
    - name: prometheus-port
      port: 9114
      protocol: TCP
  selector:
    app.kubernetes.io/name: elasticsearch
    app.kubernetes.io/component: elasticsearch-server
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-svc-internal
  labels:
    app.kubernetes.io/name: "elasticsearch"
    app.kubernetes.io/component: elasticsearch-server
spec:
  ports:
    - name: http
      port: 9200
    - name: tcp-transport
      port: 9300
  selector:
    app.kubernetes.io/name: "elasticsearch"
    app.kubernetes.io/component: elasticsearch-server
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: ilb-service-elastic
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
  labels:
    app: elasticsearch-svc
spec:
  type: LoadBalancer
  loadBalancerIP: some-ip-address
  selector:
    app.kubernetes.io/component: elasticsearch-server
    app.kubernetes.io/name: elasticsearch
  ports:
    - port: 9200
      protocol: TCP

This manifest was written from the template that used to be available on the GCP marketplace.

I'm encountering the following issue: the cluster is supposed to have 2 nodes, and indeed 2 pods are running. However:

- a call to ip:9200/_nodes returns just one node;
- there still seems to be a second node running that receives traffic (at least read traffic), as visible in the logs. Those requests typically fail because the requested entities don't exist on that node (only on the master node).

I can't wrap my head around the fact that this node is invisible to the master node and, at the same time, receives read traffic from the load balancer pointing to the StatefulSet.

Am I missing something subtle?

Harsh Manvar · Answer 1 · 2022-03-22T13:32:05.207

Have you checked which type each of the two nodes is?

There are master nodes and data nodes. Only one master is elected at a time; the other master-eligible nodes stay in the background, and if the elected master goes down, a new master is elected and takes over.
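For the master election described above, Elasticsearch 6.x (zen discovery) relies on `discovery.zen.minimum_master_nodes`, and the Elastic documentation recommends setting it to a quorum of the master-eligible nodes, (N / 2) + 1. With two master-eligible nodes that is 2, whereas the question's ConfigMap uses 1, which allows each pod to elect itself master if it cannot see the other one. A minimal sketch of the relevant lines of the `elasticsearch.yml` key, assuming both pods stay master-eligible (the defaults in 6.x):

```yaml
# Fragment of the elasticsearch.yml key in elasticsearch-configmap;
# the remaining settings and the log4j2.properties key stay as in the question.
# Quorum for 2 master-eligible nodes: (2 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ${DISCOVERY_SERVICE}
```

The trade-off is that with a quorum of 2 out of 2, losing either node leaves the cluster unable to elect a master, which is why three master-eligible nodes are the usual recommendation for high availability.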

I can't see any node type configuration in the StatefulSet. I would recommend checking out the Elasticsearch Helm chart to set up and deploy on GKE.
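Besides node roles, the StatefulSet also sets `serviceName: elasticsearch-svc`, and `DISCOVERY_SERVICE` (used as `discovery.zen.ping.unicast.hosts`) has the same value, but none of the Services in the manifest is named `elasticsearch-svc` (the ClusterIP one is `elasticsearch-svc-internal`). If that name does not resolve inside the pods, unicast discovery cannot reach the other node, which would be consistent with the symptoms in the question. A sketch of a headless Service with that name, assuming the pods run in the same namespace and keep the labels from the manifest:

```yaml
# Sketch: headless Service matching the StatefulSet's serviceName and
# the DISCOVERY_SERVICE value. Namespace and labels assumed from the question.
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-svc
  labels:
    app.kubernetes.io/name: "elasticsearch"
    app.kubernetes.io/component: elasticsearch-server
spec:
  clusterIP: None                 # headless: DNS returns the individual pod IPs
  publishNotReadyAddresses: true  # publish pods before they pass readiness, so they can find each other
  ports:
    - name: tcp-transport
      port: 9300
  selector:
    app.kubernetes.io/name: "elasticsearch"
    app.kubernetes.io/component: elasticsearch-server
```

A headless Service is what a StatefulSet's `serviceName` is expected to reference: it gives each pod a stable DNS identity instead of a single load-balanced virtual IP.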

Helm chart : https://github.com/elastic/helm-charts/tree/main/elasticsearch

Here is an example env config for reference:

env:
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: CLUSTER_NAME
    value: my-es
  - name: NODE_MASTER
    value: "false"
  - name: NODE_INGEST
    value: "false"
  - name: HTTP_ENABLE
    value: "false"
  - name: ES_JAVA_OPTS
    value: "-Xms256m -Xmx256m"
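Environment variables such as `NODE_MASTER` only take effect if the image or its `elasticsearch.yml` actually reads them. As a sketch, assuming the configuration is mounted from a ConfigMap as in the question, the `elasticsearch.yml` key could map the variables from the example onto Elasticsearch 6.x settings using `${...}` substitution, the same mechanism the question already uses for `${CLUSTER_NAME}`:

```yaml
# Fragment of the elasticsearch.yml key. Every ${VAR} referenced here must be
# set on the container, otherwise Elasticsearch fails to start with an
# unresolved placeholder.
node.master: ${NODE_MASTER}   # "false" in the example: not master-eligible
node.ingest: ${NODE_INGEST}   # "false": no ingest pipelines run on this node
http.enabled: ${HTTP_ENABLE}  # "false": no HTTP on port 9200 (setting removed in 7.0)
# node.data is not set in the example, so it keeps its default (true).
```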

Read more at: https://faun.pub/https-medium-com-thakur-vaibhav23-ha-es-k8s-7e655c1b7b61
