
I'm running an ELK stack on a t4g.medium instance (ARM, 4 GB RAM) on AWS. When using the official Kibana image I see odd behaviour: after approximately 4 hours of running, the CPU spikes to 50-60% and the EC2 instance becomes unreachable until it is restarted. One of the two EC2 status checks also fails. Once restarted, it runs for another 4 hours or so and then the same thing happens again. The instance is not under heavy load, and it goes down in the middle of the night when there is no load at all. I'm 99.9% sure it's Kibana causing the issue, as gagara/kibana-oss-arm64:7.6.2 ran for months without a problem. It's not specific to ARM or to Kibana 7.13 either, as I've encountered the same behaviour on x86 with older versions of Kibana. My config is:

version: '3.8'

services:

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.0
    configs:
      - source: elastic_config
        target: /usr/share/elasticsearch/config/elasticsearch.yml
    environment:
      ES_JAVA_OPTS: "-Xmx2g -Xms2g"
    networks:
      - internal
    volumes:
      - /mnt/data/elasticsearch:/usr/share/elasticsearch/data
  
    deploy:
      mode: replicated
      replicas: 1

  logstash:
    image: docker.elastic.co/logstash/logstash:7.13.0
    ports:
      - "5044:5044"
      - "9600:9600"
    configs:
      - source: logstash_config
        target: /usr/share/logstash/config/logstash.yml
      - source: logstash_pipeline
        target: /usr/share/logstash/pipeline/logstash.conf
    environment:
      LS_JAVA_OPTS: "-Xmx1g -Xms1g"
    networks:
      - internal
    deploy:
      mode: replicated
      replicas: 1

  kibana:
    image: docker.elastic.co/kibana/kibana:7.13.0
    configs:
      - source: kibana_config
        target: /usr/share/kibana/config/kibana.yml
    environment:
      NODE_OPTIONS: "--max-old-space-size=300"
    networks:
      - internal
    deploy:
      mode: replicated
      replicas: 1
      labels:
        - "traefik.enable=true"

  load-balancer:
    image: traefik:v2.2.8
    ports:
      - 5601:443
    configs:
      - source: traefik_config
        target: /etc/traefik/traefik.toml
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      restart_policy:
        condition: any
      mode: replicated
      replicas: 1
    networks:
      - internal

      
configs:
  elastic_config:
    file: ./config/elasticsearch.yml
  logstash_config:
    file: ./config/logstash/logstash.yml
  logstash_pipeline:
    file: ./config/logstash/pipeline/pipeline.conf
  kibana_config:
    file: ./config/kibana.yml
  traefik_config:
    file: ./config/traefik.toml

networks:
  internal:
    driver: overlay
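
For context, the heap settings above already account for roughly 3.3 GB of the instance's 4 GB (2g for Elasticsearch, 1g for Logstash, 300 MB of old space for Kibana), before any off-heap or OS usage. One way to test whether Kibana's memory use is involved, offered as an untested sketch rather than a confirmed fix, is to cap the Kibana service with Swarm's `deploy.resources.limits`, so that a runaway container is killed and rescheduled instead of the whole host becoming unreachable. The 512M value is an illustrative assumption, not a tuned number.

```yaml
# Untested sketch: memory cap for the kibana service only.
# Merge into the existing kibana service definition above.
services:
  kibana:
    deploy:
      mode: replicated
      replicas: 1
      resources:
        limits:
          # Assumed value: leaves headroom above --max-old-space-size=300
          memory: 512M
```

If the container is then repeatedly killed at around the 4 hour mark while the host stays reachable, that would point at Kibana's memory growth rather than at the instance itself.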

I've also disabled a number of features in kibana.yml to see if that would help:

server.name: kibana
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://elasticsearch:9200"]

xpack.monitoring.ui.enabled: false
xpack.graph.enabled: false
xpack.infra.enabled: false
xpack.canvas.enabled: false
xpack.ml.enabled: false
xpack.uptime.enabled: false
xpack.maps.enabled: false
xpack.apm.enabled: false
timelion.enabled: false

Has anyone encountered similar problems with a single-node ELK stack running on Docker?

Steve Fitzsimons
