
I'm trying to index a large amount of data into Elasticsearch. The data comes from a CSV file that is around 41 GB and contains around 100 million rows. I'm using the Elasticsearch Python client for this task. The code looks more or less like this:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk
import tqdm

es = Elasticsearch(
    hosts=[es_host],
    http_auth=('username', 'password'),
    timeout=120,
    max_retries=10,
    retry_on_timeout=True
)

progress = tqdm.tqdm(unit='docs')
successes = 0

logger.info(f'Indexing {file_path}')
# streaming_bulk consumes the generator lazily and sends chunks of at most 15 MB
for ok, action in streaming_bulk(
        client=es,
        max_chunk_bytes=15 * 1024 * 1024,
        actions=bulk_data_generator(index_name=index, file_path=file_path)
):
    progress.update(1)
    successes += ok
logger.info('Success!')

bulk_data_generator is a generator that reads the CSV line by line and yields the request bodies for the Elasticsearch bulk method.
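
For context, the generator is shaped roughly like this (a minimal sketch; the column handling and the plain "index" action are assumptions, since the real generator isn't shown):

import csv

def bulk_data_generator(index_name, file_path):
    # Read one row at a time so the whole 41 GB file never sits in client memory
    with open(file_path, newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            yield {
                '_op_type': 'index',   # assumption: the real code may use update/upsert actions
                '_index': index_name,
                '_source': row,
            }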

For a smaller CSV file (around 120 MB with around 100 thousand rows), the code works perfectly fine. But for the big file, I get an OutOfMemoryError. The Elasticsearch log contains some garbage collector messages like this:

{"type": "server", "timestamp": "2021-10-13T15:02:56,234Z", "level": "INFO", "component": "o.e.i.b.HierarchyCircuitBreakerService", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "attempting to trigger G1GC due to high heap usage [8391218448]",
"cluster.uuid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A"  }
{"type": "server", "timestamp": "2021-10-13T15:02:56,405Z", "level": "INFO", "component": "o.e.i.b.HierarchyCircuitBreakerService", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "GC did not bring memory usage down, before [8391218448], after [8
401704208], allocations [42], duration [171]", "cluster.uuid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A"  }
{"type": "server", "timestamp": "2021-10-13T15:02:58,158Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "[gc][16553] overhead, spent [265ms] collecting in the last [1s]", "cluster.u
uid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A"  }
{"type": "server", "timestamp": "2021-10-13T15:03:03,161Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "[gc][16558] overhead, spent [291ms] collecting in the last [1s]", "cluster.u
uid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A"  }
{"type": "server", "timestamp": "2021-10-13T15:03:04,250Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "[gc][16559] overhead, spent [346ms] collecting in the last [1s]", "cluster.u
uid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A"  }
{"type": "server", "timestamp": "2021-10-13T15:03:11,420Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "[gc][16566] overhead, spent [325ms] collecting in the last [1s]", "cluster.u
uid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A"  }
{"type": "server", "timestamp": "2021-10-13T15:03:17,432Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "[gc][16572] overhead, spent [531ms] collecting in the last [1s]", "cluster.u
uid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A"  }
{"type": "server", "timestamp": "2021-10-13T15:03:22,481Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "[gc][16577] overhead, spent [369ms] collecting in the last [1s]", "cluster.u
uid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A"  }

And then, the exception looks like this:

{"type": "server", "timestamp": "2021-10-13T15:04:21,079Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "docker-cluster", "node.name": "7476779d6cca", "message": "fatal error in thread [elasticsearch[7476779d6cca][write][T
#1]], exiting", "cluster.uuid": "xMozIZtHRCS86sXTrngOpA", "node.id": "uGpM5oqjSgKEwVHnw6mH0A" ,
"stacktrace": ["java.lang.OutOfMemoryError: Java heap space",
"at java.util.Arrays.copyOf(Arrays.java:3536) ~[?:?]",
"at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100) ~[?:?]",
"at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:130) ~[?:?]",
"at com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2137) ~[jackson-core-2.10.4.jar:2.10.4]",
"at com.fasterxml.jackson.core.json.UTF8JsonGenerator.writeString(UTF8JsonGenerator.java:506) ~[jackson-core-2.10.4.jar:2.10.4]",
"at org.elasticsearch.common.xcontent.json.JsonXContentGenerator.writeString(JsonXContentGenerator.java:271) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder.value(XContentBuilder.java:654) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder.lambda$static$14(XContentBuilder.java:95) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder$$Lambda$50/0x0000000800c2f100.write(Unknown Source) ~[?:?]",
"at org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:811) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:891) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:818) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder.value(XContentBuilder.java:920) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:820) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:891) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:866) ~[elasticsearch-x-content-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.index.IndexRequest.source(IndexRequest.java:443) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.update.UpdateHelper.prepareUpdateScriptRequest(UpdateHelper.java:233) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.update.UpdateHelper.prepare(UpdateHelper.java:82) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.update.UpdateHelper.prepare(UpdateHelper.java:63) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:220) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:158) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:203) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:109) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:74) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:172) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.14.0.jar:7.14.0]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]",
"at java.lang.Thread.run(Thread.java:831) [?:?]"] }

I'm running Elasticsearch version 7.14 in a Docker container. Here is the docker-compose.yml file:

version: "3.7"
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.14.0
    container_name: es01
    restart: always
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms8g -Xmx8g"
      - TAKE_FILE_OWNERSHIP=true
      - xpack.security.enabled=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - elastic

  kib01:
    image: docker.elastic.co/kibana/kibana:7.14.0
    container_name: kib01
    restart: always
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_URL: http://es01:9200
      ELASTICSEARCH_HOSTS: '["http://es01:9200"]'
      SERVER_PUBLICBASEURL: http://example.com:5601
      ELASTICSEARCH_USERNAME: kibana_system
      ELASTICSEARCH_PASSWORD: "${KIBANA_SYSTEM_PASSWORD}"
    networks:
      - elastic

networks:
  elastic:
    name: elastic

volumes:
  data01:

I know that one solution is to increase resources (a higher heap size, more nodes). But first I want to understand why this OutOfMemoryError happens, so that I can find the best solution.

I only send a maximum of 15 MB per request, and Elasticsearch has an 8 GB heap. What did I do wrong here?

Triet Doan
  • Have you tried sending bulks smaller than 15 MB? – Val Oct 13 '21 at 15:44
  • No, I didn't. By default, the bulk size is 100 MB, so I think 15 MB is already much smaller. – Triet Doan Oct 13 '21 at 16:19
  • It depends on your node size. 100 MB is the maximum bulk size, but that doesn't mean it's the right one for your use case, and neither is 15 MB. – Val Oct 13 '21 at 16:54
  • But why is the bulk size the problem here? Is it because a big bulk takes longer to process and things stack up? – Triet Doan Oct 13 '21 at 17:16
  • Yes, that's correct. You should decrease the bulk size to a level that can be handled by your cluster. – Val Oct 13 '21 at 17:18
  • First, I just wanted to say thank you so much for putting everything you did in your question; it makes it heaps easier to help with this level of detail! Second, definitely try reducing your `_bulk` size as Val mentioned. – warkolm Oct 13 '21 at 22:41
  • I increased the heap size to 12 GB and kept the bulk size at 5 MB with a 2-minute timeout. There is no `OutOfMemoryError` anymore, but the client gets a `ReadTimeoutError`. It seems that the server does not respond while the GC is running. I can try to increase the timeout... – Triet Doan Oct 14 '21 at 15:29
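
For reference, lowering the per-request payload as suggested in the comments would look roughly like this (a sketch only: the 5 MB limit matches the last comment, while the chunk_size and request_timeout values are assumptions):

for ok, action in streaming_bulk(
        client=es,
        chunk_size=500,                     # at most 500 documents per bulk request (assumed value)
        max_chunk_bytes=5 * 1024 * 1024,    # 5 MB per request, as tried in the comments
        request_timeout=300,                # allow slow responses while the server is busy with GC (assumed)
        actions=bulk_data_generator(index_name=index, file_path=file_path)
):
    progress.update(1)
    successes += ok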
