
I have an application that writes a Spark DataFrame to Elasticsearch, using the code below:

inputDF.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + `type`) // index name resolved from each document's date field
  .save()

These properties are set in the Spark configuration:

spark.es.batch.size.bytes=5000000
spark.es.batch.size.entries=5000
spark.es.batch.write.refresh=false
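
For clarity, the same batch settings can also be passed per write as options on the DataFrameWriter. A minimal, self-contained sketch (the sample rows and the "events" type name are placeholders, not my real data):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("es-bulk-load").getOrCreate()
import spark.implicits._

// Placeholder DataFrame; the real inputDF has a "date" column that
// elasticsearch-hadoop substitutes into the {date} resource pattern.
val inputDF = Seq(
  ("2018-01-01", "doc-1"),
  ("2018-01-01", "doc-2")
).toDF("date", "payload")

inputDF.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.batch.size.bytes", "5000000")   // flush bulk request at 5 MB
  .option("es.batch.size.entries", "5000")    // or at 5000 documents
  .option("es.batch.write.refresh", "false")  // don't refresh after each bulk
  .option("es.resource", "{date}/events")     // one index per date value
  .save()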

I run multiple standalone spark-submit instances (100-150) on several machines, all doing the same job and pushing to Elasticsearch.

On the Elasticsearch side I have one master node and 9 data nodes, each with 1 TB of disk capacity and plenty of RAM (15 GB - 64 GB).

I create the default mapping for the index with these settings:

"number_of_shards": "45",
"refresh_interval": "-1",
"number_of_replicas": "0"

Here are the stats for one particular index:

"total" : {
      "docs" : {
        "count" : 2083251258,
        "deleted" : 0
      },
      "store" : {
        "size_in_bytes" : 1814616254253,
        "throttle_time_in_millis" : 0
      },
      "indexing" : {
        "index_total" : 1703849459,
        "index_time_in_millis" : 739162810,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
}

The indexing rate works out to only ~2.3k documents per second (index_total / index_time_in_millis = 1,703,849,459 / 739,162,810 ms ≈ 2.3 docs per millisecond of indexing time). I also keep getting errors about failed entries, something like this:

Could not write all entries [99/347072] (maybe ES was overloaded?).

What can I do to insert documents faster and without these errors?
