
I have an application that writes a Spark DataFrame to Elasticsearch, using the code below:

inputDF.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + `type`) // index name resolved from each document's date field
  .save()

These properties are set in the Spark configuration:

spark.es.batch.size.bytes=5000000
spark.es.batch.size.entries=5000
spark.es.batch.write.refresh=false
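
For clarity, the same batch settings can also be passed per write as options on the DataFrameWriter. A minimal, self-contained sketch (the sample rows and the "events" type name are placeholders, not my real data):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("es-bulk-load").getOrCreate()
import spark.implicits._

// Placeholder DataFrame; the real inputDF has a "date" column that
// elasticsearch-hadoop substitutes into the {date} resource pattern.
val inputDF = Seq(
  ("2018-01-01", "doc-1"),
  ("2018-01-01", "doc-2")
).toDF("date", "payload")

inputDF.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.batch.size.bytes", "5000000")   // flush bulk request at 5 MB
  .option("es.batch.size.entries", "5000")    // or at 5000 documents
  .option("es.batch.write.refresh", "false")  // don't refresh after each bulk
  .option("es.resource", "{date}/events")     // one index per date value
  .save()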

I run multiple standalone spark-submit instances (100-150) on several machines, all doing the same job and pushing to Elasticsearch.

On the Elasticsearch side I have one master node and 9 data nodes, each with 1 TB of disk capacity and plenty of RAM (15 GB - 64 GB).

I create the default mapping for the index with these settings:

"number_of_shards": "45",
"refresh_interval": "-1",
"number_of_replicas": "0"

Here are the stats for one particular index:

"total" : {
      "docs" : {
        "count" : 2083251258,
        "deleted" : 0
      },
      "store" : {
        "size_in_bytes" : 1814616254253,
        "throttle_time_in_millis" : 0
      },
      "indexing" : {
        "index_total" : 1703849459,
        "index_time_in_millis" : 739162810,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
}

The indexing rate works out to only ~2.3k documents per second (index_total / index_time_in_millis = 1,703,849,459 / 739,162,810 ms ≈ 2.3 docs per millisecond of indexing time). I also keep getting errors about failed entries, something like this:

Could not write all entries [99/347072] (maybe ES was overloaded?).

What can I do to insert documents faster and without these errors?
