I have an application that writes a Spark DataFrame to Elasticsearch using the code below:
inputDF.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + indexType)
  .save()
with these properties set in the Spark configuration:
spark.es.batch.size.bytes=5000000
spark.es.batch.size.entries=5000
spark.es.batch.write.refresh=false
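For reference, the same batch settings can also be passed per write through .option() (without the spark. prefix); the es.batch.write.retry.count / es.batch.write.retry.wait lines below are the connector's retry settings with illustrative values, included only to show where that kind of tuning would go:

inputDF.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + indexType)
  .option("es.batch.size.bytes", "5000000")    // same as spark.es.batch.size.bytes
  .option("es.batch.size.entries", "5000")     // same as spark.es.batch.size.entries
  .option("es.batch.write.refresh", "false")   // same as spark.es.batch.write.refresh
  .option("es.batch.write.retry.count", "3")   // bulk retries before the task fails
  .option("es.batch.write.retry.wait", "10s")  // wait between bulk retries
  .save()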
I have multiple standalone spark-submit instances (100-150) running on several machines, all doing the same job and pushing to Elasticsearch.
On the Elasticsearch side I have one master node and 9 data nodes with 1 TB of storage capacity and enough RAM (15 GB - 64 GB).
I create the index with a default mapping and these index settings:
"number_of_shards": "45",
"refresh_interval": "-1",
"number_of_replicas": "0"
Here are the stats of one particular index:
"total" : {
"docs" : {
"count" : 2083251258,
"deleted" : 0
},
"store" : {
"size_in_bytes" : 1814616254253,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 1703849459,
"index_time_in_millis" : 739162810,
"index_current" : 0,
"index_failed" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0,
"noop_update_total" : 0,
"is_throttled" : false,
"throttle_time_in_millis" : 0
}
}
From these stats the indexing rate looks like only about 2K documents per second (index_total / index_time_in_millis ≈ 1,703,849,459 docs / 739,163 s ≈ 2,300 docs/s). I also keep getting errors about failed entries, something like this:
Could not write all entries [99/347072] (maybe ES was overloaded?).
What can I do to insert documents faster and without these errors?