
I am using the following code to write a PySpark DataFrame to Elasticsearch via AWS Glue.

df.write.format("org.elasticsearch.spark.sql") \
    .mode("overwrite") \
    .option("es.resource", "{}/_doc".format(es_index_name)) \
    .option("es.nodes", es_node_url) \
    .option("es.port", es_node_port) \
    .option("es.nodes.wan.only", "true") \
    .options(**es_conf) \
    .save()

My question is: is there a way to control how fast Glue/PySpark submits write operations to Amazon Elasticsearch (ES)? The Glue job was not able to finish because ES threw errors under the heavy write load. Currently I am trying to find the optimal number of Glue workers and the optimal ES configuration by trial and error, but I doubt that this is the most efficient way to deal with this kind of problem. Thanks in advance.
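
For context, the kind of throttling I am hoping for would look something like the sketch below. The es.batch.* options are the elasticsearch-hadoop connector's bulk-write settings (batch size, retry count, retry wait), and coalesce() simply reduces the number of tasks writing in parallel; the specific values here are placeholders I made up for illustration, not a tested configuration.

# Sketch only: fewer partitions => fewer concurrent bulk writers (4 is a placeholder value)
df_throttled = df.coalesce(4)

df_throttled.write.format("org.elasticsearch.spark.sql") \
    .mode("overwrite") \
    .option("es.resource", "{}/_doc".format(es_index_name)) \
    .option("es.nodes", es_node_url) \
    .option("es.port", es_node_port) \
    .option("es.nodes.wan.only", "true") \
    .option("es.batch.size.entries", "500") \
    .option("es.batch.size.bytes", "1mb") \
    .option("es.batch.write.retry.count", "6") \
    .option("es.batch.write.retry.wait", "30s") \
    .save()

Is tuning these connector options the right approach, or is there a Glue-side setting I am missing?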
