
I'm trying to use UpdateByQuery to update a property on a large number of documents. But since each document needs a different value, I have to execute it one by one. I'm traversing a large collection of documents, and for each document I call this function:

def update_references(self, query, script_source):

    try:
        ubq = (
            UpdateByQuery(using=self.client, index=self.index)
            .update_from_dict(query)
            .script(source=script_source)
        )
        ubq.execute()

    except Exception as err:
        return False

    return True

Some example values are:

  • query = {'query': {'match': {'_id': 'VpKI1msBNuDimFsyxxm4'}}}
  • script_source = 'ctx._source.refs = [\'python\', \'java\']'

The problem is that when I do this, I get an error: "Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.max_compilations_rate] setting".

If I change the max_compilations_rate using Kibana, it has no effect:

PUT _cluster/settings
{
  "transient": {
    "script.max_compilations_rate": "1500/1m"
  }
}
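
To double-check whether the new value is actually picked up, the settings can be read back from Python as well (a sketch with the low-level client; es is assumed to be the same elasticsearch.Elasticsearch instance used elsewhere):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Read the cluster settings back, including defaults, to see which value
# of script.max_compilations_rate is actually in effect.
settings = es.cluster.get_settings(include_defaults=True, flat_settings=True)
print(settings["transient"].get("script.max_compilations_rate"))
print(settings["defaults"].get("script.max_compilations_rate"))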

Anyway, it would be better to use a parametrized script. I tried:

def update_references(self, query, script_source, script_params):

    try:
        ubq = (
            UpdateByQuery(using=self.client, index=self.index)
            .update_from_dict(query)
            .script(source=script_source, params=script_params)
        )
        ubq.execute()

    except Exception as err:
        return False

    return True

So, this time:

  • script_source = 'ctx._source.refs = params.value'
  • script_params = {'value': ['python', 'java']}

But since I have to change the query and the parameters each time, I need to create a new UpdateByQuery instance for each document in the large collection, and I end up with the same error.
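
The error message also suggests stored (indexed) scripts. Registering the parameterized script once and then referencing it by id should avoid the compilation limit, because the same compiled script is reused with different params on every request. A sketch (the id set-refs is just an illustrative name, and I'm assuming .script() passes id through to the request body the same way it passes source):

# Register the parameterized script once; it is compiled a single time.
self.client.put_script(
    id="set-refs",
    body={"script": {"lang": "painless", "source": "ctx._source.refs = params.value"}},
)

# Reference the stored script by id, with per-document params.
ubq = (
    UpdateByQuery(using=self.client, index=self.index)
    .update_from_dict(query)
    .script(id="set-refs", params={"value": ["python", "java"]})
)
ubq.execute()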

I also tried to traverse and update the large collection with:

es.update(
    index=kwargs["index"],
    doc_type="paper",
    id=paper["_id"],
    body={"doc": {
        "refs": paper["refs"]  # e.g. [\\'python\\', \\'java\\']
    }}
)

But I'm getting the following error: "Failed to establish a new connection: [Errno 99] Cannot assign requested address juil. 10 18:07:14 bib gunicorn[20891]: POST http://localhost:9200/papers/paper/OZKI1msBNuDimFsy0SM9/_update [status:N/A request:0.005s"

So, please, if you have any idea how to solve this, it would be really appreciated. Best,

gal007
  • Why not run a single UpdateByQuery with a query that matches all the documents that need to be updated instead of running one-per-ID? The UpdateByQuery-per-id strategy is going to be orders of magnitude slower and more expensive than firing off a single UpdateByQuery that hits all your documents. – rusnyder Jul 10 '19 at 16:40
  • Hi @rusnyder, thank you for your feedback. You're right, I'll try to create a script for the full collection of documents and iterate the values inside the script. I hope there is no limit on the script :) I'll let you know how it goes :) – gal007 Jul 11 '19 at 07:49
  • @rusnyder even if I process from 5 documents at a time, I got a "exceeded max allowed inline script size in bytes" :( – gal007 Jul 11 '19 at 08:49
  • So that’s close, but not _exactly_ what I was recommending. Think of the two main components of an UpdateByQuery: a `query`, which selects which documents to update, and a `script`, which is used once on _each_ document. If you’re trying to make the same set of changes to _multiple_ documents, then just modify the `query` to select all those documents, and leave the script simple and as specified in your example above. Does this make sense? – rusnyder Jul 11 '19 at 11:58
  • @rusnyder yes, but I also need to update the script, since the parameters are not the same for all the documents. If I use the full collection in the parameters, so I can reuse it, the request legth is too big and it crashes – gal007 Jul 11 '19 at 15:37
  • Got it. If you need to update the parameters per request, then I don't think UpdateByQuery is the right tool for the job. Instead, I'd iterate through the set of documents you need to update (like you're doing), generate an update request for each one, and submit them in batches using the [Bulk API](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/docs-bulk.html). A sample update operation might look like: `{"update":{"_index":"...","_id":"abc123"}}\n{"doc":{"refs":["python","java"]}}` (I'm using partial doc update, but you can use scripted update if necessary) – rusnyder Jul 11 '19 at 15:53
  • Thank you!! I'm trying it tomorrow :) – gal007 Jul 11 '19 at 17:42
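
Following rusnyder's suggestion, a minimal sketch of batching the per-document partial updates through the Bulk API with the elasticsearch.helpers.bulk helper (the index name, _type and chunk size below are assumptions to adapt):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def bulk_update_refs(papers, index="papers"):
    # One partial-document update action per paper, sent in batches
    # instead of one HTTP round trip per document.
    actions = (
        {
            "_op_type": "update",
            "_index": index,
            "_type": "paper",  # doc_type, as in the es.update() call above
            "_id": paper["_id"],
            "doc": {"refs": paper["refs"]},
        }
        for paper in papers
    )
    return bulk(es, actions, chunk_size=500)

By default helpers.bulk raises if any action fails; with raise_on_error=False it returns the errors instead, so failures can be inspected per batch.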

1 Answer


You can try it like this, with a persistent setting instead of a transient one:

PUT _cluster/settings
{
    "persistent" : {
        "script.max_compilations_rate" : "1500/1m"
    }
}

The version update is causing these errors.
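
If you prefer to apply the setting from Python rather than the Kibana console, something like this should work (a sketch with the low-level client):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Persistent settings survive a full cluster restart, unlike transient ones.
es.cluster.put_settings(
    body={"persistent": {"script.max_compilations_rate": "1500/1m"}}
)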