I am experimenting with UPSERT statements for multiple items (around 20-30k) using python-arango and an AQL query. Overall this method seems to be much slower in Python than in NodeJS anyway, but I tried two different approaches and, surprisingly, the one I thought would be faster is much slower, and I am trying to understand why.
My initial method was to upsert every single item separately with a statement like this:
aql = "UPSERT {_key: '" + data["_key"] + "'}"
aql += " INSERT " + doc
aql += " UPDATE " + doc
aql += " IN " + col_name
self.db.aql.execute(aql)
As I said before, this was slower than a similar script in NodeJS, maybe because NodeJS works asynchronously: approx. 45 s for 27k entries (a 2.4 MB CSV dump). That is a lot of time for so little data.
Then I thought the one-by-one requests could be the problem and that it would be better to send multiple items per query to the database, like this:
aql = "FOR itm IN " + json.dumps(items)
aql += " UPSERT {_key: itm._key}"
aql += " INSERT itm"
aql += " UPDATE itm"
aql += " IN " + col_name
But actually, this was worse: it took maybe 5 minutes, the ArangoDB backend became seriously slow, and for a short time it was not even accessible.
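In case it matters, here is a variant I am considering, with the batch passed as a bind variable instead of being dumped into the query text. This is an untested sketch; my assumption is that it spares the server from parsing a huge inline JSON literal on every query, but I have not verified that this fixes the slowdown:

# Sketch: same batched UPSERT, but the documents travel as a bind
# variable (@items) rather than inside the query string itself.
# Assumption: items is a list of dicts, each containing '_key'.
aql = """
FOR itm IN @items
    UPSERT {_key: itm._key}
    INSERT itm
    UPDATE itm
    IN @@col
"""
self.db.aql.execute(aql, bind_vars={"items": items, "@col": col_name})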
I am curious: why is a single AQL statement with FOR (OK, I used batches of 2, 10, 50, and 100 items, but with no effect) so much slower than 27k separate requests? I expected the opposite. And is there a better way to perform multiple UPSERTs at once with Python?
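For example, would python-arango's import_bulk with on_duplicate="update" be the right tool here? A sketch of what I mean (untested on my data; I am assuming on_duplicate="update" gives me upsert-like semantics, i.e. existing keys get updated instead of raising an error):

# Sketch: bulk import via the collection API instead of AQL.
# Assumption: items is the full list of dicts with '_key' fields.
col = self.db.collection(col_name)
result = col.import_bulk(items, on_duplicate="update")
# result is a dict with counters such as "created" and "updated"
print(result)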
It's a Windows system with 2 cores and 16 GB RAM. Read performance is good, but write performance ...