We have in our ElasticSearch instance about 55.000.000 of documents. We have a CSV file with user_ids, the biggest CSV has 9M entries. Our documents have user_id as the key, so this is convenient.
I am posting the question because I want to discuss and have the best option to get this done, as there are different ways to address this problem. We need to add the new "label" to the document if the user document doesn't have it yet eg tagging the user with "stackoverflow" or "github".
- There is the classic partial
update
endpoint. This sounds way slow as we need to iterate over 9M of user_ids and issue the api call for each of them. - there is the
bulk
request, which provides some better performance but with limited 1000-5000 documents that can be mentioned in one call. And knowing when the batch is too large is kinda know how we need to learn on the go. - Then there is the official open issue for
/update_by_query
endpoint which has lots of traffic, but no confirmation it was implemented in the standard release. - On this open issue there is a mention for a update_by_query plugin which should provide some better handling, but there are old and open issues where users are complaining of performance problems and memory issues.
- I am not sure it it's doable on EL, but I thought I would load all the CSV entries into a separate index, and somehow would join the two indexes and apply script that would add the tag if doesn't exists yet.
So the question remains whats the best way to do this, and if some of you have done in past this, make sure you share your numbers/performance and how you would do differently this time.