6

Our Elasticsearch instance holds about 55,000,000 documents. We have CSV files of user_ids; the biggest CSV has 9M entries. Our documents use user_id as their key, which is convenient.

I am posting this question because I want to discuss the different ways to address this problem and settle on the best one. We need to add a new "label" to a document if the user document doesn't have it yet, e.g., tagging the user with "stackoverflow" or "github".

  1. There is the classic partial-update endpoint. This sounds very slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
  2. There is the bulk request, which gives better performance but is limited to roughly 1,000-5,000 documents per call, and knowing when a batch is too large is something we would have to learn as we go (see the sketch after this list).
  3. There is the official open issue for an /update_by_query endpoint, which gets lots of traffic, but no confirmation that it was ever implemented in a standard release.
  4. That issue mentions an update_by_query plugin, which should handle this better, but there are old, still-open issues where users complain about performance problems and memory usage.
  5. I am not sure it's doable in Elasticsearch, but I thought I would load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.
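
To make option #2 concrete, here is a minimal sketch of feeding the CSV through the bulk API as partial updates, assuming the elasticsearch-py client; the file name, index/type names, label, and the 1,000-doc batch size are placeholders. Setting the label via a partial doc is idempotent, so re-running is harmless:

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # point this at your cluster

def update_actions(csv_path, index, doc_type, label):
    # One partial-update action per user_id in the CSV.
    with open(csv_path) as f:
        for row in csv.reader(f):
            yield {
                '_op_type': 'update',
                '_index': index,
                '_type': doc_type,
                '_id': row[0],  # user_id is the document ID
                'doc': {'label': label},
            }

# helpers.bulk batches the 9M actions into requests of chunk_size docs each.
helpers.bulk(es, update_actions('users.csv', 'myindex', 'user', 'github'),
             chunk_size=1000, stats_only=True)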

So the question remains: what's the best way to do this? If some of you have done this in the past, please share your numbers/performance and what you would do differently this time.

Alexandre Rafalovitch
Pentium10
  • interesting question; I would choose option #2 mixed with option #5; 1k docs per request is good; I would also create the 55M unique docs by user_id as empty docs before adding the new label, and then update the docs – Ionut Flavius Pogacian Oct 17 '14 at 15:19

5 Answers

3

While waiting for update by query support, I have opted for:

  1. Use the scan/scroll API to loop over the document IDs you want to tag (related answer).

  2. Use the bulk API to perform partial updates to set the tag on every matching doc.

Additionally, I store the tag data (your CSV) in a separate doc type, query it, and tag new docs as they are created, i.e., so I don't have to first index and then update (see the sketch after the snippet below).

Python snippet to illustrate the approach:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # client pointing at your cluster

def actiongen():
    # Scroll over every matching document, fetching only its ID.
    # myquery, myindex, and tags are placeholders defined elsewhere.
    docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
    for doc in docs:
        # Emit a partial-update action that sets the tags field.
        yield {
            '_op_type': 'update',
            '_index': doc['_index'],
            '_type': doc['_type'],
            '_id': doc['_id'],
            'doc': {'tags': tags},
        }

helpers.bulk(es, actiongen(), stats_only=True)
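
And for tagging new docs as they are created, a rough sketch of the lookup at index time (a sketch under assumptions: the tag data lives in a doc type called 'tag' keyed by user_id; all names are illustrative, not from the answer):

# Look up the tag doc for this user, if any, before indexing the user doc.
tag_doc = es.get(index=myindex, doc_type='tag', id=user_id, ignore=404)
if tag_doc.get('found'):
    new_user_doc['tags'] = tag_doc['_source']['tags']
es.index(index=myindex, doc_type='user', id=user_id, body=new_user_doc)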
Anton
  • This is basically what the update-by-query plugin performs, with many added network round-trips. Although the plugin does not yet support partial documents, only scripting. – ofavre Nov 11 '14 at 23:11
  • @Teka, good to know. I guess you mean that the plugin does this *without* the many unnecessary network round-trips. – Anton Nov 12 '14 at 07:32
  • This is what I mean, indeed. – ofavre Nov 12 '14 at 15:59
2

Using the aforementioned update-by-query plugin, you would simply call:

curl -XPOST localhost:9200/index/type/_update_by_query -d '{
    "query": {"filtered": {"filter": {
        "not": {"term": {"label": "github"}}
    }}},
    "script": "ctx._source.label = \"github\""
}'

The update-by-query plugin only accepts a script, not partial documents.

As for performance and memory issues, I guess the best thing is to give it a try.

ofavre
0

I'd go with the bulk API, with the caveat that you should try to update each document the minimal number of times. An update is just an atomic delete-and-add, and it leaves the deleted document behind as a tombstone until it can be merged out.

Sending a Groovy script to execute the update probably makes the most sense here, so you don't have to fetch the document first.
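
A sketch of what such scripted bulk updates might look like with the Python client (assuming ES 1.x with dynamic Groovy scripting enabled, and that the tags field already exists as a list; index/type names and user_ids are placeholders):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def script_actions(user_ids):
    for uid in user_ids:
        # Skip the write entirely (ctx.op = "none") if the tag is already
        # there, otherwise append it -- keeping updates per doc minimal.
        yield {
            '_op_type': 'update',
            '_index': 'myindex',
            '_type': 'user',
            '_id': uid,
            'script': 'if (ctx._source.tags.contains(tag)) { ctx.op = "none" } else { ctx._source.tags += tag }',
            'params': {'tag': 'github'},
        }

helpers.bulk(es, script_actions(user_ids), stats_only=True)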

Nik
  • Providing a partial document is more effective than scripting. The `Update API` doc provides this example: `curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{"doc":{"label":"github"},"detect_noop": true}'`. Note however that it cannot append to an array, it can only set fields to a given value or array of values. – ofavre Nov 11 '14 at 22:38
  • Just adding a link to Nik's pull request - https://github.com/elastic/elasticsearch/pull/15125 – Rob Bygrave Jan 29 '16 at 01:56
0

Could you create a parent/child relationship, whereby you add a 'tags' type which references your 'posts' type as its parent? This way you wouldn't need to perform a full reindex of your data; simply index each of the appropriate tags against the appropriate post ID (see the sketch below).
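
A rough sketch of the mapping and a child-tag write (illustrative names, not from the answer itself):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Declare 'tag' documents as children of 'post' documents.
es.indices.put_mapping(index='myindex', doc_type='tag',
                       body={'tag': {'_parent': {'type': 'post'}}})

# Attach a tag to post 42 without touching the post document itself.
es.index(index='myindex', doc_type='tag', parent='42',
         body={'name': 'github'})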

Telax
  • This approach is indeed interesting if the documents are either quite big or frequently updated. However, before choosing this solution, one should be aware that parent-child queries come at a memory cost at query time: all parent IDs are loaded into memory to perform the join efficiently. Depending on the performance needs, it may be more efficient to just update the original document once and for all. – ofavre Nov 11 '14 at 22:41
0

A very old thread. I landed here through the GitHub page for "update by query" to see if it was implemented in 2.0, but unluckily it is not. Thanks to the plugin from Teka, small updates are very much doable from Sense, but our use case was to update millions of documents daily based on certain complex queries. In the end, we moved to the es-hadoop connector. Although the infrastructure is a big overhead here, parallelizing the fetching/updating/inserting of documents through Spark helped us. If anyone has discovered :) any other suggestion in the past year, I would love to hear about it.
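
For reference, a minimal PySpark sketch of that es-hadoop write path (the configuration keys come from the es-hadoop documentation; the cluster address, index/type names, and the sample update are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName='tag-users')

# Each element is a (key, document) pair; the key is ignored by es-hadoop.
updates = sc.parallelize([('k', {'user_id': '123', 'label': 'github'})])

es_conf = {
    'es.nodes': 'localhost:9200',
    'es.resource': 'myindex/user',    # index/type to write to
    'es.write.operation': 'update',   # issue partial updates, not index ops
    'es.mapping.id': 'user_id',       # document field that holds the ID
}

updates.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass='org.elasticsearch.hadoop.mr.EsOutputFormat',
    keyClass='org.apache.hadoop.io.NullWritable',
    valueClass='org.elasticsearch.hadoop.mr.LinkedMapWritable',
    conf=es_conf)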

piyushGoyal