
I have an Elasticsearch index, my_index, with millions of documents keyed by my_uuid. On top of that index I have several filtered aliases of the following form (showing only my_alias, as retrieved by GET my_index/_alias/my_alias):

{
    "my_index": {
        "aliases": {
            "my_alias": {
                "filter": {
                    "terms": {
                        "my_uuid": [
                            "0944581b-9bf2-49e1-9bd0-4313d2398cf6",
                            "b6327e90-86f6-42eb-8fde-772397b8e926",
                            thousands of rows...
                        ]
                    }
                }
            }
        }
    }
}

My understanding is that the filter is cached transparently, without any configuration on my part. However, searches that go through the alias are very slow, which suggests that either 1. the filter is not cached, or 2. it is written incorrectly.

Indicative numbers:

GET my_index/_search -> 50ms 
GET my_alias/_search -> 8000ms

I can provide further information on the cluster scale, and size of data if anyone considers this relevant.

I am using Elasticsearch 2.4.1. The results are correct; it is only the performance that concerns me.

yannisf
  • what happens when you run the search query directly and add the filter that is applied to the alias. does it take time? – pratikvasa Feb 09 '17 at 13:52
  • Have you checked that `my_uuid` is `not_analyzed`? But thousands of terms in a filter seems quite heavyweight. If you know these uuids at index time you could add a new field `aliases` to each doc. Then your filter would just have a single term. – NikoNyrh Feb 09 '17 at 14:59
  • @NikoNyrh `my_uuid` is `not_analyzed`. Indeed I know them at index time, but they are dynamically updated in bulk, so I did not want to hard code them into the searchable documents. – yannisf Feb 09 '17 at 15:37
  • Hi @pratikvasa. I performed the test and got similar times. The thing is, that the query I have to send when not using the alias with the filter is around 4MB due to the number of the `my_uuid`s, and just uploading the query takes about 6 seconds. So I guess this is not considered a viable solution. – yannisf Feb 10 '17 at 13:51
  • ohk..by similar times you mean you are getting around 8 secs which includes 6 seconds to send the query? – pratikvasa Feb 10 '17 at 14:13
  • Filter caches are only returned after the 3rd hit to the same filter. If you run the same query to the alias multiple times, does the time taken go down? – Farid May 12 '18 at 00:50
  • No, it doesn't. Official answer I got from the Elastic forum was that this is unlikely to improve anytime soon and using such filters is an anti-pattern. – yannisf May 14 '18 at 18:52
  • @yannisf I know this is a really old thread, but when you say 'Official answer I got from Elastic forum', I'm wondering if it's possible to maybe add a link to it here for future readers. – Ayush Feb 25 '23 at 10:09
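For readers weighing NikoNyrh's suggestion above, here is a minimal sketch of what the alias definition could shrink to if each document carried an `aliases` field. It only builds the request body for `POST /_aliases`; the helper name `single_term_alias_action` is hypothetical, and it assumes `aliases` is indexed as `not_analyzed` and holds the alias names a document belongs to.

```python
import json


def single_term_alias_action(index, alias, field="aliases"):
    """Build a POST /_aliases body whose filter is a single term lookup.

    Assumption: every document stores, in `field`, the names of the
    aliases it should be visible through.
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": alias,
                    # One term instead of thousands of uuids.
                    "filter": {"term": {field: alias}},
                }
            }
        ]
    }


body = single_term_alias_action("my_index", "my_alias")
print(json.dumps(body, indent=2))
```

The trade-off, as noted in the reply below, is that a bulk change to the uuid list then means updating the affected documents rather than the alias.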

1 Answer


Matching each document against a 4MB list of uuids is definitely not the way to go. Try to imagine how many CPU cycles that requires; 8s is actually quite fast.

I would duplicate the subset of data in another index.

If you need to reflect changes immediately, you will have to manage the subset index by hand:

  • when you delete a uuid from the list, you delete the corresponding documents
  • when you add a uuid, you copy the corresponding documents (reindex api with a query is your friend)
  • when you insert a document, you have to check if the document should be added in subset index too
  • when you delete a document, delete it in both indices

Force the document IDs so that they are the same in both indices. Beware of the refresh interval if you store the uuid list in an Elasticsearch index.

If updating the subset with new uuids is not time critical, you can just run the reindex every day or every hour.
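As a rough illustration of the bookkeeping described above, the sketch below diffs the old and new uuid lists and builds a `POST /_reindex` body that copies only the documents for newly added uuids. The function names and the index name `my_subset_index` are hypothetical, and nothing is sent to a cluster here. Deleting documents for removed uuids is left out, since how you delete by query depends on the Elasticsearch version and installed plugins.

```python
def plan_subset_sync(old_uuids, new_uuids):
    """Return (added, removed) uuids between two versions of the list."""
    old, new = set(old_uuids), set(new_uuids)
    return sorted(new - old), sorted(old - new)


def reindex_body(source_index, dest_index, uuids, field="my_uuid"):
    """Build a POST /_reindex body restricted to the given uuids."""
    return {
        "source": {
            "index": source_index,
            # Only documents whose uuid was just added get copied.
            "query": {"terms": {field: list(uuids)}},
        },
        "dest": {"index": dest_index},
    }


# Placeholder uuids, for illustration only.
added, removed = plan_subset_sync(["uuid-a", "uuid-b"], ["uuid-b", "uuid-c"])
body = reindex_body("my_index", "my_subset_index", added)
print(added, removed)
print(body)
```

For the periodic variant, the same body can be built from the full uuid list and run against a freshly created subset index on whatever schedule suits you.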

bokan