We are planning to use Filtered Aliases as mentioned here - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html

Our input data is going to be a stream with each line of the stream corresponding to an object we would like to store in ES.

Each object contains an 'id', which we are using for routing and filtering.

QUESTION - How do we create the aliases and index the data in a performant way?

-- Do we index all the data, keep track of all the unique 'id's, and create the filtered aliases at the very end? OR

-- For each object, do we check whether an alias for that 'id' exists and, if it doesn't, create one?

I'm leaning towards the first approach. Is it advisable, and is it more performant than the second approach?

TIA.

curiouscoder
  • How many indices do you have? How much data are you going to index? After you've indexed your initial set of data, will there be more data coming with different ids that would require the creation of new filtered aliases? – Val May 28 '15 at 03:10
  • @Val - There's only one index for my application (we have a common ES cluster that other applications share as well). A daily job populates the data in ES, and we foresee no more than tens of millions of entries on each run. Yes, on each subsequent run we may be adding new entries corresponding to different ids, which would require new filtered aliases to be created. – curiouscoder May 28 '15 at 05:48
  • Ok and one more thing: what's the cardinality of that id field (i.e. how many different unique ids are there approximately)? – Val May 28 '15 at 05:57
  • @Val - There should be no more than 100 unique ids and thus, as many filtered aliases. – curiouscoder May 28 '15 at 06:27
  • Ok, the whole purpose of aliases is to be able to federate many indices under one logical name, but since you have a single index, you don't really need aliases at all. If you use routing on that specific id field, it would be more than sufficient and efficient to achieve what you need in my opinion. – Val May 28 '15 at 06:31
  • I see. Our use case is very similar to the one discussed in this blog post, which we used as a starting point since we were just starting out - http://engineering.aweber.com/using-elasticsearchs-aliases/ – curiouscoder May 28 '15 at 06:47
  • It's my understanding that routing is used to route the request to a particular shard. What role does the filter in the alias play? Is it applied during indexing or searching? I tried to look through the docs but could not find a clear answer. – curiouscoder May 28 '15 at 06:53

1 Answer

Based on our discussion above and after having glanced over the blog article you posted, I'm pretty positive that in your case you don't need aliases at all and the routing key would suffice. Again, this holds only because you have a single index; if you had many indices, it would no longer be true!

You simply need to specify the routing key to use when indexing your document. Until ES 2.0, you can use the path setting of the _routing mapping field for that purpose. It has been deprecated since ES 1.5, but it still works and serves your purpose.

{
    "customer" : {
        "_routing" : {
            "required" : true,
            "path" : "customer_id"     <----- the field you use as the routing key
        },
        "properties": { ... }
    }
}
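If the mapping-level path setting is not available, the routing value can instead be passed on each index request as a routing URL parameter. The following is a minimal sketch of building such a URL, not a definitive client: the helper name index_url, the document type "customer", the document id "42" and the host are all hypothetical, and the index/type/id URL layout assumes an ES version that still has mapping types.

```python
from urllib.parse import quote, urlencode


def index_url(base, index, doc_type, doc_id, routing):
    """Build the URL for indexing one document with an explicit routing value.

    Hypothetical helper: it only assembles the URL, it does not send anything.
    """
    # Escape each path segment so unusual ids cannot break the URL.
    path = "/".join(quote(str(part), safe="") for part in (index, doc_type, doc_id))
    # The routing value travels as a query-string parameter.
    return f"{base.rstrip('/')}/{path}?{urlencode({'routing': routing})}"


# Hypothetical values for illustration only.
url = index_url("http://localhost:9200", "customers", "customer", "42", "cid1")
print(url)  # http://localhost:9200/customers/customer/42?routing=cid1
```

The document body would then be sent to that URL with PUT, using whatever HTTP client you already have.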

Then when searching you simply need to specify &routing=<customer_id> in your search URL in addition to your customer id filter (since a given shard can host documents for different customers). Your search will go directly to the shard identified by the given routing key, and thus, only retrieve data from the specified customer.
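As a rough illustration of such a routed search, the sketch below builds the search URL and a request body that filters on the customer id. The helper name routed_search and the host are hypothetical, and wrapping the term filter in a constant_score query is my assumption about how to express the filter; adapt it to the query DSL of your ES version.

```python
import json
from urllib.parse import urlencode


def routed_search(base, index, customer_id):
    """Return (url, body) for a search routed to one customer's shard.

    Hypothetical helper: it builds the request but does not send it.
    """
    # routing sends the search only to the shard that holds this key.
    url = f"{base.rstrip('/')}/{index}/_search?{urlencode({'routing': customer_id})}"
    # The filter is still needed: one shard can host several customers.
    body = {
        "query": {
            "constant_score": {
                "filter": {"term": {"customer_id": customer_id}}
            }
        }
    }
    return url, json.dumps(body)


search_url, search_body = routed_search("http://localhost:9200", "customers", "cid1")
print(search_url)  # http://localhost:9200/customers/_search?routing=cid1
```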

Using a filtered alias for this brings nothing as the filter and routing key you'd include in your alias definition would not contribute anything additional, since the retrieved documents are already "filtered" (kind of) by the routing key. This is way easier than trying to detect (on each new document to index) if an alias exists or not and create it if it doesn't.

UPDATE:

Now if you absolutely have/want to create filtered aliases, the more performant way would be the first one you mentioned:

  1. First index your daily data
  2. Then run a terms aggregation on your customer_id field with size high enough (i.e. higher than the cardinality of the field, which was ~100 in your case) to make sure you capture all unique customer ids to create your aliases
  3. Loop over all the buckets to retrieve all unique customer ids
  4. Create all aliases in one shot using one action for each customer_id:

curl -XPOST 'http://localhost:9200/_aliases' -d '{
    "actions" : [
        {
            "add" : {
                 "index" : "customers",
                 "alias" : "alias_cid1",
                 "routing" : "cid1",
                 "filter" : { "term" : { "customer_id" : "cid1" } }
            }
        },
        {
            "add" : {
                 "index" : "customers",
                 "alias" : "alias_cid2",
                 "routing" : "cid2",
                 "filter" : { "term" : { "customer_id" : "cid2" } }
            }
        },
        {
            "add" : {
                 "index" : "customers",
                 "alias" : "alias_cid3",
                 "routing" : "cid3",
                 "filter" : { "term" : { "customer_id" : "cid3" } }
            }
        },
        ...
    ]
}'
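Steps 2 to 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the aggregation name unique_ids and the sample response are hypothetical (the response is hand-written in the shape a terms aggregation returns, not real output), and sending the two request bodies to ES is left to your HTTP client.

```python
import json


def terms_agg_body(field, size):
    """Search body for a terms aggregation that lists the unique values of a field."""
    # "size": 0 at the top level skips the hits; only the buckets are needed.
    return {"size": 0, "aggs": {"unique_ids": {"terms": {"field": field, "size": size}}}}


def alias_actions(agg_response, index, field, agg_name="unique_ids"):
    """Turn the aggregation buckets into one 'add' action per unique id."""
    buckets = agg_response["aggregations"][agg_name]["buckets"]
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": f"alias_{bucket['key']}",
                    "routing": str(bucket["key"]),
                    "filter": {"term": {field: bucket["key"]}},
                }
            }
            for bucket in buckets
        ]
    }


# Hand-written sample response for illustration, not real ES output.
sample_response = {
    "aggregations": {
        "unique_ids": {
            "buckets": [
                {"key": "cid1", "doc_count": 10},
                {"key": "cid2", "doc_count": 7},
            ]
        }
    }
}

payload = alias_actions(sample_response, "customers", "customer_id")
print(json.dumps(payload, indent=2))  # body to POST to the _aliases endpoint
```

Building the whole payload first keeps alias creation to a single request per daily run, which is the point of the first approach.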

Note that you don't have to worry about whether an alias already exists: the command won't fail and will silently ignore the existing alias.

Once this command has run, you'll have all your aliases on your single index, each properly configured with a filter and a routing key.

Val
  • Thank you very much, Val. We are planning to upgrade to ES 1.5 soon, so _routing might not work for us? In that case, what would be an alternative approach? – curiouscoder May 28 '15 at 07:29
  • You can always use the routing parameter when indexing your documents, whether [one-off](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-routing) or via the [bulk API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html#bulk-routing) – Val May 28 '15 at 07:32
  • I understand now :-). So, what's the significance of the filtered alias? Just for the sake of completeness, which approach would be most performant (my original question) if I were indeed creating filtered aliases? Thanks. – curiouscoder May 28 '15 at 18:50