
I've just started devising an ElasticSearch mapping for a multitenant web app. In this app, there are site IDs and page IDs. Page IDs are unique per site, and randomly generated. Pages can have child pages.

What is best:

1) Use a compound key with site + page IDs? Like so:

"sitePageIdPath": "(siteID):(grandparent-page-ID).(parent-page-ID).(page-ID)"

or:

2) Use separate fields for site ID and page IDs? Like so:

"siteId": "(siteID)",
"pageIdPath": "(grandparent-page-ID).(parent-page-ID).(page-ID)"

?

I'm thinking that if I merge site ID and page IDs into one single field, then ElasticSearch will need to handle only that field, and this should be somewhat more performant than using two fields — both when indexing and when searching? And require less storage space.

However, perhaps there's some drawback that I'm not aware of? Hence this question.

Some details: 1) I'm using a single index, and I'm over allocating shards (100 shards), as suggested when one uses the "users" data flow pattern. 2) I'm specifying routing parameters explicitly in the URL (i.e. &routing=site-ID), not via any siteId field in the documents that are indexed.
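For illustration, here is a minimal sketch of how the routing value ends up in the request URL. The host `localhost:9200`, the index name `sites` and the type name `page` are placeholders made up for this example, and the `_id` layout `(site-ID):(page-ID)` is the one described in the later update. The function only builds the URL and sends nothing.

```python
from urllib.parse import quote, urlencode


def index_url(site_id, page_id,
              host="http://localhost:9200",  # hypothetical host
              index="sites",                 # hypothetical index name
              doc_type="page"):              # hypothetical type name
    """Build the URL for indexing one page, routed by its site ID."""
    # The document ID combines site and page ID; percent-encode the ':'.
    doc_id = quote(f"{site_id}:{page_id}", safe="")
    # Routing is passed as a URL parameter, not read from a document field.
    query = urlencode({"routing": site_id})
    return f"{host}/{index}/{doc_type}/{doc_id}?{query}"


print(index_url("123", "789"))
```

All documents of one site then share a routing value, so they should land on the same shard.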

Update 7 hours later:

1) All queries should be filtered by site id (that is, tenant id). If I do combine the site ID with the page ID, I suppose/hope that I can use a prefix filter, to filter on site ID. I wonder if this will be as fast as filtering on a single dedicated siteId field (e.g. can the results be cached).

2) Example queries: Full text search. List all users. List all pages. List all child/successor pages of a certain page. Load a single page (via _source).

Update 22 hours later:

3) I am able to search by page ID, because as ElasticSearch's _id, I store: (site-ID):(page-ID). So it's not a problem that the page ID is otherwise "hidden" as the last element of pageIdPath. I probably should have mentioned earlier that I had a separate page ID field, but I thought let's keep the question short.

4) I use index: not_analyzed for these ID fields.

KajMagnus

2 Answers


There are performance issues when indexing and searching if you use 1 field. I think you're mistaken in thinking 1 field would speed things up.

If using 1 field you have basically 2 mapping choices:

  1. If you use the default mappings, the string (siteID):(grandparent-page-ID).(parent-page-ID).(page-ID) will get broken up by the analyzer to the tokens (siteID) (grandparent-page-ID) (parent-page-ID) (page-ID). Now your ids are like a bag of words and either a term or prefix filter might find a match from the pageID when you meant for it to match the siteID.

  2. If you set your own analyzer (and I would like to know if you can think of a good way of doing this) the first one that comes to mind is the keyword (or not_analyzed) analyzer. This will keep the string as one token so you don't lose the context. However, now you have a big performance hit when using a prefix filter. Imagine I index the string "123.456.789" as one token (siteID.parentpageID.pageID). I want to filter by siteID = 123, so I use a prefix filter. As you can read here, this prefix filter is actually expanded into a bool query of hundreds of terms all ORed together (123 or 1231 or 1232 or 1233 etc...), which is a massive waste of computing power when you could just structure your data better.
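The "bag of words" problem from point 1 can be sketched in a few lines of Python. This is only a toy model: `split_tokens` is a made-up stand-in that splits on `:` and `.`, and a real analyzer may tokenize such strings differently. The document names and IDs are invented for the example.

```python
import re


def split_tokens(value):
    # Assumption: a tokenizer that splits on ':' and '.'.
    # A real analyzer may behave differently on these strings.
    return [token for token in re.split(r"[:.]", value) if token]


# Two compound IDs of the form (siteID):(parent-page-ID).(page-ID).
docs = {
    "doc_a": "123:456.789",  # belongs to site 123
    "doc_b": "999:456.123",  # belongs to site 999, page ID happens to be 123
}


def term_match(term):
    """Return the documents whose token bag contains the term."""
    return sorted(name for name, value in docs.items()
                  if term in split_tokens(value))


print(term_match("123"))  # matches both documents
```

Once the ID is split, the position of each part is lost, so a lookup meant for site 123 also hits a page of another site.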

I urge you to read more about lucene's PrefixQuery and how it works.

If I were you I would do this.

Mapping

"properties": {
  "site_id": {
    "type": "string",
    "index": "not_analyzed" //keyword would also work here, they are basically the same
  },
  "parent_page_id": {
    "type": "string",
    "index": "not_analyzed"
  },
  "page_id": {
    "type": "string",
    "index": "not_analyzed"
  },
  "page_content": {
    "type": "string",
    "analyzer": "standard" //you may want to use snowball to enable stemming
  }
}

Queries

Text search for "elasticsearch tutorial" under siteID "123"

"filtered": {
  "query": {
    "match": {
      "page_content": "elasticsearch tutorial"
    }
  },
  "filter": {
    "term": {
      "site_id": "123"
    }
  }
}

All child pages of page "456" under site "123"

"filtered": {
  "query": {
    "match_all": {}
  },
  "filter": {
    "and": [
      {
        "term": {
          "site_id": "123"
        }
      },
      {
        "term": {
          "parent_page_id": "456"
        }
      }
    ]
  }
}
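If you build these queries in application code, a small helper keeps the site filter in one place so it cannot be forgotten. This is just a sketch in Python. The function names are made up, and the dicts mirror the two query fragments above exactly.

```python
import json


def site_filter(site_id):
    """Term filter on the dedicated site_id field (the tenant filter)."""
    return {"term": {"site_id": site_id}}


def text_search(site_id, phrase):
    """Full text search on page_content, restricted to one site."""
    return {
        "filtered": {
            "query": {"match": {"page_content": phrase}},
            "filter": site_filter(site_id),
        }
    }


def child_pages(site_id, parent_page_id):
    """All direct child pages of one page, restricted to one site."""
    return {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "and": [
                    site_filter(site_id),
                    {"term": {"parent_page_id": parent_page_id}},
                ]
            },
        }
    }


print(json.dumps(text_search("123", "elasticsearch tutorial"), indent=2))
print(json.dumps(child_pages("123", "456"), indent=2))
```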
ramseykhalaf
  • Thanks for this detailed answer! I didn't know that prefix queries are transformed to boolean queries, I was a bit surprised :-) I do use *index: not_analyzed*. I'll rewrite my mapping to match the one you suggested. (Searching directly for the page ID was already possible though, because it's included in *_id*, like so: *(site-ID):(page-ID)*.) I updated my downvoted answer with clarifications on what the problems with that answer are. – KajMagnus Jul 29 '13 at 05:36
  • But I'm never using a PrefixQuery? It'd be a PrefixFilter. Hmm, [this Lucene FAQ](http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F) says that a workaround for the *TooManyClauses* is to *"Use a filter to replace the part of the query that causes the exception"* — which gives me the impression that PrefixFilter:s don't expand to a list of boolean clauses? Which seems weird though — why only expand PrefixQuery but not PrefixFilter. – KajMagnus Jul 29 '13 at 05:51
  • Just think about this from a computer science point of view. Searching is basically looking things up in an index (probably a hash map of some sort). How would you search for documents where there is a long string and you want to have a certain prefix... Either you're going to have to go through **every** doc or you are going to have to expand the prefix and hash all the possible expansions and look up the documents in index. A term filter is so simple. Hash the term, then look up in the hash table for that field and get the list of docs which match **immediately**. – ramseykhalaf Jul 29 '13 at 05:59
  • One can use a [prefix tree](http://en.wikipedia.org/wiki/Trie) too. That's what I was thinking that Lucene perhaps did, somehow. Relational databases do prefix searches efficiently (without converting to boolean clauses) when you search for `... where some_column like 'prefix%'`, via index range scans in tree indexes I think. (I should be a bit careful with assumptions about how Lucene works based on how relational databases work, it seems) – KajMagnus Jul 29 '13 at 06:17
  • Thanks for the link! I hadn't thought of that. I don't know how the index is stored in lucene/es. Would one store the entire result set at each node (surely too much space), or you'd have to traverse the tree to the leaves (with your favourite traversal algorithm) to generate the set of hits. – ramseykhalaf Jul 29 '13 at 06:48
  • I'd say one would traverse the tree to reach the leaves. At the leaves, however, one can store links to the next and previous leaves, so once one has reached a leaf, it's a linked list traversal, to retrieve all results. [Here's a B+ tree example linking](http://en.wikipedia.org/wiki/B%2B_tree). – KajMagnus Jul 29 '13 at 11:19
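
The range scan idea from the comments above can be sketched with a sorted list and a binary search. This only illustrates the general technique (seek to the first term that is not smaller than the prefix, then walk forward while the prefix still matches). It is not a claim about how Lucene stores or scans its terms, and the example IDs are invented.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Return all terms starting with prefix, using one binary search."""
    # Seek to the first term that is >= prefix.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    # Walk forward through neighbouring terms until the prefix stops matching.
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


terms = sorted(["123:456", "123:789", "1234:111", "124:456", "99:123"])
print(terms_with_prefix(terms, "123:"))  # ['123:456', '123:789']
```

The cost is one seek plus one step per matching term, so it grows with the number of terms that share the prefix rather than with the size of the whole index.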

Edit: There's a problem with this answer, namely possible BooleanQuery.TooManyClauses exceptions; please see the update below, after the original answer. /Edit

I think it's okay to combine the site ID and the page ID, and use [a prefix filter that matches on the site ID] when querying. I found this info in the Query DSL docs:

Some filters already produce a result that is easily cacheable, and the difference between caching and not caching them is the act of placing the result in the cache or not. These filters, which include the term, terms, prefix, and range filters

So combining site ID and page ID should be okay w.r.t. performance I think. And I cannot think of any other issues (keeping in mind that looking up by page ID only makes no sense, since the page ID means nothing without the site ID.)


Update:

I'd guess the downvote is mainly 1) because there are performance issues if I combine (Site-ID):(Parent-page-ID):(Page-ID) into one field, and then try to search for the page ID. However the page ID is available in the _id field, which is: (site-ID):(page-ID), so this should not be an issue. (That is, I'm not using only 1 field — I'm using 2 fields.)

The queries that correspond to ramseykhalaf's queries would then be:

"filtered": {
  "query": {
    "match": {
      "page_content": "search phrase"
    }
  },
  "filter" : {
    "prefix" : {
      "_id" : "123:"    // site ID is "123"
    }
  }
}

And:

"filtered": {
  "query": {
    "match_all": {}
  },
  "filter": {
    "and": [{
      "prefix" : {
        "_id" : "123:"  // site ID is "123"
      }
    }, {
      "prefix": {
        "pageIdPath": "456:789:"  // section and sub section IDs are 456:789
                               // (I think I'd never search for a *subsection* only,
                               // without also knowing the parent section ID)
      }
    }]
  }
}

(I renamed sitePageIdPath to pageIdPath since site-ID is stored in _id)
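
One detail in the queries above is the trailing `:` in the prefix `"123:"`. A small Python sketch of why it matters, using made-up IDs and plain string matching instead of a real prefix filter:

```python
def site_prefix(site_id):
    """Prefix that selects one site's documents: site ID plus separator."""
    return f"{site_id}:"


# Hypothetical _id values of the form (site-ID):(page-ID).
ids = ["123:789", "1234:555", "123:abc"]

with_separator = [i for i in ids if i.startswith(site_prefix("123"))]
without_separator = [i for i in ids if i.startswith("123")]

print(with_separator)     # ['123:789', '123:abc']
print(without_separator)  # ['123:789', '1234:555', '123:abc']
```

Without the separator, the prefix for site 123 also matches documents of site 1234, which would leak data between tenants.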


Another 2) minor reason for the downvote might be that (and I didn't know about this until now) prefix queries are broken up to boolean queries that match on all terms with the specified prefix, and these boolean queries could in my case include really really many terms, if there are really really many pages (there might be) or section IDs (there aren't) in the relevant website. So using a term query directly is faster? And cannot result in a too-many-clauses exception (see link below).

For more info on PrefixQuery, see:
How to improve a single character PrefixQuery performance? and
With Lucene: Why do I get a Too Many Clauses error if I do a prefix search?

This to-boolean-query transformation apparently happens not only for prefix queries, but for range queries too, see e.g. Help needed figuring out reason for maxClauseCount is set to 1024 error and the Lucene BooleanQuery.TooManyClauses docs: "Thrown when an attempt is made to add more than BooleanQuery.getMaxClauseCount() clauses. This typically happens if a PrefixQuery, FuzzyQuery, WildcardQuery, or TermRangeQuery is expanded to many terms during search"

KajMagnus