
A common problem in search interfaces is that you want to return a selection of results, but also want to return information about all documents (e.g. I want to see all red shirts, but also want to know which other colors are available).

This is sometimes referred to as "faceted results" or "faceted navigation". The example from the Elasticsearch reference is quite clear in explaining why and how, so I've used it as the basis for this question.

Summary / Question: It looks like I can use either a post-filter or a global aggregation for this. They seem to provide the exact same functionality in different ways. Are there advantages or disadvantages to either that I don't see? If so, which should I use?

I have included a complete example below, with some documents and a query for each of the two methods, based on the example in the reference guide.


Option 1: post-filter

See the example from the Elasticsearch reference.

What we can do is have more results in our original query, so we can aggregate 'on' those results, and afterwards filter our actual results.

The example is quite clear in explaining it:

But perhaps you would also like to tell the user how many Gucci shirts are available in other colors. If you just add a terms aggregation on the color field, you will only get back the color red, because your query returns only red shirts by Gucci.

Instead, you want to include shirts of all colors during aggregation, then apply the colors filter only to the search results.

See the example code below for how this would look.

An issue with this is that we cannot use caching. The Elasticsearch guide (not yet available for 5.1) warns about this:

Performance consideration: Use a post_filter only if you need to differentially filter search results and aggregations. Sometimes people will use post_filter for regular searches.

Don’t do this! The nature of the post_filter means it runs after the query, so any performance benefit of filtering (such as caches) is lost completely.

The post_filter should be used only in combination with aggregations, and only when you need differential filtering.
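To make the distinction in that warning concrete, here is a minimal Python sketch that builds the two request bodies as plain dictionaries. The helper names `filtered_search` and `faceted_search` are hypothetical and only for illustration; the request shapes follow the `query`, `aggs` and `post_filter` keys used in the reference example.

```python
def filtered_search(filters):
    """Regular search: all filters live inside the query (bool/filter)."""
    return {"query": {"bool": {"filter": filters}}}


def faceted_search(agg_filters, facet_field, result_filters):
    """Differential filtering: the query stays broad so the terms
    aggregation sees every value, and post_filter narrows only the hits."""
    return {
        "query": {"bool": {"filter": agg_filters}},
        "aggs": {facet_field + "s": {"terms": {"field": facet_field}}},
        "post_filter": {"bool": {"filter": result_filters}},
    }


brand = {"term": {"brand": "gucci"}}
red = {"term": {"color": "red"}}

# No aggregation that needs a wider view: keep everything in the query.
plain = filtered_search([brand, red])

# Colors facet over all Gucci shirts, hits restricted to the red ones.
faceted = faceted_search([brand], "color", [red])
```

Only the second request needs `post_filter`, because its aggregation has to see documents that the final hits exclude.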

There is however a different option:

Option 2: global aggregations

There is a way to do an aggregation that is not influenced by the search query. So instead of fetching a broad result set, aggregating on it, and then filtering, we just fetch our filtered results but run the aggregations on everything. Take a look at the reference.

We can get the exact same results. I did not read any warnings about caching for this, but it seems that in the end we need to do about the same amount of work, so that may be the only omission.

It is a tiny bit more complicated because of the sub-aggregation we need (you can't have global and a filter on the same 'level').

The only complaint I read about queries using this is that you might have to repeat yourself if you need to do this for several items. In the end we can generate most queries, so repeating oneself isn't much of an issue for my use case, and I do not really consider it an issue on par with "cannot use cache".
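Since the repetition can be generated, a small Python sketch of such a generator might look like this. The function names and the inner aggregation names (`filtered`, `values`) are made up for illustration; the nesting of `global`, `filter` and `terms` mirrors the structure described above.

```python
def term_filters(selection):
    """Turn {"field": "value"} pairs into a list of term filters."""
    return [{"term": {field: value}} for field, value in selection.items()]


def global_facet(field, scope):
    """A global aggregation wrapping a filter, wrapping a terms aggregation.
    The extra level is needed because global and filter cannot share a level."""
    return {
        "global": {},
        "aggs": {
            "filtered": {
                "filter": {"bool": {"filter": term_filters(scope)}},
                "aggs": {"values": {"terms": {"field": field}}},
            }
        },
    }


def build_request(selection, facets):
    """facets maps an aggregation name to (field to bucket on, scope selection)."""
    return {
        "query": {"bool": {"filter": term_filters(selection)}},
        "aggs": {
            name: global_facet(field, scope)
            for name, (field, scope) in facets.items()
        },
    }


request = build_request(
    {"color": "red", "brand": "gucci"},
    {
        "models": ("model", {"color": "red", "brand": "gucci"}),
        "colors": ("color", {"brand": "gucci"}),
    },
)
```

Adding another facet is then one more entry in the `facets` dictionary rather than another hand-written block.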

Question

It seems the two features at least overlap, or possibly provide the exact same functionality. This baffles me. Apart from that, I'd like to know whether one has an advantage over the other that I haven't seen, and whether there is a best practice here.

Example

This is largely taken from the post-filter reference page, but I added the global aggregation query.

mapping and documents

PUT /shirts
{
    "mappings": {
        "item": {
            "properties": {
                "brand": { "type": "keyword"},
                "color": { "type": "keyword"},
                "model": { "type": "keyword"}
            }
        }
    }
}

PUT /shirts/item/1?refresh
{
    "brand": "gucci",
    "color": "red",
    "model": "slim"
}

PUT /shirts/item/2?refresh
{
    "brand": "gucci",
    "color": "blue",
    "model": "slim"
}


PUT /shirts/item/3?refresh
{
    "brand": "gucci",
    "color": "red",
    "model": "normal"
}


PUT /shirts/item/4?refresh
{
    "brand": "gucci",
    "color": "blue",
    "model": "wide"
}


PUT /shirts/item/5?refresh
{
    "brand": "nike",
    "color": "blue",
    "model": "wide"
}

PUT /shirts/item/6?refresh
{
    "brand": "nike",
    "color": "red",
    "model": "wide"
}

We are now requesting all red Gucci shirts (items 1 and 3), the models available for these two shirts (slim and normal), and the colors in which Gucci shirts are available (red and blue).

First, a post-filter: get all shirts, aggregate the models for red Gucci shirts and the colors for all Gucci shirts, then post-filter for red Gucci shirts so that only those are shown as results. (This is a bit different from the reference example, as we try to get as close as possible to a clear application of post-filters.)

GET /shirts/_search
{
  "aggs": {
    "colors_query": {
      "filter": {
        "term": {
          "brand": "gucci"
        }
      },
      "aggs": {
        "colors": {
          "terms": {
            "field": "color"
          }
        }
      }
    },
    "color_red": {
      "filter": {
        "bool": {
          "filter": [
            {
              "term": {
                "color": "red"
              }
            },
            {
              "term": {
                "brand": "gucci"
              }
            }
          ]
        }
      },
      "aggs": {
        "models": {
          "terms": {
            "field": "model"
          }
        }
      }
    }
  },
  "post_filter": {
    "bool": {
      "filter": [
        {
          "term": {
            "color": "red"
          }
        },
        {
          "term": {
            "brand": "gucci"
          }
        }
      ]
    }
  }
}

We could also get all red Gucci shirts (our original query), and then do a global aggregation for the model (for all red Gucci shirts) and for the color (for all Gucci shirts).

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red"   }},
        { "term": { "brand": "gucci" }}
      ]
    }
  },
  "aggregations": {
    "color_red": {
      "global": {},
      "aggs": {
        "sub_color_red": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "color": "red"   }},
                { "term": { "brand": "gucci" }}
              ]
            }
          },
          "aggs": {
            "keywords": {
              "terms": {
                "field": "model"
              }
            }
          }
        }
      }
    },
    "colors": {
      "global": {},
      "aggs": {
        "sub_colors": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "brand": "gucci" }}
              ]
            }
          },
          "aggs": {
            "keywords": {
              "terms": {
                "field": "color"
              }
            }
          }
        }
      }
    }
  }
}

Both return the same information; the second differs only in the extra level introduced by the sub-aggregations. The second query looks a bit more complex, but I don't think that is very problematic. A real-world query is generated by code and is probably far more complex anyway; it should be a good query, and if that means complicated, so be it.
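As a sanity check on the claim that both return the same information, here is a toy in-memory model in Python. It does not talk to Elasticsearch and says nothing about performance; it only mimics which documents each aggregation gets to see in the two approaches, using the six shirts indexed above. All function names are hypothetical.

```python
from collections import Counter

SHIRTS = [
    {"brand": "gucci", "color": "red", "model": "slim"},
    {"brand": "gucci", "color": "blue", "model": "slim"},
    {"brand": "gucci", "color": "red", "model": "normal"},
    {"brand": "gucci", "color": "blue", "model": "wide"},
    {"brand": "nike", "color": "blue", "model": "wide"},
    {"brand": "nike", "color": "red", "model": "wide"},
]

RED_GUCCI = {"brand": "gucci", "color": "red"}
GUCCI = {"brand": "gucci"}


def matches(doc, criteria):
    """True if the document has every field/value pair in criteria."""
    return all(doc[field] == value for field, value in criteria.items())


def facet(docs, criteria, field):
    """Filter aggregation followed by a terms aggregation on field."""
    return dict(Counter(d[field] for d in docs if matches(d, criteria)))


def post_filter_style(docs):
    # No query: aggregations see every document, hits are filtered afterwards.
    aggs = {
        "models": facet(docs, RED_GUCCI, "model"),
        "colors": facet(docs, GUCCI, "color"),
    }
    hits = [d for d in docs if matches(d, RED_GUCCI)]
    return hits, aggs


def global_agg_style(docs):
    # The query narrows the hits; global aggregations ignore it and see all docs.
    hits = [d for d in docs if matches(d, RED_GUCCI)]
    aggs = {
        "models": facet(docs, RED_GUCCI, "model"),
        "colors": facet(docs, GUCCI, "color"),
    }
    return hits, aggs


hits, aggs = post_filter_style(SHIRTS)
same = (hits, aggs) == global_agg_style(SHIRTS)
```

In this model both styles yield the two red Gucci shirts as hits, with model counts of one slim and one normal, and color counts of two red and two blue.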

Nanne

2 Answers


The actual solution we used, while not a direct answer to the question, is basically "neither".

From this elastic blogpost we got the initial hint:

Occasionally, I see an over-complicated search where the goal is to do as much as possible in as few search requests as possible. These tend to have filters as late as possible, completely in contrary to the advise in Filter First. Do not be afraid to use multiple search requests to satisfy your information need. The multi-search API lets you send a batch of search requests.

Do not shoehorn everything into a single search request.

And that is basically what we are doing in the query above: a big bunch of aggregations and some filtering.

Running them in parallel proved to be much, much quicker. Have a look at the multi-search API.
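For illustration, a minimal Python sketch of how the single big request could be split into a multi-search body. The `_msearch` endpoint takes newline-delimited JSON, alternating a header line and a search body line, and the payload must end with a newline. The helper name `msearch_body` and the way the three searches are split are assumptions for this example, not the exact requests we ran.

```python
import json


def msearch_body(index, searches):
    """Serialize search bodies into the NDJSON payload for POST /_msearch."""
    lines = []
    for body in searches:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(body))              # search body line
    return "\n".join(lines) + "\n"                  # trailing newline is required


brand = {"term": {"brand": "gucci"}}
red = {"term": {"color": "red"}}

searches = [
    # 1. The actual hits: red Gucci shirts.
    {"query": {"bool": {"filter": [red, brand]}}},
    # 2. Models facet for red Gucci shirts; size 0 because no hits are needed.
    {
        "size": 0,
        "query": {"bool": {"filter": [red, brand]}},
        "aggs": {"models": {"terms": {"field": "model"}}},
    },
    # 3. Colors facet for all Gucci shirts.
    {
        "size": 0,
        "query": {"bool": {"filter": [brand]}},
        "aggs": {"colors": {"terms": {"field": "color"}}},
    },
]

payload = msearch_body("shirts", searches)
```

The payload would be sent with the `Content-Type: application/x-ndjson` header; each filter now sits in the query of its own small search, in line with the "filter first" advice quoted above.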

Nanne

In both cases Elasticsearch will end up doing mostly the same thing. If I had to choose, I think I'd use the global aggregation, which might save you some overhead from having to feed two Lucene collectors at once.

jpountz
  • So they end up doing the same functionally, but the post-filter might have some overhead? I don't know much about lucene collectors, could you expand a bit on what you mean there, or hit me up with a link on what you are referencing there? – Nanne Dec 23 '16 at 07:42
  • The important bit in my answer is that it does not really matter. The collector argument is that in the post-filter case, stack traces have one level more due to the use of MultiCollector since everything is done in a single pass, while every global aggregation does another pass over the data (but with a match_all query). – jpountz Dec 28 '16 at 19:56
  • Another way to try to solve this problem would be to send multiple requests, one for each set that you want to analyze. This removes the guarantee that all requests see exactly the same point-in-time view of the index, but on slowly changing data, that is probably acceptable, and that also makes things easier to scale since things like the request cache are more likely to be leveraged. – jpountz Dec 28 '16 at 20:00
  • I skipped that option specifically. It might scale in some ways, but as your filters grow you need to add a roundtrip to your query(-set) for each additional filter. At a certain point the overhead from each of these requests will slow you down, apart from the time it takes elastic to calculate the result. So that probably doesn't scale for the amount of filters? – Nanne Dec 30 '16 at 09:47