
I'm building a product search engine with Elasticsearch in my .NET application, using the NEST client, and there is one thing I'm having trouble with: getting a distinct set of values.

I'm searching for products, of which there are many thousands, but of course I can only return 10 or 20 at a time to the user. Paging works fine for this. But besides this primary result, I want to show my users a list of the brands found within the complete search, so they can filter on them.

I have read that I should use a terms aggregation for this, but I couldn't get anything better than the code below. It still doesn't give me what I want, because it splits values like "20th Century Fox" into 3 separate values.

    var brandResults = client.Search<Product>(s => s
         .Query(query)
         .Aggregations(a => a.Terms("my_terms_agg", t => t.Field(p => p.BrandName).Size(250))
         )
     );

    var agg = brandResults.Aggs.Terms("my_terms_agg");

Is this even the right approach, or should I use something totally different? And how can I get the correct, complete values, not split on spaces? (I guess that is what you get when you ask for a list of 'terms'?)

What I'm looking for is what you would get if you did this in MS SQL:

    SELECT DISTINCT BrandName FROM [Table To Search] WHERE [Where clause without paging]
Bart

2 Answers


You are correct that what you want is a terms aggregation. The problem you're running into is that ES analyzes the "BrandName" field when indexing it, so the aggregation returns the individual tokens instead of the full value. This is the expected default behavior for a string field in ES.

What I recommend is that you change BrandName into a multi-field. This lets you search on the individual parts while running the terms aggregation on the "not_analyzed" version, which keeps the full "20th Century Fox" term.

Here is the documentation from ES.

https://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html

[UPDATE] If you are using ES version 1.4 or newer the syntax for multi-fields is a little different now.

https://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields

Here is a full working sample that illustrates the point in ES 1.4.4. Note that the mapping specifies a "not_analyzed" version of the field.

PUT hilden1

PUT hilden1/type1/_mapping
{
  "properties": {
    "brandName": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

POST hilden1/type1
{
  "brandName": "foo"
}

POST hilden1/type1
{
  "brandName": "bar"
}

POST hilden1/type1
{
  "brandName": "20th Century Fox"
}

POST hilden1/type1
{
  "brandName": "20th Century Fox"
}

POST hilden1/type1
{
  "brandName": "foo bar"
}

GET hilden1/type1/_search
{
  "size": 0, 
  "aggs": {
    "analyzed_field": {
      "terms": {
        "field": "brandName",
        "size": 10
      }
    },
    "non_analyzed_field": {
      "terms": {
        "field": "brandName.raw",
        "size": 10
      }
    }    
  }
}

Results of the last query:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "non_analyzed_field": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "20th Century Fox",
               "doc_count": 2
            },
            {
               "key": "bar",
               "doc_count": 1
            },
            {
               "key": "foo",
               "doc_count": 1
            },
            {
               "key": "foo bar",
               "doc_count": 1
            }
         ]
      },
      "analyzed_field": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "20th",
               "doc_count": 2
            },
            {
               "key": "bar",
               "doc_count": 2
            },
            {
               "key": "century",
               "doc_count": 2
            },
            {
               "key": "foo",
               "doc_count": 2
            },
            {
               "key": "fox",
               "doc_count": 2
            }
         ]
      }
   }
}

Notice that the not-analyzed field keeps "20th Century Fox" and "foo bar" together, whereas the analyzed field breaks them up.
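
To see why the two sets of buckets differ, here is a rough Python simulation of what happens to the five sample documents. It is only a sketch, not how ES is implemented: the function name `terms_agg` is made up for illustration, and the analyzer is approximated as "lowercase, then split on whitespace", which is much simpler than the real standard analyzer but is enough for these values.

```python
from collections import Counter


def terms_agg(values, analyzed):
    """Count documents per term, roughly like a terms aggregation.

    Hypothetical helper for illustration only. When `analyzed` is True,
    each value is lowercased and split on whitespace (a simplified
    stand-in for the standard analyzer). Otherwise the whole value is
    kept as a single term, like a "not_analyzed" field.
    """
    counts = Counter()
    for value in values:
        if analyzed:
            # A set, because doc_count counts each document once per term.
            terms = set(value.lower().split())
        else:
            terms = {value}
        counts.update(terms)
    return dict(counts)


# The same five documents that were indexed in the sample above.
brands = ["foo", "bar", "20th Century Fox", "20th Century Fox", "foo bar"]

analyzed_counts = terms_agg(brands, analyzed=True)
raw_counts = terms_agg(brands, analyzed=False)

print(analyzed_counts)
print(raw_counts)
```

The analyzed counts come out as 2 for each of "20th", "century", "fox", "foo" and "bar", and the raw counts keep "20th Century Fox" (2) and "foo bar" (1) as whole values, which matches the buckets in the query result.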

jhilden
  • I just started with this a week ago, so I'm working on the latest version, 1.4.4. – Bart Feb 23 '15 at 15:52
  • What do you mean by changing BrandName? Updating the database schema, or changing it inline in my query? – Bart Feb 23 '15 at 15:59
  • Change the ES (database) indexer. – jhilden Feb 23 '15 at 16:40
  • I'm having a bit of trouble creating that custom mapping with a multi-field, so I opened another question about that topic. http://stackoverflow.com/questions/28681686/creating-elasticsearch-mapping-with-multifield – Bart Feb 23 '15 at 19:23
  • @Bart, I just updated my answer with a full sample that should make things more clear. – jhilden Feb 24 '15 at 22:32

I had a similar issue. I was displaying search results and wanted to show counts per category and subcategory.

You're right to use aggregations. I also had the issue with strings being tokenised (i.e. "20th Century Fox" being split). This happens because the fields are analysed. In my case, I added the following mapping, which tells ES not to analyse those fields:

    "category": {
      "type": "nested",
      "properties": {
        "CategoryNameAndSlug": {
          "type": "string",
          "index": "not_analyzed"
        },
        "SubCategoryNameAndSlug": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }

As jhilden suggested, if you use this field for more than one purpose (e.g. search and aggregation), you can set it up as a multi-field. That way one version is analysed and used for searching, while the other is left unanalysed and used for aggregation.

Ali