
Is there a way to have ElasticSearch identify exact matches on analyzed fields? Ideally, I would like to lowercase, tokenize, stem and perhaps even phoneticize my docs, then have queries pull "exact" matches out.

What I mean is that if I index "Hamburger Buns" and "Hamburgers", they will be analyzed as ["hamburger","bun"] and ["hamburger"]. If I search for "Hamburger", it will only return the "hamburger" doc, as that's the "exact" match.

I've tried using the keyword tokenizer, but that won't stem the individual tokens. Do I need to do something to ensure that the number of tokens is equal, for example?

I'm familiar with multi-fields and using the "not_analyzed" type, but this is more restrictive than I'm looking for. I'd like exact matching, post-analysis.

abroekhof

3 Answers


Use a shingle token filter together with stemming and whatever else you need. Add a sub-field of type token_count that will count the number of tokens in the field.

At search time, you need to add an additional filter that matches the number of tokens in the index against the number of tokens in the search text. That requires an extra step when you perform the actual search: counting the tokens in the search string. This is necessary because shingles create multiple permutations of tokens, and you need to make sure the count matches the size of your search text.

An attempt for this, just to give you an idea:

{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 10,
          "min_shingle_size": 2,
          "output_unigrams": true
        },
        "filter_stemmer": {
          "type": "porter_stem",
          "language": "_english_"
        }
      },
      "analyzer": {
        "ShingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "snowball",
            "filter_stemmer",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "ShingleAnalyzer",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "ShingleAnalyzer"
            }
          }
        }
      }
    }
  }
}
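
If you save that body as settings.json, you could create a test index (the name test is just an example) on Elasticsearch 1.x, which the filtered query below also targets, with:

curl -XPUT 'localhost:9200/test' -d @settings.json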

And the query:

{
  "query": {
    "filtered": {
      "query": {
        "match_phrase": {
          "text": {
            "query": "HaMbUrGeRs BUN"
          }
        }
      },
      "filter": {
        "term": {
          "text.word_count": "2"
        }
      }
    }
  }
}
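
The extra token-counting step mentioned above can be done with the _analyze API before building the query. A sketch, assuming the index above was created as test: run the search text through the same analyzer the word_count sub-field uses and count the entries in the tokens array of the response:

curl -XGET 'localhost:9200/test/_analyze?analyzer=ShingleAnalyzer&pretty' -d 'HaMbUrGeRs BUN'

The number of tokens returned is the value to plug into the text.word_count term filter.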

The shingle filter is important here because it can create combinations of tokens and, more than that, combinations that keep the order of the tokens. In my opinion, the most difficult requirement to fulfill here is to change the tokens (stemming, lowercasing etc.) and also to assemble the original text back. Unless you define your own "concatenation" filter, I don't think there is any other way than using the shingle filter.

But with shingles there is another issue: they create combinations that are not needed. For a text like "Hamburgers buns in Los Angeles" you end up with a long list of shingles:

          "angeles",
          "buns",
          "buns in",
          "buns in los",
          "buns in los angeles",
          "hamburgers",
          "hamburgers buns",
          "hamburgers buns in",
          "hamburgers buns in los",
          "hamburgers buns in los angeles",
          "in",
          "in los",
          "in los angeles",
          "los",
          "los angeles"

If you are interested only in documents that match exactly (meaning the document above matches only when you search for "hamburgers buns in los angeles", and doesn't match something like "any hamburgers buns in los angeles"), then you need a way to filter that long list of shingles. The way I see it is to use word_count.
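
The list above is simply what the analyzer emits; you can reproduce it (assuming the test index again) with:

curl -XGET 'localhost:9200/test/_analyze?analyzer=ShingleAnalyzer&pretty' -d 'Hamburgers buns in Los Angeles'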

Andrei Stefan
  • What is the purpose of the shingles? – abroekhof May 29 '15 at 18:11
  • Also, is there a reason for using both Porter and Snowball stemmer? – abroekhof May 29 '15 at 18:24
  • No reason. That's just an example I had around and able to change it quickly to show some real code. The important parts are the `shingle` filter, the `token_count` type field and the query itself. The rest of the filters are just example: they can be taken out, other stuff added. – Andrei Stefan May 29 '15 at 19:13
  • Hey thanks. Can you explain the importance of the shingle filter? – abroekhof May 29 '15 at 19:23

You can use multi-fields for that purpose and have a not_analyzed sub-field within your analyzed field (let's call it item in this example). Your mapping would have to look like this:

{
  "yourtype": {
    "properties": {
      "item": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
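
For reference, one way to create an index with that mapping (using the index name yourtypes that the curl calls below assume):

curl -XPUT 'localhost:9200/yourtypes' -d '{
  "mappings": {
    "yourtype": {
      "properties": {
        "item": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}'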

With this kind of mapping, you can check how each of the values Hamburger and Hamburger Buns is "viewed" by the analyzer, with respect to your multi-field item and item.raw:

For Hamburger:

curl -XGET 'localhost:9200/yourtypes/_analyze?field=item&pretty' -d 'Hamburger'
{
  "tokens" : [ {
    "token" : "hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item.raw&pretty' -d 'Hamburger'
{
  "tokens" : [ {
    "token" : "Hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  } ]
}

For Hamburger Buns:

curl -XGET 'localhost:9200/yourtypes/_analyze?field=item&pretty' -d 'Hamburger Buns'
{
  "tokens" : [ {
    "token" : "hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "buns",
    "start_offset" : 11,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item.raw&pretty' -d 'Hamburger Buns'
{
  "tokens" : [ {
    "token" : "Hamburger Buns",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 1
  } ]
}

As you can see, the not_analyzed field is going to be indexed untouched exactly as it was input.

Now, let's index two sample documents to illustrate this:

curl -XPOST localhost:9200/yourtypes/_bulk -d '
{"index": {"_type": "yourtype", "_id": 1}}
{"item": "Hamburger"}
{"index": {"_type": "yourtype", "_id": 2}}
{"item": "Hamburger Buns"}
'
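
Note that newly indexed documents only become searchable after a refresh (which happens every second by default); if you search immediately after indexing, you can force one first:

curl -XPOST 'localhost:9200/yourtypes/_refresh'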

And finally, to answer your question, if you want to have an exact match on Hamburger, you can search within your sub-field item.raw like this (note that the case has to match, too):

curl -XPOST localhost:9200/yourtypes/yourtype/_search -d '{
  "query": {
    "term": {
      "item.raw": "Hamburger"
    }
  }
}'

And you'll get:

{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "yourtypes",
      "_type" : "yourtype",
      "_id" : "1",
      "_score" : 0.30685282,
      "_source":{"item": "Hamburger"}
    } ]
  }
}

UPDATE (see comments/discussion below and question re-edit)

Taking your example from the comments and trying to have HaMbUrGeR BuNs match Hamburger buns, you could simply achieve it with a match query like this:

curl -XPOST localhost:9200/yourtypes/yourtype/_search?pretty -d '{
  "query": {
    "match": {
      "item": {
        "query": "HaMbUrGeR BuNs",
        "operator": "and"
      }
    }
  }
}'

Which, based on the same two indexed documents above, will yield:

{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 0.2712221,
    "hits" : [ {
      "_index" : "yourtypes",
      "_type" : "yourtype",
      "_id" : "2",
      "_score" : 0.2712221,
      "_source":{"item": "Hamburger Buns"}
    } ]
  }
}
Val
    Hey, thanks for the time you put into this answer, unfortunately it doesn't answer my question. I understand that if I search for the exact term in the not_analyzed field, it will return the correct result, but I'm looking for more flexibility. For example, I want it to return "Hamburger Buns" if I search "HaMbUrGeRs BuN", which "not_analyzed" won't do. This is the "exact" result, as they match, after analyzing. Does this make sense? – abroekhof May 29 '15 at 04:32
  • Yes, that makes sense. Sorry if I misunderstood your question. You should, however, update your question and mention that you know about multi fields and that's not what you're looking for. – Val May 29 '15 at 04:33
  • Your updated query will retrieve "Whole Wheat Hamburger Buns", if that were in the index as well, right? Ideally, it would would return only the exact match, similar to a filter. I'm wondering if this is actually possible with ElasticSearch – abroekhof May 29 '15 at 05:05
  • Nope, searching for "Whole Wheat Hamburger Buns" with the same query returns no results (because of the "and" operator) and with the "or" operator it would return both documents. – Val May 29 '15 at 05:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/79155/discussion-between-abroekhof-and-val). – abroekhof May 29 '15 at 18:03

You can keep the analyzer as you intended (lowercase, tokenize, stem, ...) and use a query_string as the main query, with a match_phrase as the boosting query. Something like this:

{
   "bool" : {
      "should" : [
         {
            "query_string" : {
               "default_field" : "your_field",
               "default_operator" : "OR",
               "phrase_slop" : 1,
               "query" : "Hamburger"
            }
         },
         {
            "match_phrase": {
               "your_field": {
                  "query": "Hamburger"
               }
            }
         }
      ]
   }
}

It will match both documents, and the exact match (match_phrase) will rank on top, since that document matches both should clauses (and gets a higher score).

default_operator is set to OR, which helps the query "Hamburger Buns" (matching hamburger OR bun) also match the document "Hamburger". phrase_slop is set to 1 so that terms match at a distance of 1 at most; e.g., searching for Hamburger Buns will not match the document Hamburger Big Buns. You can adjust this depending on your requirements.
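
For completeness, here is the same query wrapped in a full search request; your_field is the answer's placeholder, so substitute your actual field (e.g. item from the previous answer), and the query text here follows the Hamburger Buns example:

curl -XPOST 'localhost:9200/yourtypes/yourtype/_search?pretty' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "default_field": "your_field",
            "default_operator": "OR",
            "phrase_slop": 1,
            "query": "Hamburger Buns"
          }
        },
        {
          "match_phrase": {
            "your_field": {
              "query": "Hamburger Buns"
            }
          }
        }
      ]
    }
  }
}'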

You can refer to the Elasticsearch guide pages Closer is better and Query string for more details.

Duc.Duong