
I have the following analyzer (a slight tweak to the way snowball would be set up):

  string_analyzer: {
    filter: [ "standard", "stop", "snowball" ],
    tokenizer: "lowercase"
  }

Here is the field it is applied to:

  indexes :title, type: 'string', analyzer: 'string_analyzer'

  query do
    match ['title'], search_terms, fuzziness: 0.5, max_expansions: 10, operator: 'and'
  end

I have a record in my index with title foo bar.

If I search for foo bar it appears in the results.

However, if I search for foobar it doesn't.

Can someone explain why, and if possible, how I could get it to work?

Can someone also explain how to get the reverse of this to work, so that if I had a record with title foobar, a user could search for foo bar and see it as a result?

Thanks

user1116573

1 Answer


You can only search for tokens that are in your index, so let's look at what you are indexing. You're currently using the lowercase tokenizer (which tokenizes a string on non-letter characters and lowercases the resulting tokens), then applying the standard token filter (redundant here, because you are not using the standard tokenizer), followed by the stop and snowball filters.

If we create that analyzer:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "string_analyzer" : {
               "filter" : [
                  "standard",
                  "stop",
                  "snowball"
               ],
               "tokenizer" : "lowercase"
            }
         }
      }
   }
}
'

and use the analyze API to test it out:

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer' 

you'll see that "foo bar" produces the terms ["foo","bar"] and "foobar" produces the term ["foobar"]. So indexing "foo bar" and searching for "foobar" currently cannot work.
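If you don't have a cluster handy, the tokenizer's behavior can be approximated in a few lines of Python (an ASCII-only sketch; the real Lucene tokenizer also handles non-ASCII letters, and the stop and snowball filters leave these particular terms unchanged):

```python
# Rough approximation of the lowercase tokenizer:
# split on non-letter characters, then lowercase each token.
import re

def lowercase_tokenize(text):
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(lowercase_tokenize("Foo Bar"))  # -> ['foo', 'bar']
print(lowercase_tokenize("foobar"))   # -> ['foobar']
```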

If you want to be able to search "inside" words, then you need to break the words up into smaller tokens. To do this, we can use the ngram token filter.

So delete the test index:

curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1' 

and specify a new analyzer:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "ngrams" : {
               "max_gram" : 5,
               "min_gram" : 1,
               "type" : "ngram"
            }
         },
         "analyzer" : {
            "ngrams" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "ngrams"
               ],
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'
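For the match query to actually use these tokens, the field also needs to be mapped with the new analyzer (the field and type names below are illustrative, matching the `my_field` used in the query later):

```shell
curl -XPUT 'http://127.0.0.1:9200/test/test/_mapping?pretty=1'  -d '
{
   "test" : {
      "properties" : {
         "my_field" : {
            "type" : "string",
            "analyzer" : "ngrams"
         }
      }
   }
}
'
```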

Now, if we test the analyzer, we get:

"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar"  => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]
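Those token lists can be reproduced with a short Python sketch of the analysis chain (standard tokenizer approximated with a letters-only split; grams emitted grouped by gram length per token, as configured above with min_gram=1 and max_gram=5):

```python
# Sketch of the standard tokenizer + lowercase + ngram filter chain.
# Assumes plain ASCII input; the real analyzers handle much more.
import re

def ngrams(token, min_gram=1, max_gram=5):
    # Emit all grams of one token, grouped by gram length.
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

def analyze(text):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [gram for token in tokens for gram in ngrams(token)]

print(analyze("foo bar"))
print(analyze("foobar"))
```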

So if we index "foo bar" and search for "foobar" with the match query, the query is expanded into a search for any of those tokens, some of which do exist in the index.

Unfortunately, it'll also overlap with a title like "wear the fox hat" (shared tokens such as f, o, a). While "foobar" will appear higher up the list of results because it has more tokens in common, you will still get apparently unrelated results.
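The overlap can be checked with a small sketch (same min_gram=1, max_gram=5 settings as above); it turns out to be the single letters mentioned plus a few short grams:

```python
# Compute which ngram tokens "foobar" shares with "wear the fox hat".
# Pure-Python approximation of the ngrams analyzer defined above.
import re

def ngram_set(text, min_gram=1, max_gram=5):
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t[i:i + n]
            for t in tokens
            for n in range(min_gram, max_gram + 1)
            for i in range(len(t) - n + 1)}

shared = ngram_set("foobar") & ngram_set("wear the fox hat")
print(sorted(shared))  # -> ['a', 'ar', 'f', 'fo', 'o', 'r']
```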

This can be controlled by using the minimum_should_match parameter, eg:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "match" : {
         "my_field" : {
            "minimum_should_match" : "60%",
            "query" : "foobar"
         }
      }
   }
}
'

The exact value for minimum_should_match depends on your data, so experiment with it.

DrTech
  • Thanks DrTech. Is there anything to gain by adding snowball as a filter, or is there no point since the start of the words would be matched against the search term using the ngrams filter anyway? – user1116573 Feb 09 '13 at 12:57
  • And is there a particular reason for removing the stop filter or is it just that some of the words the stop filter could remove might be chunks of the ngrams filter? – user1116573 Feb 09 '13 at 13:11
  • Correct, nothing to gain by using the snowball filter, for the reasons you state, and yes, stop words could interfere with ngrams. I wouldn't be afraid of stopwords. Have a look at my answer on http://stackoverflow.com/a/14661309/819598 – DrTech Feb 09 '13 at 14:17
  • Thanks DrTech. I've got a small issue. I can now search using `foo bar` and it returns those that contain `foo bar` and `foobar`, but I can't seem to do the reverse - search using `foobar` and return both results. I can see that this may be caused by the fact that the record with `foo bar` never has an index item that matches `foobar`. Is there a way? I think I may be trying to cover too many search possibilities. – user1116573 Feb 10 '13 at 21:21
  • Have you mapped `my_field` to use the `ngrams` analyzer? Are you querying exactly as above? If so, then `minimum_should_match` would need to be set to 30% or lower in order for a query of `foobar` to match `foo` or `bar` as well. – DrTech Feb 11 '13 at 10:22
  • DrTech, rather than dropping down the `minimum_should_match` is there a way to use a shingle token filter? – user1116573 Feb 18 '13 at 22:46