
I'm implementing an auto-complete index in ElasticSearch and have run into an issue with sorting/scoring. Say I have the following strings in an index:

apple banana coconut donut
apple banana donut durian
apple donut coconut durian
donut banana coconut durian

When I search for "donut", I want the results to be ordered by the term location like so:

donut banana coconut durian
apple donut coconut durian
apple banana donut durian
apple banana coconut donut
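
To pin down the ordering I'm after, here is a minimal Python sketch that runs outside Elasticsearch. It assumes plain whitespace tokenization and exact token matches, and `rank_by_term_position` is a hypothetical helper name, not anything Elasticsearch provides:

```python
def rank_by_term_position(docs, term):
    """Return the docs that contain `term`, earliest occurrence first."""
    needle = term.lower()

    def first_position(doc):
        # Index of the first token equal to the search term.
        return doc.lower().split().index(needle)

    matches = [d for d in docs if needle in d.lower().split()]
    return sorted(matches, key=first_position)


docs = [
    "apple banana coconut donut",
    "apple banana donut durian",
    "apple donut coconut durian",
    "donut banana coconut durian",
]
ranked = rank_by_term_position(docs, "donut")
for doc in ranked:
    print(doc)
```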

I can't figure out how to make that happen. Term position isn't factored into the default scoring logic, and I can't find a way to get it in there. It seems like a common enough problem that others must have run into it before. Has anyone figured it out yet?

Thanks!

IGx89
  • Maybe this will help http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html – Konstantin V. Salikhov Dec 18 '14 at 07:09
  • It would, I had started going down that route, until I discovered that the script doesn't have access to the tokenized search string :( – IGx89 Dec 18 '14 at 15:34

2 Answers


You can do a custom sorting, like this:

{
  "query": {
    "match": {
      "content": "donut"
    }
  },
  "sort": {
    "_script": {
      "script": "termInfo=_index['content'].get('donut',_OFFSETS);for(pos in termInfo){return _score+pos.startOffset};",
      "type": "number",
      "order": "asc"
    }
  }
}

There, the script returns the original score plus the startOffset of the term's first occurrence. If you need something else, experiment with those values and the original score until the ordering suits your needs.

Or you can do something like this:

{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "content": "donut"
        }
      },
      "script_score": {
        "script": "termInfo=_index['content'].get('donut',_OFFSETS);for(pos in termInfo){return pos.startOffset};"
      },
      "boost_mode": "replace"
    }
  },
  "sort": [
    {
      "_score": "asc"
    }
  ]
}

In either case, the mapping for that specific field needs to include this:

"content": {
  "type": "string",
  "index_options": "offsets"
}

That is, index_options needs to be set to offsets so that the script can read the term offsets. See the Elasticsearch documentation on index_options for more details.
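
Both requests above hard-code the field and the term inside the script string. As a rough sketch of how a client might build the function_score variant for an arbitrary term, the Python below assembles the same request body. `build_offset_sort_query` is a hypothetical helper, the script text is copied from this answer, and the term is assumed to be a single token that matches the indexed token exactly:

```python
import json


def build_offset_sort_query(field, term):
    """Build the function_score request body shown above for one term."""
    # The term is pasted into the script source, so refuse characters
    # that would break out of the single-quoted string literal.
    if "'" in term or "\\" in term:
        raise ValueError("term must not contain quotes or backslashes")
    script = (
        "termInfo=_index['%s'].get('%s',_OFFSETS);"
        "for(pos in termInfo){return pos.startOffset};" % (field, term)
    )
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: term}},
                "script_score": {"script": script},
                # Replace the relevance score with the start offset.
                "boost_mode": "replace",
            }
        },
        # Smallest offset first, i.e. earliest occurrence on top.
        "sort": [{"_score": "asc"}],
    }


body = build_offset_sort_query("content", "donut")
print(json.dumps(body, indent=2))
```

Sending the body to the search endpoint is left out, since that depends on the client library in use.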

Andrei Stefan
  • Thanks Andrei! Great, thorough answer :). That would almost work, except that I'm using stemming, so if I searched for, say, "apple" it wouldn't find the search term in the index (because the indexed term is "appl"). It also would not be ideal for search terms with multiple words, though I could probably work around that. – IGx89 Dec 18 '14 at 15:32
  • In this case - with stemmers - it should be simple: transform your field in a `multi_field`. Do whatever search you want on the stemmed part with one sub-field and the custom scoring above on the non-stemmed part: `"content": { "type": "multi_field", "fields": { "content": { "type": "string", "analyzer": "english" }, "content_no_stemmer": { "type": "string", "index_options": "offsets" } } }` – Andrei Stefan Dec 18 '14 at 16:19
  • And the script would change to `"termInfo=_index['content.content_no_stemmer'].get('apple',_OFFSETS)....` Would this work for you? – Andrei Stefan Dec 18 '14 at 16:20
  • It came close, but not as far as I needed unfortunately (see my answer). The main issue was that it would only work if the search string exactly matched a token, so it wouldn't work with "app", "appl", etc. Not good behavior for a query being used for an auto-complete drop-down :(. Thanks so much for your help with this! – IGx89 Dec 19 '14 at 22:03
  • Hi @AndreiStefan is there any recommendation on how to do this for ES 5.5, where `_index` is not available? I have also posted a question here: https://stackoverflow.com/questions/49304331/how-to-sort-results-based-on-term-position-in-elasticsearch-5-5 – virtualmic Mar 15 '18 at 16:18

Here's the solution I ended up with, based on Andrei's answer and expanded to support multiple search terms and additional scoring based on length of the first word in the result:

First, define the following custom analyzer (it keeps the entire string as a single token and lowercases it):

"raw_analyzer": {
    "type": "custom",
    "filter": [
        "lowercase"
    ],
    "tokenizer": "keyword"
}

Second, define your search field mapping like so (mine's named "name"):

"name": {
    "type": "string",
    "analyzer": "english",
    "fields": {
        "raw": {
            "type": "string",
            "index_analyzer": "raw_analyzer",
            "search_analyzer": "standard"
        }
    }
},
"_nameFirstWordLength": {
    "type": "long"
}

Third, when populating the index, compute _nameFirstWordLength with the following logic (mine's in C#):

_nameFirstWordLength = fi.Name.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries)[0].Length
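
For readers not using C#, a rough Python equivalent of that line might look like the sketch below. `first_word_length` is a hypothetical helper name; like the C# version it splits on single spaces and drops empty entries. It raises an explicit error for a name with no words, where the C# line would throw on the `[0]` index:

```python
def first_word_length(name):
    """Length of the first space-separated word in `name`."""
    # Split on ' ' and drop empty entries, mirroring
    # StringSplitOptions.RemoveEmptyEntries in the C# line above.
    words = [w for w in name.split(" ") if w]
    if not words:
        raise ValueError("name contains no words")
    return len(words[0])


# Example document as it might be sent to the index.
name = "Applebee's Grill"
doc = {"name": name, "_nameFirstWordLength": first_word_length(name)}
print(doc)
```

A value of zero is never produced, which matters because the search script below divides by this field.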

Finally, do your search as follows:

{
   "query":{
      "bool":{
         "must":{
            "match_phrase_prefix":{
               "name":{
                  "query":"apple"
               }
            }
         },
         "should":{
            "function_score":{
               "query":{
                  "query_string":{
                     "fields":[
                        "name.raw"
                     ],
                     "query":"apple*"
                  }
               },
               "script_score":{
                  "script":"100/doc['_nameFirstWordLength'].value"
               },
               "boost_mode":"replace"
            }
         }
      }
   }
}

I'm using match_phrase_prefix so that partial matches are supported, such as "ap" matching "apple". The bool must/should with that second query_string query against name.raw gives a higher score to results whose name starts with one of the search terms (in my code I'm pre-processing the search string, just for that second query, to add a "*" after every word). Finally, wrapping that second query in a function_score script that uses the value of _nameFirstWordLength causes the results up-scored by the second query to be further sorted by the length of their first word (causing Apple to show before Applebee's, for example).
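
The pre-processing step and the effect of the script are easy to sketch outside Elasticsearch. The Python below is only an illustration with hypothetical helper names: `add_prefix_wildcards` appends "*" to every word for the query_string query, and `first_word_bonus` mirrors the `100/doc['_nameFirstWordLength'].value` script, assuming the division is not truncated to an integer. It does not escape query_string special characters, which real input would need:

```python
def add_prefix_wildcards(search):
    """Append '*' to every whitespace-separated word in the search string."""
    return " ".join(word + "*" for word in search.split())


def first_word_bonus(first_word_length):
    """Score given by the script: shorter first words score higher."""
    return 100 / first_word_length


print(add_prefix_wildcards("apple pie"))

# "Apple" (5 letters) outranks "Applebee's" (10 characters).
candidates = {"Apple": 5, "Applebee's": 10}
ordered = sorted(candidates, key=lambda n: first_word_bonus(candidates[n]), reverse=True)
print(ordered)
```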

IGx89