In Elasticsearch, I am trying to get sensible scoring when combining an edge_ngram analyzer with fuzziness. I would like exact matches to score highest and partial matches to score lower. My setup and the resulting scores are below.

settings: {
          number_of_shards: 1,
          analysis: {
             filter: {
                ngram_filter: {
                   type: 'edge_ngram',
                   min_gram: 2,
                   max_gram: 20
                }
             },
             analyzer: {
                ngram_analyzer: {
                   type: 'custom',
                   tokenizer: 'standard',
                   filter: [
                      'lowercase',
                      'ngram_filter'
                   ]
                }
             }
          }
       },
    mappings: [{
          name: 'voter',
          _all: {
                'type': 'string',
                'index_analyzer': 'ngram_analyzer',
                'search_analyzer': 'standard'
             },
             properties: {
                last: {
                   type: 'string',
                   required : true,
                   include_in_all: true,
                   term_vector: 'yes',
                   index_analyzer: 'ngram_analyzer',
                   search_analyzer: 'standard'
                },
                first: {
                   type: 'string',
                   required : true,
                   include_in_all: true,
                   term_vector: 'yes',
                   index_analyzer: 'ngram_analyzer',
                   search_analyzer: 'standard'
                },

             }

       }]
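
To make the indexing side concrete, here is a minimal Python sketch of which terms `ngram_analyzer` would likely store for a name. It is not Elasticsearch code: the helper names `edge_ngrams` and `ngram_analyze` are hypothetical, and the standard tokenizer is approximated by a plain whitespace split, which is an assumption that only holds for simple names.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` with length between min_gram and max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def ngram_analyze(text, min_gram=2, max_gram=20):
    """Rough stand-in for ngram_analyzer: whitespace split, lowercase, edge n-grams.

    Assumption: whitespace splitting approximates the standard tokenizer here.
    """
    terms = []
    for token in text.lower().split():
        terms.extend(edge_ngrams(token, min_gram, max_gram))
    return terms


print(ngram_analyze("Michael"))
# ['mi', 'mic', 'mich', 'micha', 'michae', 'michael']
```

So a single indexed "Michael" contributes six terms, and each truncated query string in the test below coincides exactly with one of them.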

After indexing a document (via POST) with the first name "Michael", I run the query below, varying the query string across "Michael", "Michae", "Micha", "Mich", "Mic", and "Mi".

GET voter/voter/_search
{
 "query": {
    "match": {
      "_all": {
        "query": "Michael",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}

My score results are:

-"Michael": 0.19535106
-"Michae": 0.2242768
-"Micha": 0.24513611
-"Mich": 0.22340237
-"Mic": 0.21408978
-"Mi": 0.15438235

As you can see, the scores are not what I expected. I would like "Michael" to have the highest score and "Mi" to have the lowest.

Any help would be appreciated!

emarel
  • It's not practical to compare scores across different queries (dig into the [Lucene scoring function](https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html) to understand what happens with query normalization). Also, your fuzzy operation is probably confusing things, since each bigram is within two edits of each other bigram. Try removing the fuzziness and repeating your test. – Peter Dixon-Moses Nov 21 '15 at 14:04
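
The point about fuzziness can be illustrated with a small Python sketch. It counts how many of the indexed edge n-grams of "michael" fall within two edits of each query string while sharing the first character, mirroring `fuzziness: 2` and `prefix_length: 1`. This is only an approximation under stated assumptions: it uses plain Levenshtein distance (Elasticsearch's fuzzy matching also treats transpositions as one edit by default, which makes no difference for these prefix strings), the helper names are hypothetical, and it does not reproduce Lucene's scoring.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Edge n-grams (min_gram 2) that would be indexed for "Michael".
INDEXED_TERMS = ["mi", "mic", "mich", "micha", "michae", "michael"]


def fuzzy_matches(query, terms=INDEXED_TERMS, max_edits=2, prefix_length=1):
    """Indexed terms sharing the query's prefix and lying within max_edits of it."""
    query = query.lower()
    return [
        term
        for term in terms
        if term[:prefix_length] == query[:prefix_length]
        and edit_distance(query, term) <= max_edits
    ]


for q in ["Michael", "Michae", "Micha", "Mich", "Mic", "Mi"]:
    print(q, fuzzy_matches(q))
```

Under these assumptions the mid-length queries ("Micha", "Mich") reach five indexed terms each, while "Michael" and "Mi" reach only three. That may be part of why the reported scores peak in the middle rather than at the exact match, though the actual scores also depend on term statistics and query normalization.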

1 Answer

One way to approach this problem is to add a raw version of each text field to your mapping, like this:

                   last: {
                       type: 'string',
                       required : true,
                       include_in_all: true,
                       term_vector: 'yes',
                       index_analyzer: 'ngram_analyzer',
                       search_analyzer: 'standard',
                       "fields": {
                            "raw": { 
                               "type":  "string"  // indexed with the default (standard) analyzer
                              }
                          }
                    },
                    first: {
                       type: 'string',
                       required : true,
                       include_in_all: true,
                       term_vector: 'yes',
                       index_analyzer: 'ngram_analyzer',
                       search_analyzer: 'standard',
                       "fields": {
                            "raw": { 
                               "type":  "string"  // indexed with the default (standard) analyzer
                              }
                          }
                    },

You could also make the raw field match exactly by setting index: 'not_analyzed' on it.

Then you can query like this:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "_all": {
              "query": "Michael",
              "fuzziness": 2,
              "prefix_length": 1
            }
          }
        },
        {
          "match": {
            "last.raw": {
              "query": "Michael",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "first.raw": {
              "query": "Michael",
              "boost": 5
            }
          }
        }
      ]
    }
  }
}

Documents that match more clauses are scored higher. You can adjust the boost values to suit your requirements.
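
For repeating the test across several query strings and boost values, the query body above could be generated rather than edited by hand. The sketch below is plain Python that only builds the request body as a dictionary; `build_name_query` is a hypothetical helper, and sending the body to the cluster is left to whatever client you already use.

```python
import json


def build_name_query(name, exact_boost=5, fuzziness=2, prefix_length=1):
    """Build the bool/should body: one fuzzy clause on _all plus boosted raw-field clauses."""
    fuzzy_clause = {
        "match": {
            "_all": {
                "query": name,
                "fuzziness": fuzziness,
                "prefix_length": prefix_length,
            }
        }
    }
    # One boosted clause per raw sub-field defined in the mapping above.
    exact_clauses = [
        {"match": {field: {"query": name, "boost": exact_boost}}}
        for field in ("last.raw", "first.raw")
    ]
    return {"query": {"bool": {"should": [fuzzy_clause] + exact_clauses}}}


print(json.dumps(build_name_query("Michael"), indent=2))
```

Calling `build_name_query("Mich", exact_boost=10)` yields the same structure with a different query string and boost, which makes it easier to compare how the boost changes the ranking.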

ChintanShah25
  • Unfortunately that did not fully work. While it does give me a higher score for exact matches it does not handle partial matches scoring wise. – emarel Nov 30 '15 at 18:31