
I have an index in Elasticsearch with a field, name, where I store a person's full name (first name and surname). I want to run full-text search over that field, so I indexed it using an analyzer.

My issue is that if I search for "John Rham Rham"

and the index contains "John Rham Rham Luck", that value gets a higher score than "John Rham Rham". Is there any way to make the exact value score higher than a value that contains additional terms?

Thanks in advance!

sayfra85

1 Answer


I worked out a small example (assuming you're running ES 5.x, because scoring differs between versions):

DELETE test
PUT test
{
  "settings": {
    "similarity": {
      "my_bm25": {
        "type": "BM25",
        "b": 0
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "text",
          "similarity": "my_bm25",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

POST test/test/1
{
  "name": "John Rham Rham"
}
POST test/test/2
{
  "name": "John Rham Rham Luck"
}
GET test/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "name": {
            "query": "John Rham Rham",
            "operator": "and"
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "_score / doc['name.length'].getValue()"
          }
        }
      ]
    }
  }
}

This code does the following:

  • Replace the default BM25 similarity with a custom one that sets the b parameter (field-length normalisation) to 0, which switches that normalisation off. You could also change the similarity to 'classic' to go back to TF/IDF, which does not use BM25's b-based normalisation.
  • Create a sub-field on your name field, name.length, which counts the number of tokens in the name.
  • Divide the score by that token count, so a name with fewer tokens ranks higher.
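
To make the effect of those steps concrete, here is a minimal Python sketch, not Elasticsearch code. It assumes the standard BM25 term-frequency formula with the usual defaults k1 = 1.2 and b = 0.75, and it uses a plain whitespace split as a stand-in for the token_count sub-field. The function names and the average length of 3.5 are made up for illustration.

```python
def bm25_tf_norm(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25; b controls field-length normalisation."""
    length_factor = 1.0 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + k1 * length_factor)


def length_adjusted_score(score, name):
    """Mimic the script `_score / doc['name.length']` from the query above.

    Assumption: splitting on whitespace approximates the analyzer's token count.
    """
    return score / len(name.split())


short_name = "John Rham Rham"
long_name = "John Rham Rham Luck"
avg_len = 3.5  # hypothetical average field length for these two documents

# With b = 0 the field length drops out: "rham" (freq 2) scores the same in both.
tf_short_b0 = bm25_tf_norm(2, 3, avg_len, b=0.0)
tf_long_b0 = bm25_tf_norm(2, 4, avg_len, b=0.0)

# With the default b = 0.75 the shorter field gets a slightly higher value.
tf_short_default = bm25_tf_norm(2, 3, avg_len)
tf_long_default = bm25_tf_norm(2, 4, avg_len)

# Dividing an identical base score by the token count favours the shorter name.
adjusted_short = length_adjusted_score(1.0, short_name)
adjusted_long = length_adjusted_score(1.0, long_name)

print(tf_short_b0 == tf_long_b0)
print(tf_short_default > tf_long_default)
print(adjusted_short > adjusted_long)
```

With b = 0 the only length signal left is the explicit division, which is what makes the ranking between the two names predictable.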

This will result in:

"hits": {
    "total": 2,
    "max_score": 0.3596026,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.3596026,
        "_source": {
          "name": "John Rham Rham"
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "2",
        "_score": 0.26970196,
        "_source": {
          "name": "John Rham Rham Luck"
        }
      }
    ]
  }
}
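
As a quick sanity check on those numbers, the two scores are in a 3:4 ratio, which is consistent with both documents getting the same base score and then being divided by 3 and 4 tokens respectively. A small Python sketch of that arithmetic, using only the scores shown above:

```python
# Scores copied from the search response above.
score_exact = 0.3596026    # "John Rham Rham" (3 tokens)
score_longer = 0.26970196  # "John Rham Rham Luck" (4 tokens)

# If both documents had the same base score s, the results would be s/3 and s/4,
# so the longer document's score should be 3/4 of the shorter one's.
ratio = score_longer / score_exact
matches_token_ratio = abs(ratio - 3 / 4) < 1e-6

print(matches_token_ratio)
```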

I'm not sure this is the best way of doing it, but it may point you in the right direction :)

Byron Voorbach