
I am upgrading an Elasticsearch instance from 1.7 to 5.4.3, and noticed that the search results are different between the two systems, even when using the same query.

Elasticsearch 1.7 query

{
  "query": {
    "filtered": {
      "query": {
        "multi_match": {
          "query": "something",
          "fields": [
            "field1",
            "field2",
            "field3"
          ],
          "operator": "and"
        }
      }
    }
  }
}

Elasticsearch 5.4 query

{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "something",
            "fields": [
              "field1",
              "field2",
              "field3"
            ],
            "operator": "and"
          }
        }
      ]
    }
  }
}

The 1st search result in Elasticsearch 1.7 becomes the 71st result in Elasticsearch 5.4. When I compare the same document in 1.7 and 5.4 with the _explain endpoint, I see that the score is calculated differently. The fields also use synonyms, and the search term matches some of them.

Explain for Elasticsearch 1.7

{
    "_index": "...",
    "_type": "...",
    "_id": "...",
    "matched": true,
    "explanation": {
        "value": 9.963562,
        "description": "max of:",
        "details": [
            {
                "value": 3.1413355,
                "description": "sum of:",
                "details": [
                    {
                        "value": 1.0609967,
                        "description": "weight(field1:something in 13) [PerFieldSimilarity], result of:",
                        "details": [
...remainder removed for brevity

Explain for Elasticsearch 5.4

{
    "_index": "...",
    "_type": "...",
    "_id": "...",
    "matched": true,
    "explanation": {
        "value": 7.1987557,
        "description": "sum of:",
        "details": [
            {
                "value": 7.1987557,
                "description": "max of:",
                "details": [
                    {
                        "value": 6.659632,
                        "description": "weight(Synonym(field1:something field1:something2 field1:something3) in 113) [PerFieldSimilarity], result of:",
                        "details": [
...remainder removed for brevity

Questions

  1. Is there an obvious reason why the search results would be so different for the equivalent query in both versions?
  2. In the _explain output for Elasticsearch 1.7, "max of" sits above "sum of" in the explanation tree, while in Elasticsearch 5.4 the order is reversed. Does that indicate part of the problem?

1 Answer


The default "similarity" changed in Elasticsearch 5.0, from TF/IDF to BM25

Technically, the change comes from Lucene, the library underneath Elasticsearch: Elasticsearch 5.0.0 ships with Lucene 6.2, where BM25 is the default similarity.

The Elasticsearch 5.0.0 Release Notes include the following line:

Change default similarity to BM25 #18948 (issue: #18944)

You can read more in the Elasticsearch documentation for the similarity module. A similarity is the model used to score how well a document matches a query, and it matters most for fields with the "text" mapping. Before 5.0.0 the default similarity was TF/IDF (term frequency, inverse document frequency); from 5.0.0 onward it is BM25 (Best Match 25). The same query can therefore rank the same documents in a different order. The change is intended to produce better search results.
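To see why the ranking can shift so much, here is a rough Python sketch of the two per-term scoring formulas. It is a simplification, not the exact Lucene implementation: it ignores query normalization, coord factors, boosts, and the lossy encoding of field-length norms. The function names and the sample numbers are made up for illustration.

```python
import math

def classic_tfidf(freq, doc_freq, doc_count, field_len):
    """Simplified classic (TF/IDF) score for a single term."""
    tf = math.sqrt(freq)                                # grows without bound
    idf = 1.0 + math.log(doc_count / (doc_freq + 1.0))
    norm = 1.0 / math.sqrt(field_len)                   # shorter fields score higher
    return tf * idf * idf * norm

def bm25(freq, doc_freq, doc_count, field_len, avg_field_len, k1=1.2, b=0.75):
    """Simplified BM25 score for a single term (k1 and b are the usual defaults)."""
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_factor = 1.0 - b + b * field_len / avg_field_len
    tf_norm = freq * (k1 + 1.0) / (freq + k1 * length_factor)  # saturates near k1 + 1
    return idf * tf_norm

# Hypothetical corpus: 1000 documents, the term occurs in 10 of them,
# and the field has the average length of 50 terms.
for freq in (1, 2, 5, 20):
    print(freq,
          round(classic_tfidf(freq, 10, 1000, 50), 3),
          round(bm25(freq, 10, 1000, 50, 50.0), 3))
```

With TF/IDF, a document that repeats the term many times keeps gaining score, while with BM25 the term-frequency contribution levels off. A document that ranked first mainly because of repeated terms can therefore drop well down the list.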

If you want to keep the previous behavior, you can set the similarity of a field in the mapping to classic (which refers to TF/IDF). For instance, your YAML mapping file could have:

description:
  type: text
  similarity: classic
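For the three fields in the question, the same setting could be written as a JSON mapping body. The Python sketch below only builds and prints that body. The type name my_type is a placeholder, and the field names are taken from the query in the question, so adjust both to match your real mapping.

```python
import json

# Placeholder type name; field names come from the query in the question.
TYPE_NAME = "my_type"
FIELDS = ["field1", "field2", "field3"]

def classic_mapping(type_name, fields):
    """Build an Elasticsearch 5.x style mapping body that uses the
    classic (TF/IDF) similarity for every listed text field."""
    properties = {
        name: {"type": "text", "similarity": "classic"}
        for name in fields
    }
    return {"mappings": {type_name: {"properties": properties}}}

body = classic_mapping(TYPE_NAME, FIELDS)
print(json.dumps(body, indent=2))
```

The printed JSON is what you would send as the request body when creating the index.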
