I have a two-node Elasticsearch setup where the same search query returns different results on one node than on the other, and I would like to find out why. Details:

  • The same documents (equal content and id) have a different score on the two nodes, resulting in a different sort order.
  • It is reproducible: I can delete the whole index and rebuild it from the database, and the results are still different.
  • The two ES nodes are deployed embedded in a Java EE WAR. On each deployment the index is rebuilt from the database.
  • Initially, when the problem was found, the hits.total results for the same query were different on the two nodes. They are the same after I deleted and rebuilt the index.
  • My workaround for now is to use preference=_local as suggested here.
  • I couldn't find any interesting errors in the logs so far.
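
For reference, the workaround in the list above amounts to adding a `preference` parameter to the search URL so that each node answers from its own shard copies. A minimal sketch of building such a URL follows; the host `http://localhost:9200` and the helper name `build_search_url` are assumptions for illustration, not part of my setup.

```python
from urllib.parse import urlencode


def build_search_url(base_url, index, params):
    """Return a _search URL for `index` with `params` as the query string.

    `params` is a dict because `from` is a reserved word in Python and
    cannot be passed as a keyword argument.
    """
    query = urlencode(params)
    url = f"{base_url}/{index}/_search"
    return f"{url}?{query}" if query else url


# Hypothetical host; the index name "abc" is taken from the cluster state below.
url = build_search_url(
    "http://localhost:9200",
    "abc",
    {"from": 22, "size": 1, "preference": "_local"},
)
print(url)
```

With `preference=_local` the node receiving the request prefers shard copies it holds itself, which is why paging stays consistent as long as the same node is queried.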

_cluster/state:

{
    "cluster_name": "elasticsearch.abc",
    "version": 330,
    "master_node": "HexGKOoHSxqRaMmwduCVIA",
    "blocks": {},
    "nodes": {
        "rUZDrUfMR1-RWcy4t0YQNw": {
            "name": "Owl",
            "transport_address": "inet[/10.123.123.123:9303]",
            "attributes": {}
        },
        "HexGKOoHSxqRaMmwduCVIA": {
            "name": "Bloodlust II",
            "transport_address": "inet[/10.123.123.124:9303]",
            "attributes": {}
        }
    },
    "metadata": {
        "templates": {},
        "indices": {
            "abc": {
                "state": "open",
                "settings": {
                    "index": {
                        "creation_date": "1432297566361",
                        "uuid": "LKx6Ro9CRXq6JZ9a29jWeA",
                        "analysis": {
                            "filter": {
                                "substring": {
                                    "type": "nGram",
                                    "min_gram": "1",
                                    "max_gram": "50"
                                }
                            },
                            "analyzer": {
                                "str_index_analyzer": {
                                    "filter": [
                                        "lowercase",
                                        "substring"
                                    ],
                                    "tokenizer": "keyword"
                                },
                                "str_search_analyzer": {
                                    "filter": [
                                        "lowercase"
                                    ],
                                    "tokenizer": "keyword"
                                }
                            }
                        },
                        "number_of_replicas": "1",
                        "number_of_shards": "5",
                        "version": {
                            "created": "1050099"
                        }
                    }
                },
                "mappings": {
                    "some_mapping": {
                        ...
                    }
                    ...
                },
                "aliases": []
            }
        }
    },
    "routing_table": {
        "indices": {
            "abc": {
                "shards": {
                    "0": [
                        {
                            "state": "STARTED",
                            "primary": true,
                            "node": "HexGKOoHSxqRaMmwduCVIA",
                            "relocating_node": null,
                            "shard": 0,
                            "index": "abc"
                        },
                        {
                            "state": "STARTED",
                            "primary": false,
                            "node": "rUZDrUfMR1-RWcy4t0YQNw",
                            "relocating_node": null,
                            "shard": 0,
                            "index": "abc"
                        }
                    ],
                    "1": [
                        {
                            "state": "STARTED",
                            "primary": false,
                            "node": "HexGKOoHSxqRaMmwduCVIA",
                            "relocating_node": null,
                            "shard": 1,
                            "index": "abc"
                        },
                        {
                            "state": "STARTED",
                            "primary": true,
                            "node": "rUZDrUfMR1-RWcy4t0YQNw",
                            "relocating_node": null,
                            "shard": 1,
                            "index": "abc"
                        }
                    ],
                    "2": [
                        {
                            "state": "STARTED",
                            "primary": true,
                            "node": "HexGKOoHSxqRaMmwduCVIA",
                            "relocating_node": null,
                            "shard": 2,
                            "index": "abc"
                        },
                        {
                            "state": "STARTED",
                            "primary": false,
                            "node": "rUZDrUfMR1-RWcy4t0YQNw",
                            "relocating_node": null,
                            "shard": 2,
                            "index": "abc"
                        }
                    ],
                    "3": [
                        {
                            "state": "STARTED",
                            "primary": false,
                            "node": "HexGKOoHSxqRaMmwduCVIA",
                            "relocating_node": null,
                            "shard": 3,
                            "index": "abc"
                        },
                        {
                            "state": "STARTED",
                            "primary": true,
                            "node": "rUZDrUfMR1-RWcy4t0YQNw",
                            "relocating_node": null,
                            "shard": 3,
                            "index": "abc"
                        }
                    ],
                    "4": [
                        {
                            "state": "STARTED",
                            "primary": true,
                            "node": "HexGKOoHSxqRaMmwduCVIA",
                            "relocating_node": null,
                            "shard": 4,
                            "index": "abc"
                        },
                        {
                            "state": "STARTED",
                            "primary": false,
                            "node": "rUZDrUfMR1-RWcy4t0YQNw",
                            "relocating_node": null,
                            "shard": 4,
                            "index": "abc"
                        }
                    ]
                }
            }
        }
    },
    "routing_nodes": {
        "unassigned": [],
        "nodes": {
            "HexGKOoHSxqRaMmwduCVIA": [
                {
                    "state": "STARTED",
                    "primary": true,
                    "node": "HexGKOoHSxqRaMmwduCVIA",
                    "relocating_node": null,
                    "shard": 4,
                    "index": "abc"
                },
                {
                    "state": "STARTED",
                    "primary": true,
                    "node": "HexGKOoHSxqRaMmwduCVIA",
                    "relocating_node": null,
                    "shard": 0,
                    "index": "abc"
                },
                {
                    "state": "STARTED",
                    "primary": false,
                    "node": "HexGKOoHSxqRaMmwduCVIA",
                    "relocating_node": null,
                    "shard": 3,
                    "index": "abc"
                },
                {
                    "state": "STARTED",
                    "primary": false,
                    "node": "HexGKOoHSxqRaMmwduCVIA",
                    "relocating_node": null,
                    "shard": 1,
                    "index": "abc"
                },
                {
                    "state": "STARTED",
                    "primary": true,
                    "node": "HexGKOoHSxqRaMmwduCVIA",
                    "relocating_node": null,
                    "shard": 2,
                    "index": "abc"
                }
            ],
            "rUZDrUfMR1-RWcy4t0YQNw": [
                {
                    "state": "STARTED",
                    "primary": false,
                    "node": "rUZDrUfMR1-RWcy4t0YQNw",
                    "relocating_node": null,
                    "shard": 4,
                    "index": "abc"
                },
                {
                    "state": "STARTED",
                    "primary": false,
                    "node": "rUZDrUfMR1-RWcy4t0YQNw",
                    "relocating_node": null,
                    "shard": 0,
                    "index": "abc"
                },
                {
                    "state": "STARTED",
                    "primary": true,
                    "node": "rUZDrUfMR1-RWcy4t0YQNw",
                    "relocating_node": null,
                    "shard": 3,
                    "index": "abc"
                },
                {
                    "state": "STARTED",
                    "primary": true,
                    "node": "rUZDrUfMR1-RWcy4t0YQNw",
                    "relocating_node": null,
                    "shard": 1,
                    "index": "abc"
                },
                {
                    "state": "STARTED",
                    "primary": false,
                    "node": "rUZDrUfMR1-RWcy4t0YQNw",
                    "relocating_node": null,
                    "shard": 2,
                    "index": "abc"
                }
            ]
        }
    },
    "allocations": []
}

_cluster/health

{
    "cluster_name": "elasticsearch.abc",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 2,
    "number_of_data_nodes": 2,
    "active_primary_shards": 5,
    "active_shards": 10,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "number_of_pending_tasks": 0
}

_cluster/stats

{
    "timestamp": 1432312770877,
    "cluster_name": "elasticsearch.abc",
    "status": "green",
    "indices": {
        "count": 1,
        "shards": {
            "total": 10,
            "primaries": 5,
            "replication": 1,
            "index": {
                "shards": {
                    "min": 10,
                    "max": 10,
                    "avg": 10
                },
                "primaries": {
                    "min": 5,
                    "max": 5,
                    "avg": 5
                },
                "replication": {
                    "min": 1,
                    "max": 1,
                    "avg": 1
                }
            }
        },
        "docs": {
            "count": 19965,
            "deleted": 4
        },
        "store": {
            "size_in_bytes": 399318082,
            "throttle_time_in_millis": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 60772,
            "evictions": 0
        },
        "filter_cache": {
            "memory_size_in_bytes": 15284,
            "evictions": 0
        },
        "id_cache": {
            "memory_size_in_bytes": 0
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 68,
            "memory_in_bytes": 10079288,
            "index_writer_memory_in_bytes": 0,
            "index_writer_max_memory_in_bytes": 5120000,
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set_memory_in_bytes": 0
        },
        "percolate": {
            "total": 0,
            "time_in_millis": 0,
            "current": 0,
            "memory_size_in_bytes": -1,
            "memory_size": "-1b",
            "queries": 0
        }
    },
    "nodes": {
        "count": {
            "total": 2,
            "master_only": 0,
            "data_only": 0,
            "master_data": 2,
            "client": 0
        },
        "versions": [
            "1.5.0"
        ],
        "os": {
            "available_processors": 8,
            "mem": {
                "total_in_bytes": 0
            },
            "cpu": []
        },
        "process": {
            "cpu": {
                "percent": 0
            },
            "open_file_descriptors": {
                "min": 649,
                "max": 654,
                "avg": 651
            }
        },
        "jvm": {
            "max_uptime_in_millis": 2718272183,
            "versions": [
                {
                    "version": "1.7.0_40",
                    "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
                    "vm_version": "24.0-b56",
                    "vm_vendor": "Oracle Corporation",
                    "count": 2
                }
            ],
            "mem": {
                "heap_used_in_bytes": 2665186528,
                "heap_max_in_bytes": 4060086272
            },
            "threads": 670
        },
        "fs": {
            "total_in_bytes": 631353901056,
            "free_in_bytes": 209591468032,
            "available_in_bytes": 209591468032
        },
        "plugins": []
    }
}

Example query:

/_search?from=22&size=1
{
  "query": {
    "bool": {
      "should": [{
        "match": {
          "address.city": {
            "query": "Bremen",
            "boost": 2
          }
        }
      }],
      "must": [{
        "match": {
          "type": "L"
        }
      }]
    }
  }
}

Response for the first request

{
  "took": 30,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 19543,
    "max_score": 6.407021,
    "hits": [{
      "_index": "abc",
      "_type": "xyz",
      "_id": "ABC123",
      "_score": 5.8341036,
      "_source": {
        ...
      }
    }]
  }
}

Response for the second request

{
  "took": 27,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 19543,
    "max_score": 6.407021,
    "hits": [{
      "_index": "abc",
      "_type": "xyz",
      "_id": "FGH12343",
      "_score": 5.8341036,
      "_source": {
        ...
      }
    }]
  }
}

What could be the cause for this and how can I ensure the same results for different nodes?

Explained query as requested: search/abc/mytype/_search?from=0&size=1&search_type=dfs_query_then_fetch&explain=

{
  "query": {
    "bool": {
      "should": [{
        "match": {
          "address.city": {
            "query": "Karlsruhe",
            "boost": 2
          }
        }
      }]
    }
  }
}

Response for the first request

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 41,
        "max_score": 7.211497,
        "hits": [
            {
                "_shard": 0,
                "_node": "rUZDrUfMR1-RWcy4t0YQNw",
                "_index": "abc",
                "_type": "mytype",
                "_id": "abc123",
                "_score": 7.211497,
                "_source": {...
                },
                "_explanation": {
                    "value": 7.211497,
                    "description": "weight(address.city:karlsruhe^2.0 in 1598) [PerFieldSimilarity], result of:",
                    "details": [
                        {
                            "value": 7.211497,
                            "description": "fieldWeight in 1598, product of:",
                            "details": [
                                {
                                    "value": 1,
                                    "description": "tf(freq=1.0), with freq of:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "termFreq=1.0"
                                        }
                                    ]
                                },
                                {
                                    "value": 7.211497,
                                    "description": "idf(docFreq=46, maxDocs=23427)"
                                },
                                {
                                    "value": 1,
                                    "description": "fieldNorm(doc=1598)"
                                }
                            ]
                        }
                    ]
                }
            }
        ]
    }
}

Response for the second request

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 41,
        "max_score": 7.194322,
        "hits": [
            {
                "_shard": 0,
                "_node": "rUZDrUfMR1-RWcy4t0YQNw",
                "_index": "abc",
                "_type": "mytype",
                "_id": "abc123",
                "_score": 7.194322,
                "_source": {...
                },
                "_explanation": {
                    "value": 7.194322,
                    "description": "weight(address.city:karlsruhe^2.0 in 1598) [PerFieldSimilarity], result of:",
                    "details": [
                        {
                            "value": 7.194322,
                            "description": "fieldWeight in 1598, product of:",
                            "details": [
                                {
                                    "value": 1,
                                    "description": "tf(freq=1.0), with freq of:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "termFreq=1.0"
                                        }
                                    ]
                                },
                                {
                                    "value": 7.194322,
                                    "description": "idf(docFreq=48, maxDocs=24008)"
                                },
                                {
                                    "value": 1,
                                    "description": "fieldNorm(doc=1598)"
                                }
                            ]
                        }
                    ]
                }
            }
        ]
    }
}
s.Daniel

1 Answer

The hits mismatch is most probably caused by the primary shards and their replicas being out of sync. This can happen if a node left the cluster (for whatever reason) while changes to documents (indexing, deleting, updating) kept being made.

The scoring part is a different story, and can be explained by the "Relevancy Scoring" section of this blog post:

Elasticsearch faces an interesting dilemma when you execute a search. Your query needs to find all the relevant documents...but these documents are scattered around any number of shards in your cluster. Each shard is basically a Lucene index, which maintains its own TF and DF statistics. A shard only knows how many times "pineapple" appears within the shard, not the entire cluster.

I would try the "DFS Query Then Fetch" search type when searching, meaning _search?search_type=dfs_query_then_fetch .... That should help with the accuracy of scoring.
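
As a rough illustration of that suggestion, the sketch below assembles the URL and JSON body for a `dfs_query_then_fetch` search using the query from the question. It only builds the request and does not send it; the host `http://localhost:9200` and the function name `build_dfs_search` are assumptions.

```python
import json
from urllib.parse import urlencode


def build_dfs_search(index, city, base_url="http://localhost:9200"):
    """Return (url, json_body) for a search that gathers term statistics
    from all shards before scoring (search_type=dfs_query_then_fetch)."""
    params = urlencode({"search_type": "dfs_query_then_fetch"})
    body = {
        "query": {
            "bool": {
                "should": [
                    {"match": {"address.city": {"query": city, "boost": 2}}}
                ]
            }
        }
    }
    return f"{base_url}/{index}/_search?{params}", json.dumps(body)


url, body = build_dfs_search("abc", "Karlsruhe")
print(url)
print(body)
```

Note that this search type only evens out statistics across shards. It does not help if a primary and its replica hold different statistics for the same shard, which is the situation discussed next.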

Also, the different document count caused by document changes during the node disconnect affects the score calculation even after deleting and rebuilding the index. This might be because changes to documents happened differently on the replica and on the primary shards; more specifically, documents have been deleted. A deleted document is only permanently removed from the index at segment merging time, and segment merging doesn't happen unless certain conditions are met in the underlying Lucene instance. Until then, deleted documents still count toward the docFreq and maxDocs values used to compute idf.
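
To see how much those leftover documents matter, the two idf values from the explain output in the question can be reproduced by hand. The sketch below assumes the index uses Lucene's classic TF-IDF similarity (the default in this Elasticsearch version), whose idf is 1 + ln(maxDocs / (docFreq + 1)); the function name `classic_idf` is made up for this example.

```python
import math


def classic_idf(doc_freq, max_docs):
    """idf as computed by Lucene's classic TF-IDF similarity:
    1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Statistics reported by the first explain output
first = classic_idf(doc_freq=46, max_docs=23427)
# Statistics reported by the second explain output (same shard, other copy)
second = classic_idf(doc_freq=48, max_docs=24008)

print(f"{first:.6f}")   # close to the reported 7.211497
print(f"{second:.6f}")  # close to the reported 7.194322
```

Since tf and fieldNorm are both 1 in the explain output, the idf difference alone accounts for the different scores of the same document.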

A forced merge can be initiated by a POST to /_optimize?max_num_segments=1. Warning: this can take a really long time (depending on the size of the index), requires significant I/O and CPU resources, and should not be run on an index where changes are being made. Documentation: Optimize, Segments Merging
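
A minimal sketch of preparing that POST with Python's standard library is shown here. It only constructs the request object and does not send it; the host `http://localhost:9200`, the restriction to a single index, and the helper name `build_optimize_request` are assumptions for illustration.

```python
from urllib.request import Request


def build_optimize_request(index, base_url="http://localhost:9200"):
    """Build (but do not send) a POST that merges `index` down to one segment.

    Uses the index-scoped form of the _optimize endpoint; the answer above
    uses the cluster-wide form /_optimize.
    """
    url = f"{base_url}/{index}/_optimize?max_num_segments=1"
    return Request(url, method="POST")


req = build_optimize_request("abc")
print(req.get_method(), req.full_url)
# To actually run it, pass `req` to urllib.request.urlopen(), ideally at a
# time when no indexing is happening.
```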

Andrei Stefan
  • Thanks, marked your answer as useful, but some open questions remain. Hits mismatch: the cluster status was always green. Wouldn't ES notice a split brain situation where something as obvious as the total document count is different? dfs: the parameter changes the score/ordering compared to no parameter, but the scores still differ: max_score 5.751141 vs 5.633818. I have now added sorting (by _score, _uid) and get consistent sorting with changing scores, which leaves the question of why the score remains different. – s.Daniel May 26 '15 at 09:39
  • No, ES doesn't "notice" different document count. Also, a split brain is not noticed, as well. You need to make sure, to avoid split brain, that your configuration is correct and prevents this to happen. ES will not, pro-actively, do anything about split-brains, it's the users' job to do this. Do you have `minimum_master_nodes` set in your config file? – Andrei Stefan May 26 '15 at 10:10
  • "I have now added sorting (by _score, _uid) and get consistent sorting with changing scores which leaves the question why the score remains different." this I don't understand what you are saying. – Andrei Stefan May 26 '15 at 10:14
  • I have a two-node setup and can't change this. With a two-node setup I can't avoid split brain afaik. So I guess I should add a monitoring feature. As far as I understand the blog post, with dfs_query_then_fetch the same query fired twice in a row should give consistent results. This is not the case for me. The first response will contain "max_score": 5.6822467, the second will have "max_score": 5.6968164. Both times sending the same query and both times with the same URL params: /_search?from=0&size=50&search_type=dfs_query_then_fetch – s.Daniel May 26 '15 at 11:34
  • You can avoid split-brain if minimum_master_nodes is set to 2. This means, though, that if one node goes down, that's it, your cluster will be stuck (it will need the second node to proceed to master election). The document with the highest score is the same in both query executions? – Andrei Stefan May 26 '15 at 12:02
  • Oh thanks. Yes, the first document is the same, but not the following ones unless I add the id as a sort criterion. Note: the first 42 documents within one response have the same score as they all match the city. This score differs between the two requests though, hence the different max_score. This means that I still need to have preference=_local in the URL to get consistent sorting when paging through the result list. – s.Daniel May 26 '15 at 12:16
  • Hm, interesting. How about running the queries with `?explain` and look at what is different in the computations for both runs? – Andrei Stefan May 26 '15 at 12:24
  • I noticed you updated the post. I see, though, that both runs have the same score: `"_score": 5.8341036`. Am I missing something? – Andrei Stefan May 26 '15 at 13:53
  • Sorry, I tried to keep the question short by editing the existing example, but that didn't make sense after the edit. Apparently the idf is the source: idf(docFreq=46, maxDocs=23427) vs idf(docFreq=48, maxDocs=24008) - I have now run /_optimize?max_num_segments=1 against the index and this fixed the different score issue. Apparently what happened was an unnoticed split brain. Then, upon deleting and rebuilding the index, the deleted documents were kept for the idf calculation, which is kind of confusing. Makes sense? – s.Daniel May 26 '15 at 15:14
  • Exactly ;-). You can avoid the split brain by specifying `minimum_master_nodes: 2` but the downside is your cluster will be functional if one node goes down. – Andrei Stefan May 26 '15 at 20:56
  • Correction to my previous comment: "but the downside is your cluster will **NOT** be functional if one node goes down." – Andrei Stefan May 27 '15 at 07:53
  • Thanks once more for taking the time walking me through this. I added some of our discussion to your answer. – s.Daniel May 27 '15 at 08:21
  • Sorry, my brain has been burnt from reading all your conversation :). I am having exactly the same problem. My one is even more dramatic, I get not only different score but different response for the same query. Is the conclusion simply add minimum_master_nodes: 2 into the config file? or anything extra should I do. – Emil Jul 22 '16 at 13:44
  • Have you tried using "preference" in your search URL to get scores from the same cluster for each search? See my answer to a similar question at https://stackoverflow.com/a/54478881/645042 – Sam Critchley Feb 01 '19 at 11:51