
Ideally, what I am trying to achieve is to assign weights to queries such that query1 constitutes 30% of the final score and query2 constitutes the other 70%, so that to achieve the maximum score a document has to have the highest possible score on both query1 and query2. My study of the documentation did not yield any hints as to how to achieve this, so let's try to solve a simpler problem.

Consider a query of the following form:

{
"query": {
    "bool": {
        "should": [
            {
                "function_score": {
                    "query": {"match_all": {}},
                    "script_score": {
                        "script": "<some_script>"
                    }
                }
            },
            {
                "match": {
                    "message": "this is a test"
                }
            }
        ]
    }
}
}

The script can return an arbitrary number (for example, something like 12392002).

How do I make sure that the result from the script will not dominate the overall score?

Is there any way to normalize it? For example instead of script score return the ratio to max_script_score (achieved by document with highest score)?
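
To make the target concrete, here is a minimal Python sketch of the arithmetic I am after, computed client-side on made-up scores. The function name `combine_normalized` and the sample values are hypothetical and only for illustration; this is not something Elasticsearch does for me, just the formula I would like the final score to follow.

```python
def combine_normalized(scores_q1, scores_q2, w1=0.3, w2=0.7):
    """Blend two per-document score maps after dividing each by its own maximum.

    scores_q1 / scores_q2 map document ids to raw scores (hypothetical inputs).
    A document missing from one map is treated as scoring 0 on that query.
    """
    max1 = max(scores_q1.values(), default=0.0)
    max2 = max(scores_q2.values(), default=0.0)
    combined = {}
    for doc_id in set(scores_q1) | set(scores_q2):
        # Ratio to the best score of each query, so both parts fall in [0, 1]
        # (assuming non-negative raw scores).
        n1 = scores_q1.get(doc_id, 0.0) / max1 if max1 > 0 else 0.0
        n2 = scores_q2.get(doc_id, 0.0) / max2 if max2 > 0 else 0.0
        combined[doc_id] = w1 * n1 + w2 * n2
    return combined


# Made-up raw scores: the script score is huge, the text score is small.
script_scores = {"a": 12392002.0, "b": 6196001.0}
text_scores = {"a": 1.5, "b": 3.0}
result = combine_normalized(script_scores, text_scores)
```

With these numbers, document "b" ends up ahead of "a" even though its raw script score is half as large, because the script part can contribute at most 0.3 to the total.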

JohnnyM
  • possible duplicate of [Combining hits from multiple documents into a single hit in Lucene](http://stackoverflow.com/questions/1393551/combining-hits-from-multiple-documents-into-a-single-hit-in-lucene) – Mark Leighton Fisher Aug 18 '14 at 13:00
  • @MarkLeightonFisher This is a completely different problem, not even remotely related! – JohnnyM Aug 18 '14 at 13:04
  • In the accepted answer, part of the solution is creation of a custom scoring class, which I thought was your question. By the way you could normalize over the whole range of hits -- that is, if you had scores from 49 to 27931, you could multiply each score to reduce them to a standard range (say, 0 to 1). You would need to do that calculation for each set of scores. – Mark Leighton Fisher Aug 19 '14 at 15:37
  • @MarkLeightonFisher The provided solution is not very helpful. First of all, I'd rather stay away from modifying the scoring class. Secondly, the solution is for bare Lucene (and a very old version), not elasticsearch. Thirdly, the problem they were trying to solve was quite different. Lastly, I lack the source code that would help me understand where I can access max_scores for different sub-queries. – JohnnyM Aug 20 '14 at 17:42
  • @JohnnyM : I think I have a connected issue, so I'm interested: have you made any progress? Also, you say you lack source code to get the max_scores. In my problem this is because they are dependent on the script and the values in the database, so knowing them beforehand would be hard. Is this the same in your case? – Nanne Apr 25 '16 at 09:35
  • This is a really good description of the question, as far as I'm concerned: *return the ratio to max_script_score (achieved by document with highest score)?* – Nanne Apr 25 '16 at 10:13
  • @Nanne I doubt this is possible out-of-the-box at the moment. See [this discussion](https://github.com/elastic/elasticsearch/issues/15670) regarding a subject that is very similar to what you seem to be looking for, especially [this comment](https://github.com/elastic/elasticsearch/issues/15670#issuecomment-170012318) from one of the ES core developers. – Andrei Stefan Apr 25 '16 at 13:06
  • @Nanne I settled on a workaround: dry-running the individual queries and then incorporating the max score into the script. It was performant enough in my case, but it surely is not scalable and induces a lot of overhead. – JohnnyM Apr 26 '16 at 13:49

2 Answers


Recently I have been working on a problem like this too. I couldn't find any formal documentation about this issue, but when I investigated the results with the explain API, it seemed that "queryNorm" is not applied to the score coming directly from the "functions" field. This means that you cannot directly normalize the script value.

However, I think I found a slightly tricky solution to this problem. If you combine the function field with a query, as you do (the match_all query), and give that query a boost, normalization is applied to the query. Multiplying the two scores, one from the normalized query and one from the script, then gives a normalized total. To make this clearer, the query would look like:

{
"query": {
    "bool": {
        "should": [
            {
                "function_score": {
                    "query": {"match_all": {"boost":1}},
                    "functions": [
                        {
                            "script_score": {
                                "script": "<some_script>"
                            }
                        }
                    ],
                    "score_mode": "sum",
                    "boost_mode": "multiply"
                }
            },
            {
                "match": {
                    "message": "this is a test"
                }
            }
        ]
    }
}
}

This answer is not a proper solution to your problem, but I think you can play with this query to obtain the required result. My suggestion is to use the explain API, try to understand what it returns, examine the parameters affecting the final score, and play with the script and boost values to get an optimized solution.

By the way, the "rescore query" may help a lot in obtaining that 30%/70% ratio in the final score: Official documentation
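
As a rough illustration of the rescore idea, the Python sketch below only builds the request body as a dict. It assumes the script query acts as the base query and the match query as the rescore query; that split, the helper name `build_rescore_body`, and the window size of 100 are my own assumptions, not something from the question. The `query_weight` and `rescore_query_weight` fields are the weighting knobs of the rescore API.

```python
import json


def build_rescore_body(base_query, rescore_query, window_size=100,
                       query_weight=0.3, rescore_query_weight=0.7):
    """Build a search body that reweights the top hits with a second query.

    The helper itself is hypothetical; it only assembles a dict and does not
    talk to a cluster.
    """
    return {
        "query": base_query,
        "rescore": {
            # Only the top `window_size` hits per shard are rescored.
            "window_size": window_size,
            "query": {
                "rescore_query": rescore_query,
                "query_weight": query_weight,
                "rescore_query_weight": rescore_query_weight,
            },
        },
    }


# The two sub-queries from the question, with the script left as a placeholder.
script_query = {
    "function_score": {
        "query": {"match_all": {}},
        "script_score": {"script": "<some_script>"},
    }
}
text_query = {"match": {"message": "this is a test"}}

body = build_rescore_body(script_query, text_query)
print(json.dumps(body, indent=2))
```

Note that the weights multiply the raw scores, so this gives a true 30/70 split only if both scores are already on comparable scales. With an unbounded script score the script part can still dominate.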

Heval
  • +1 for the lead on rescore. This does not solve my problem, but it can be useful in the future. Neat trick with the query boost, but this does not solve the problem either (and you are well aware of that :)). The problem lies in the relative difference between the scores of script_queries and regular queries: it can be arbitrary (my research with the explain API clearly showed this). Btw, if you are looking to solve a similar problem, please +1 mine :) – JohnnyM Aug 22 '14 at 17:12

As far as I could find, there is no way to get a normalized score out of Elasticsearch. You will have to hack it by making two queries. The first is a pilot query (preferably with size 1, but with all other attributes the same) that fetches the max_score. Then you can send your actual query and use function_score to normalize the score: pass the max_score you got from the pilot query in params to function_score and use it to normalize every score. Refer: This article snippet
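
The two steps can be sketched in Python roughly as follows. The code only builds the two request bodies; sending them and reading `max_score` from the pilot response is left to whatever client you use. The helper names are hypothetical, and the script form (`source` plus `params`, with `params.max_score` inside the script) assumes a reasonably recent Elasticsearch with Painless; older versions use a different script syntax. Here the pilot runs only the script sub-query, which is my assumption about which part needs normalizing.

```python
import copy


def build_pilot_body(script_query):
    """Pilot request: size 1, because only the response's max_score is needed."""
    return {"size": 1, "query": copy.deepcopy(script_query)}


def build_normalized_body(script_query, text_query, max_score):
    """Actual request: divide the script sub-query's score by the pilot max_score."""
    if max_score is None or max_score <= 0:
        raise ValueError("max_score from the pilot query must be positive")
    normalized = {
        "function_score": {
            # The inner query produces the raw, unbounded script score.
            "query": copy.deepcopy(script_query),
            "script_score": {
                "script": {
                    # Assumed Painless syntax; adjust for your version.
                    "source": "_score / params.max_score",
                    "params": {"max_score": max_score},
                }
            },
            # Use the ratio itself as the score instead of multiplying it in.
            "boost_mode": "replace",
        }
    }
    return {"query": {"bool": {"should": [normalized, text_query]}}}


script_query = {
    "function_score": {
        "query": {"match_all": {}},
        "script_score": {"script": "<some_script>"},
    }
}
text_query = {"match": {"message": "this is a test"}}

pilot_body = build_pilot_body(script_query)
# 12392002.0 stands in for the max_score read from the pilot response.
final_body = build_normalized_body(script_query, text_query, 12392002.0)
```

This keeps the script part of the score between 0 and 1 for non-negative scripts, at the cost of one extra round trip per search.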

Genapshot