
I would like to manipulate the score I get when I run a search on Elasticsearch. I already use the boost option, but it does not give me the results I want. After some reading I think the function_score query is the solution to my problem. I understand how it works, but I can’t figure out how to rewrite my current query so that it uses function_score.

"query": {
"filtered": {
    "query": {
        "bool": {
            "should": [{
                "multi_match": {
                    "type": "most_fields",
                    "query": "paus",
                    "operator": "and",
                    "boost": 2,
                    "fields": [
                        "fullname^2",
                        "fullname.folded",
                        "alias^2",
                        "name^2"
                    ],
                    "fuzziness": 0
                }
            }, {
                "multi_match": {
                    "type": "most_fields",
                    "query": "paus",
                    "operator": "and",
                    "boost": 1.9,
                    "fields": [
                        "taggings.tag.name^1.9",
                        "function",
                        "relations.master.name^1.9",
                        "relations.master.first_name^1.9",
                        "relations.master.last_name^1.9",
                        "relations.slave.name^1.9",
                        "relations.slave.first_name^1.9",
                        "relations.slave.last_name^1.9"
                    ],
                    "fuzziness": 0
                }
            }, {
                "multi_match": {
                    "type": "most_fields",
                    "query": "paus",
                    "operator": "and",
                    "fields": [
                        "fullname",
                        "alias",
                        "name"
                    ],
                    "boost": 0.2,
                    "fuzziness": 1
                }
            }, {
                "match": {
                    "extra": {
                        "query": "paus",
                        "fuzziness": 0,
                        "boost": 0.1
                    }
                }
            }]
        }
    },
    "filter": {
        "bool": {
            "must": [
                {
                    "terms": {
                        "type": ["Person"]
                    }
                },
                {
                    "term": {
                        "deleted": false
                    }
                }
            ]
        }
    }
}
}

As you can see, we have four kinds of matches:

  • Boost 2: when there are exact matches on the name
  • Boost 1.9: when there are exact matches on the taggings
  • Boost 0.2: when there are matches on the name but with one character written wrong
  • Boost 0.1: when there are matches in the extra (description) field

The problem I am facing is that documents whose name matches with one character wrong and that have no matching tagging score higher than documents that have the right tagging but whose name does not match at all. It should be the other way around.

Any help would be appreciated :)

sehe

1 Answer


There is no clear answer to this. Your best friend is the Explain API: it will tell you how the score of each document is calculated.
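As a rough sketch of what that looks like in practice: setting "explain": true in the search body makes Elasticsearch attach an `_explanation` tree (nested `value`, `description` and `details` entries) to every hit. The Python below only builds the request body and pretty-prints such a tree. The helper names `with_explain` and `format_explanation`, the shortened query and the sample tree are made up for illustration, so substitute your own query and a real response.

```python
import json


def with_explain(search_body):
    """Return a copy of a search body with per-hit score explanations enabled."""
    body = dict(search_body)
    body["explain"] = True
    return body


def format_explanation(node, depth=0):
    """Flatten an `_explanation` tree into indented 'value  description' lines."""
    lines = ["{}{:.4f}  {}".format("  " * depth, node["value"], node["description"])]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Hypothetical, shortened query; use the full bool query from the question instead.
body = with_explain({"query": {"match": {"extra": {"query": "paus"}}}})
print(json.dumps(body, indent=2))

# Made-up explanation fragment, only to show the shape of the tree.
sample = {
    "value": 0.5,
    "description": "weight(extra:paus), product of:",
    "details": [
        {"value": 2.0, "description": "idf", "details": []},
        {"value": 0.25, "description": "fieldNorm", "details": []},
    ],
}
print("\n".join(format_explanation(sample)))
```

Comparing the printed trees of two hits side by side usually shows quickly which factor (idf, field norm or boost) dominates.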

The most important thing to remember is that boost is just one of the factors considered when the score is calculated. From the docs:

Practically, there is no simple formula for deciding on the “correct” boost value for a particular query clause. It’s a matter of try-it-and-see. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors

It would help you a lot to read through the Theory Behind Relevance Scoring and Lucene's Practical Scoring Function sections of the docs. This is the formula used by Lucene:

score(q,d)  =  
            queryNorm(q)  
          · coord(q,d)    
          · ∑ (           
                tf(t in d)   
              · idf(t)²      
              · t.getBoost() 
              · norm(t,d)    
            ) (t in q) 

One of several reasons you are not getting the expected results could be norm(t,d) and idf(t)². For example, if the extra field of a document contains only "paus me" while the matching field of another document contains something longer, like "my name is some paus something", the shorter field gets a higher field-length norm, norm(t,d). Also, if you have, say, 10000 documents and only one of them has paus in the extra field, the inverse document frequency becomes quite high: it is calculated as idf(t) = 1 + log(numDocs / (docFreq + 1)), here with numDocs = 10000 and docFreq = 1, and this value is then squared. I had exactly this problem in my dataset.
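To get a feeling for the magnitudes, here is a small back-of-the-envelope calculation. It assumes the natural logarithm (which is what Lucene's classic TF/IDF similarity uses) and the textbook field-length norm 1 / sqrt(number of terms). It ignores index-time boosts and the lossy one-byte encoding of norms, so real explain output will differ slightly. The docFreq of 5000 for the "common" case is a made-up comparison value.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """Field-length norm, 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


rare = idf(10000, 1)       # "paus" appears in the extra field of one document
common = idf(10000, 5000)  # hypothetical term found in half of the documents

print("idf rare:   {:.3f}, squared: {:.3f}".format(rare, rare ** 2))
print("idf common: {:.3f}, squared: {:.3f}".format(common, common ** 2))

# "paus me" has 2 terms, "my name is some paus something" has 6.
print("norm, 2 terms: {:.3f}".format(field_length_norm(2)))
print("norm, 6 terms: {:.3f}".format(field_length_norm(6)))
```

With these numbers the squared idf of the rare term is roughly thirty times that of the common one, which can easily outweigh a boost of 0.1 versus 2.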

The fuzzy query scoring higher could also be related to this issue, which is basically a Lucene issue. It is fixed in the latest version.

One approach that might work is to wrap the last two clauses in constant_score and give the first two clauses a boost of, say, 5. That also makes the resulting scores easier to understand.
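A minimal sketch of that idea, written as Python that prints the JSON for the bool part of the query. The helper names are made up, and the fixed scores 0.2 and 0.1 are simply the boosts from the question reused as an assumption. Also note that the key inside constant_score depends on your Elasticsearch version: newer releases expect "filter", while older ones also accepted "query" for wrapping a query, so check the docs for the version you run.

```python
import json


def constant_score(clause, score, inner_key="filter"):
    """Wrap a clause so every document it matches gets the same fixed score."""
    return {"constant_score": {inner_key: clause, "boost": score}}


def multi_match(fields, fuzziness, boost=None):
    """Build one of the question's multi_match clauses for the term 'paus'."""
    clause = {"type": "most_fields", "query": "paus", "operator": "and",
              "fields": fields, "fuzziness": fuzziness}
    if boost is not None:
        clause["boost"] = boost
    return {"multi_match": clause}


should = [
    # Exact matches on the name fields: boost raised to 5.
    multi_match(["fullname^2", "fullname.folded", "alias^2", "name^2"], 0, boost=5),
    # Exact matches on taggings, function and relations: boost raised to 5.
    multi_match(["taggings.tag.name^1.9", "function",
                 "relations.master.name^1.9", "relations.master.first_name^1.9",
                 "relations.master.last_name^1.9", "relations.slave.name^1.9",
                 "relations.slave.first_name^1.9", "relations.slave.last_name^1.9"],
                0, boost=5),
    # Fuzzy name matches and matches on extra: fixed scores (values are assumptions).
    constant_score(multi_match(["fullname", "alias", "name"], 1), 0.2),
    constant_score({"match": {"extra": {"query": "paus", "fuzziness": 0}}}, 0.1),
]

query = {"bool": {"should": should}}
print(json.dumps(query, indent=2))
```

Because the wrapped clauses no longer go through tf, idf and norms, a rare term in extra or a short name field cannot inflate their contribution anymore.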

Try to solve this step by step: start with two clauses and look at the output of the Explain API, then try with three, and finally with all four. Also remove the field-level boosts and try with the query-level boost only. Gradually you will figure it out.
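If it helps, the step-by-step approach could be scripted roughly like this. The helper names are hypothetical, the filter part of the original query is left out for brevity (it does not contribute to the score), and each printed body would be sent to the search endpoint of your index so the explanations can be compared.

```python
import json


def strip_field_boosts(fields):
    """Drop per-field '^boost' suffixes, e.g. 'alias^2' becomes 'alias'."""
    return [name.split("^", 1)[0] for name in fields]


def multi_match(fields, boost, fuzziness):
    """One multi_match clause from the question, with query-level boost only."""
    return {"multi_match": {"type": "most_fields", "query": "paus",
                            "operator": "and", "boost": boost,
                            "fields": strip_field_boosts(fields),
                            "fuzziness": fuzziness}}


def incremental_bodies(clauses, start=2):
    """Yield search bodies using the first `start`, `start + 1`, ... clauses."""
    for count in range(start, len(clauses) + 1):
        yield {"explain": True, "query": {"bool": {"should": clauses[:count]}}}


clauses = [
    multi_match(["fullname^2", "fullname.folded", "alias^2", "name^2"], 2, 0),
    # Field list shortened here; use the full list from the question.
    multi_match(["taggings.tag.name^1.9", "function"], 1.9, 0),
    multi_match(["fullname", "alias", "name"], 0.2, 1),
    {"match": {"extra": {"query": "paus", "fuzziness": 0, "boost": 0.1}}},
]

for body in incremental_bodies(clauses):
    used = len(body["query"]["bool"]["should"])
    print("--- body with first {} clauses ---".format(used))
    print(json.dumps(body, indent=2))
```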

I hope this helps!!

ChintanShah25