I understand that Azure Search ranks and scores using the TF-IDF algorithm. Unfortunately, this is causing us issues with how our results are returned, and thus far, custom scoring profile tweaks are not helping us.

Here's an example of the problem:

For simplicity's sake, let's say that our search documents have only two fields: IndividualName and EntityName. Due to how our source data points are configured, many (not all) of our records/documents have duplicate data in those two fields. This is unavoidable given how our architecture is set up.

Now let's say we do a search on John Anderson. Here is the query string:

searchMode=Any&search=+(%22John Anderson%22~3)&searchFields=IndividualName,EntityName&queryType=Full&$top=50&$count=true

Say we have two documents in the results. The first has "Richard John Anderson" in both the IndividualName and EntityName fields. The second has "John Anderson", but only in the IndividualName field; its EntityName field is blank. The problem is that the Richard John Anderson document gets scored/ranked higher than the John Anderson document. I can only surmise that this is due to the TF-IDF algorithm ranking Richard John Anderson higher because it sees the phrase in the document twice.

As you can imagine, this makes no sense to us. We have to be able to bring back the John Anderson document as the highest ranked since this is the name that was searched on, not Richard John Anderson.

We tried the following query to see if it would help, but it does not:

search=+((IndividualName:"John Anderson" || EntityName:"John Anderson")^10 || (IndividualName:"John Anderson"~3 || EntityName:"John Anderson"~3))&searchFields=IndividualName,EntityName&queryType=Full

This is why the subject line of the thread asks how we can circumvent, or give less weight to, TF-IDF for our documents. To us, exact matches are more important than term frequency. Leaving the EntityName field out of the query is not an option. We have experimented some with custom scoring and field boosting, but thus far, to no avail. Hoping the MS Azure Search team can help out here.

Stpete111
  • You can find out how to prioritize exact matches in [this answer](https://stackoverflow.com/questions/39771652/azure-search-exact-match-as-first-or-single-result) – Eugene Shvets Aug 01 '17 at 21:16
  • @EugeneShvets-MSFT unfortunately the answer you've linked is only partially relevant to my example. As you can see in my second query example, we've already tried using boosting. We get the same results. Richard John Anderson still comes first, above John Anderson. One would have to admit that this simply does not make sense. With John Anderson as the search phrase, one would expect to see the document with "John Anderson" in the IndividualName field as the first result. I need your or one of your colleagues' continued assistance on this, please. – Stpete111 Aug 02 '17 at 00:07
  • @EugeneShvets-MSFT let me know if I need to formally open a support ticket in the portal. – Stpete111 Aug 02 '17 at 00:08
  • @EugeneShvets-MSFT one more comment - Janusz' answer refers specifically to single terms and term-boosting. Our clients have always, and will always, search using phrases (full names). This has to be taken into consideration. Please let me know if you and/or your colleagues can help. This is high priority for us and we can't launch our latest index into Production until we can get this resolved. – Stpete111 Aug 02 '17 at 13:47
  • Looping in @Yahnoosh to this. – Stpete111 Aug 02 '17 at 13:47

1 Answer

In your example, both documents contain the exact phrase you are looking for, "John Anderson". The search engine gives a higher score to the document that matches the phrase more times; that's by design. If you want the phrase to match the entire content of the field, the best way would be to set indexAnalyzer to keyword.

To learn more about how search query processing works in Azure Search, please read: How full text search works in Azure Search
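
As a rough illustration of why the two-field document wins, here is a minimal Python sketch that simply counts phrase matches across fields. It is not Azure Search's actual scoring formula (real TF-IDF scoring also involves inverse document frequency and length normalization); `tokenize`, `phrase_count`, and `toy_score` are hypothetical helper names, and the two documents are the ones described in the question.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a standard analyzer)."""
    return text.lower().split()


def phrase_count(tokens, phrase_tokens):
    """Count how many times phrase_tokens occurs contiguously in tokens."""
    n = len(phrase_tokens)
    return sum(
        1
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase_tokens
    )


def toy_score(doc, phrase):
    """Sum phrase matches over every field; more matching fields means a higher score."""
    phrase_tokens = tokenize(phrase)
    return sum(phrase_count(tokenize(value), phrase_tokens) for value in doc.values())


docs = {
    "richard": {
        "IndividualName": "Richard John Anderson",
        "EntityName": "Richard John Anderson",
    },
    "john": {"IndividualName": "John Anderson", "EntityName": ""},
}

scores = {key: toy_score(doc, "John Anderson") for key, doc in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

In this toy model the "Richard John Anderson" document matches the phrase once per field, so it collects two matches against one for the "John Anderson" document, which mirrors the ordering reported in the question.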

Yahnoosh
  • Hi @Yahnoosh, thanks for the reply. I'm not sure if this will address our issue, but I'll need to understand the keyword analyzer better in order to determine that. Is there any documentation which specifically speaks to the keyword analyzer, so I can understand its function and ensure it won't take away from our current functionality? The document you linked in your answer does not specifically highlight it. At the end of the day, we just need to do whatever it takes, in essence, to put less weight on term frequency, or ignore it altogether. It's irrelevant to us for the type of data we sell. – Stpete111 Aug 03 '17 at 15:26
  • The keyword analyzer emits one token for the entire input stream; it doesn't break it up. The document I shared explains what effect this will have on your ability to query for documents processed that way. Specifically, your query term will need to match the contents of the field exactly. You can find more information about analyzers here: https://learn.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-azure-search and test the behavior of analyzers using the Analyze API: https://learn.microsoft.com/en-us/rest/api/searchservice/test-analyzer – Yahnoosh Aug 03 '17 at 19:19
  • thank you for your continued attention, I appreciate it. Can the keyword analyzer be used in conjunction with proximity? In other words, we need to still be able to bring back records with "John G. Anderson." It seems the keyword analyzer would prevent this from happening, but if proximity of 2 is allowed in conjunction with keyword then we would be ok. Second, if we used the keyword analyzer would Richard John Anderson still be returned in the results? We would need it to be, just ranked lower than John Anderson. – Stpete111 Aug 03 '17 at 19:53
  • Hi @Yahnoosh, I have an additional question: if I were to search on John Anderson in Bing, aren't web pages with John Anderson always going to come back higher on the results list than web pages with Richard John Anderson? Not to oversimplify, but what do I need to do to get Azure Search to act more like Bing? – Stpete111 Aug 03 '17 at 23:35
  • Will they though? Both documents contain the phrase you are looking for. What you need, based on your description, is to rank higher the documents that contain only the phrase you are looking for and no other terms. In that case, create two fields, one processed in the standard way and one processed with the keyword analyzer, and boost matches in those accordingly using scoring profiles or term boosting like in your example. – Yahnoosh Aug 03 '17 at 23:56
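
The two-field approach suggested in the last comment could be sketched roughly as follows. This is a hedged illustration, not a tested configuration: the `IndividualNameExact` and `EntityNameExact` field names are hypothetical, the helper functions are invented for this example, and it assumes that the field-level `analyzer` property accepts the predefined `keyword` analyzer. The boost of 10 and the proximity of 3 are taken from the query in the question.

```python
import json
from urllib.parse import urlencode


def name_fields(base):
    """Return a standard-analyzed field plus a keyword-analyzed twin (hypothetical names)."""
    return [
        {"name": base, "type": "Edm.String", "searchable": True},
        {
            "name": base + "Exact",  # assumption: a second copy of the same value
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "keyword",  # assumption: whole field value becomes one token
        },
    ]


def build_index_fields():
    """Field definitions for the two name fields used in the question."""
    fields = []
    for base in ("IndividualName", "EntityName"):
        fields.extend(name_fields(base))
    return fields


def build_search(phrase, exact_boost=10, slop=3):
    """Boost whole-field matches, then fall back to the proximity phrase match."""
    exact = '(IndividualNameExact:"{0}" || EntityNameExact:"{0}")^{1}'.format(
        phrase, exact_boost
    )
    loose = '(IndividualName:"{0}"~{1} || EntityName:"{0}"~{1})'.format(phrase, slop)
    return exact + " || " + loose


if __name__ == "__main__":
    print(json.dumps(build_index_fields(), indent=2))
    print(urlencode({"search": build_search("John Anderson"), "queryType": "full"}))
```

With this layout, a document whose name field is exactly "John Anderson" would match both clauses, while "Richard John Anderson" or "John G. Anderson" would match only the proximity clause, so they should still be returned but without the exact-match boost. Because the keyword analyzer does not lowercase its input, exact matches are likely to be case-sensitive unless a custom analyzer with a lowercase filter is used instead.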