I'm working on a project that uses Google App Engine's text search API to allow users to search for documents that include a words field. I'm sorting using a MatchScorer, which according to the documentation "assigns a score based on term frequency in a document".

When a user enters a query like "business promo", I convert this into a query string that looks like words:business OR words:promo. I would have expected that this would return documents that contain both the words "business" and "promo" before documents that only contain one of the words (since the documentation says it assigns a score based on term frequency in the document). However, I frequently see results that contain only one of the words before documents that contain both.

I've also tried querying using the RescoringMatchScorer, but see the same problem using this scorer.

I've thought about doing separate queries - ones that AND the search terms and ones that OR the search terms - but this would require many queries if the user enters more than two search terms. For example, if I searched for "advanced business solutions", I'd need queries like this to cover all the bases:

words:advanced AND words:business AND words:solutions
words:advanced AND words:business
words:advanced AND words:solutions
words:business AND words:solutions
words:advanced OR words:business OR words:solutions
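
To show how quickly this grows, here is a rough sketch in plain Python that builds the list above for any set of terms. `tiered_queries` is a hypothetical helper of my own, not part of the Search API; it only produces query strings and does not run them.

```python
from itertools import combinations


def tiered_queries(terms, field="words"):
    """Return query strings from most to least restrictive.

    Hypothetical helper: AND-queries for every subset of two or more terms,
    largest subsets first, followed by a single OR-query over all terms.
    """
    clauses = ["%s:%s" % (field, term) for term in terms]
    queries = []
    # Walk subset sizes from "all terms" down to pairs.
    for size in range(len(clauses), 1, -1):
        for subset in combinations(clauses, size):
            queries.append(" AND ".join(subset))
    # The final catch-all query matches documents containing any term.
    queries.append(" OR ".join(clauses))
    return queries


queries = tiered_queries(["advanced", "business", "solutions"])
for query in queries:
    print(query)
```

With n search terms this produces 2^n - n queries (5 for three terms, 12 for four), which is why I'd like to avoid this approach.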

Does anyone have any hints on how to perform searches that return more relevant results (i.e. more search term matches) before less relevant results?

Greg
  • It's been four years, so I hope you figured this out. What was your solution? I want to sort results in the following priority order: (1) the exact phrase "advanced business solutions", with the three words in the same order; (2) all three words in any order, but appearing consecutively; (3) all three words appearing anywhere in the document, in any order; (4) any of the words appearing in the document. – Saim Abdullah Dec 24 '18 at 06:16

1 Answer

Perhaps it depends on how you interpret the phrase "term frequency". I think you're interpreting it to mean "how many of my search terms appear in the document". But it could also mean "how many times (any of) the search terms appear in each document", and indeed, at least according to some simple experiments I've done, the latter seems to be the actual behavior.

For example, a document that contains the word "business" 20 times and never mentions the word "promo" would be scored higher than a document that contains "business" and "promo" only once each. Does that jibe with the behavior you're seeing?
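
To make the two interpretations concrete, here is a minimal sketch in plain Python. It is not how MatchScorer is implemented (I don't know its internals); `occurrence_count` and `distinct_terms_matched` are hypothetical helpers, and the two documents are made up to mirror the example above. Query terms are assumed to be lowercase.

```python
import re


def occurrence_count(text, terms):
    """Total number of times any query term occurs in the text."""
    words = re.findall(r"\w+", text.lower())
    return sum(words.count(term) for term in terms)


def distinct_terms_matched(text, terms):
    """Number of different query terms that occur at least once."""
    words = set(re.findall(r"\w+", text.lower()))
    return sum(1 for term in terms if term in words)


terms = ["business", "promo"]
docs = {
    "only_business": "business " * 20,   # "business" 20 times, no "promo"
    "both_once": "a business promo",     # each term exactly once
}

# Interpretation observed in my experiments: raw occurrence totals.
by_occurrences = sorted(
    docs, key=lambda d: occurrence_count(docs[d], terms), reverse=True
)

# Interpretation the question expected: distinct terms matched first,
# with raw occurrences only as a tie-breaker.
by_coverage = sorted(
    docs,
    key=lambda d: (
        distinct_terms_matched(docs[d], terms),
        occurrence_count(docs[d], terms),
    ),
    reverse=True,
)

print(by_occurrences)
print(by_coverage)
```

Under the first ordering the 20x "business" document wins; under the second, the document containing both terms comes first.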

Alan
  • Yes, that does jibe with the behaviour I'm seeing. However, shouldn't a document that matches both "promo" and "business" once each have a higher score than a document that only matches "business" once? I'm seeing exactly the same sort_score returned for both of these cases, which seems wrong. – Greg Apr 11 '14 at 19:27