I am trying to figure out the best way to solve this problem. Let's say I have a user who types in a short sentence, and I want to match this sentence (essentially a query) against a small set of documents that are assigned to the user. The issue I am facing is that, unlike a Google search, where a list ranked from highly relevant to barely relevant documents makes sense, I want to choose a subset of these documents automatically, without user intervention. Is there any way to filter out 'low relevance' documents?

In researching this, the answer seems to be no, since the _score from Elasticsearch is not on a consistent scale from query to query (and the documentation discourages relying on min_score). Is there a way to filter out results whose _score is below 90% of the max _score for that given query? (I'm sure this can be done in whatever language processes the results; I was curious whether ES provides this through some built-in functionality.) What about filtering out documents that did not match more than one term, so that documents matching only one term of the query are dropped?
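For reference, the client-side version of the 90% cutoff could look roughly like this. It is only a sketch, assuming the response is the parsed JSON of a search request with the usual hits.hits list; the function name keep_top_hits, the ratio parameter, and the sample data are made up for illustration.

```python
from typing import Any, Dict, List


def keep_top_hits(response: Dict[str, Any], ratio: float = 0.9) -> List[Dict[str, Any]]:
    """Keep only hits whose _score is at least `ratio` times the best _score."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return []
    # Compute the best score from the hits themselves, so the cutoff is
    # always relative to this one query and never compared across queries.
    best = max(hit["_score"] for hit in hits)
    return [hit for hit in hits if hit["_score"] >= ratio * best]


# Hypothetical response, trimmed to the fields the filter needs.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 2.0, "_source": {"title": "Everything you need to know about android phones"}},
            {"_id": "2", "_score": 1.9, "_source": {"title": "Samsung and LG Phones"}},
            {"_id": "3", "_score": 0.7, "_source": {"title": "Love And Everything Else"}},
        ]
    }
}

kept_ids = [hit["_id"] for hit in keep_top_hits(sample_response)]
print(kept_ids)
```

Because the cutoff is relative to the best hit of the same query, it sidesteps the problem of scores not being comparable between queries, but it will still keep a weak best hit when nothing matches well.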

Thanks for any insight!

mrquintopolous
  • Could you give us a couple of sample documents and explain exactly what you want, so that we can understand better? – ChintanShah25 Jan 20 '16 at 22:35
  • This might be a bit contrived, but let's say that the user is typing in "I really love the new android samsung phones", and the documents in question are short titles along these lines: "Everything you need to know about android phones", "Samsung and LG Phones", "Love And Everything Else". The first two would have high relevancy because they match two terms, while the last one is lower (matching one term). So I would be trying to filter out the less relevant ones (I know this is probably an odd task in general, just curious if anyone has thoughts on something like this) – mrquintopolous Jan 22 '16 at 03:13

1 Answer

What about the Minimum Should Match option (the minimum_should_match parameter)?
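As a rough illustration, a match query that requires at least two of the query terms to match could be built like this. This is a sketch under assumptions: the field name title and the helper build_title_query are hypothetical, and the resulting body would be sent as the body of a search request against your own index.

```python
import json
from typing import Any, Dict, Union


def build_title_query(
    text: str,
    minimum_should_match: Union[int, str] = 2,
    field: str = "title",  # hypothetical field name, adjust to your mapping
) -> Dict[str, Any]:
    """Build a match query that only returns documents matching enough terms."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # An integer means "at least this many terms must match";
                    # a string such as "75%" means a share of the terms.
                    "minimum_should_match": minimum_should_match,
                }
            }
        }
    }


body = build_title_query("I really love the new android samsung phones")
print(json.dumps(body, indent=2))
```

With the example titles from the question, and assuming the field's analyzer splits titles into lowercase words, "Love And Everything Else" shares only one term with the query and so should be dropped, while the two phone titles each share two terms and should be kept.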

Igor Belo
  • That seems close, but maybe I am not understanding what an optional clause is. In this case, I would want to say the document needs to match on more than one term, not so much extra clauses in the query. Does that make sense? Still learning the ES lingo – mrquintopolous Jan 20 '16 at 21:10
  • Yes. You can say something blanket like "2 or more terms must match", or you can say "75% of terms must match", or you can construct a composite policy that explicitly names the percentage or number of terms that must match for queries with given numbers of terms. – Peter Dixon-Moses Jan 24 '16 at 02:06
  • @PeterDixon-Moses do you have any pointers or links where I can find more on how to set that up in a query? – mrquintopolous Jan 25 '16 at 15:13
  • See the link in Igor Belo's response above – Peter Dixon-Moses Jan 26 '16 at 03:16
  • Will mark this as 'correct', even though in some sense there is no true answer to this, as every situation will be different. I'll look into other options, but this looks like a good way to trim the fat from search results that only hit a few terms. Thank you (and @PeterDixon-Moses) for your help! – mrquintopolous Jan 26 '16 at 16:40
  • Minimum-match is a quick and dirty heuristic approach. Tuning search relevance is a bit of a dark art that requires in-depth knowledge of your domain, your dataset, and ideally some indication of which results your users find to be relevant to their queries. (Look into Named Entity Recognition (NER) if you truly have a product search case where you want to improve precision.) – Peter Dixon-Moses Jan 26 '16 at 16:49
  • Yeah, I was hoping that I wouldn't need to go down the NLP road, but NER or even just parsing out nouns might help in my use case. But you are right, I think heuristic approaches always require domain knowledge :) – mrquintopolous Jan 26 '16 at 21:00