
I've written a small tool that, given a query, lists the top 1000 resulting documents ordered by their query score. Obviously, not all of them are relevant. As a user, I (and other people) often do the following:

  1. Look at the scores
  2. Scroll down the list until you see a "significant" drop in score.

For example, the scores of the top docs look like this: 4.2, 3.9, 3.9, 3.85, ..., 3.7, 0.3, 0.3, 0.25, ... Often we can simply say that all documents down to the 3.7 score are relevant, and all the remaining ones (starting with 0.3) are not. Given this list of scores, that cut is almost obvious, and luckily in our use case it works fine.

Is there any state-of-the-art algorithm to find such "gaps" / "drops" in a list of numbers (here, scores)?

Some facts about our setting:

  • Top documents are always relevant
  • There is a point from which none (or almost none) of the documents are relevant
  • This point can be identified by the first gap in score
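The heuristic described above can be sketched in a few lines: cut the list at the largest drop between consecutive scores. This is only a sketch of one possible gap-detection rule, not an established algorithm; the `min_keep` parameter is an assumption added here to honor the fact that the top documents are always relevant.

```python
def cut_at_largest_gap(scores, min_keep=1):
    """Return the prefix of `scores` ending just before the largest drop.

    `scores` must be sorted in descending order. `min_keep` (a hypothetical
    parameter, not from the question) guarantees at least that many
    documents are kept, since the top documents are always relevant.
    """
    if len(scores) < 2:
        return list(scores)
    # Gap between each score and the next one down the list.
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    # Index of the largest gap: we keep everything up to and including it.
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    cut = max(cut, min_keep - 1)
    return scores[:cut + 1]

print(cut_at_largest_gap([4.2, 3.9, 3.9, 3.85, 3.7, 0.3, 0.3, 0.25]))
# -> [4.2, 3.9, 3.9, 3.85, 3.7]
```

Note that, as the answer below points out, an absolute gap is fragile when all scores are low or evenly spread; a relative drop (e.g. `scores[i+1] / scores[i]` below some threshold) may be more robust, but both need tuning on real data.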
pedjjj

1 Answer


The naïve solution for your given sequence is to cut after 3.7, but such an algorithm will fail miserably on edge cases.

The problem with the score is that it is always relative, and its numeric value has very limited meaning on its own. In fact, it is not even guaranteed to be the same for the same document on the same query if the corpus has changed.

Also, there is no reason to assume that the first hit, scoring 4.2, is "significant". What if a query returns only weakly relevant hits?

I'm afraid there are no good solutions to this problem, mainly because most people don't consider it a big issue at all. Nobody cares whether Google provides 199 or 200 pages of search results (and virtually nobody clicks that far), so to me paging is the answer to this problem. You would not list all search results, would you?

mindas
  • Hi mindas, I'm asking this question because I want to implement automated summarization. Given a query, I want to select only the top k documents and summarize them. Thus, the user would not see any documents, only their aggregation. The question is how to select only the best documents. I've made that clearer now in my question: I don't want to always cut after 3.7, but rather want to find these gaps. – pedjjj Jan 20 '15 at 14:57
  • The problem is that you don't have an algorithmic definition of where the "best" starts and finishes. And nobody apart from you can help with that. Having this definition in algorithmic/formulaic terms would be a good start. – mindas Jan 20 '15 at 15:03
  • Thanks for your feedback! I've updated the question accordingly. – pedjjj Jan 20 '15 at 21:38