
I have no idea what setDisableCoord is or what value I should set for it. I understand coord in a simple query (e.g. a TF/IDF query), but I don't understand what it means in a Boolean query consisting of several sub-queries.

To give some context, assume the following two scenarios. What value should I set in setDisableCoord for each of them?

  1. In the first scenario I have a query with BooleanClause.Occur.FILTER (the query is used only for filtering) and another one for scoring (BooleanClause.Occur.MUST). In this scenario the first query only checks if the "year" field of the document is in a specified range and the second query uses some algorithm for ranking.
  2. In the second scenario, I have two queries with BooleanClause.Occur.SHOULD whose scores must be combined to obtain the final retrieval score of documents.
Shayan

1 Answer


Summary: For Lucene 6.x with the default BM25 similarity, set disableCoord to true; otherwise leave it at false.

Coord is a scoring feature of BooleanQuery that counteracts one of TF/IDF's shortcomings with over-saturated terms. It is only relevant when there are multiple should clauses. In your first scenario, all sub-queries must match, so no coord factor is involved and the disableCoord parameter has no effect. In the second scenario, with multiple should clauses, a BooleanQuery sums all the sub-scores to determine which documents are better matches. The idea is that a document matching more sub-queries is a better match and thus gets a better score.

Now imagine a query x OR y and a document that has 1000 occurrences of x but none of y. With TF/IDF, the high termFreq(x) makes the sub-score of x very high, and with it the resulting score of x OR y. This can push the document ahead of others that match both clauses, which is not what BooleanQuery was meant to do. This is where coord comes into play.

The coord factor is calculated per document as (number of should clauses matched) / (total number of should clauses in the query). This gives a number in [0..1] that represents how many of the sub-queries match the document. The summed score of all sub-queries is then multiplied by this coord factor. A document matching all should clauses keeps its original summed score, while a document matching only x out of x OR y has its score halved, counteracting the high score that the over-saturated x produced. If you disable coord, this factor is not calculated and the final score is just the sum of the sub-scores.
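To make the arithmetic concrete, here is a minimal plain-Java sketch of the calculation described above. It does not use Lucene at all; the class name, method names, and the sub-scores (8.0 for the over-saturated x, 3.0 each for a document matching both x and y) are made-up values for illustration only.

```java
public class CoordSketch {
    // Fraction of should clauses that matched: overlap / maxOverlap.
    static double coord(int overlap, int maxOverlap) {
        return maxOverlap == 0 ? 1.0 : (double) overlap / maxOverlap;
    }

    // Sum the sub-scores of the matched clauses, then optionally apply coord.
    // matchedSubScores holds one entry per should clause that matched the document.
    static double combine(double[] matchedSubScores, int totalShouldClauses, boolean disableCoord) {
        double sum = 0.0;
        for (double s : matchedSubScores) {
            sum += s;
        }
        if (disableCoord) {
            return sum;
        }
        return sum * coord(matchedSubScores.length, totalShouldClauses);
    }

    public static void main(String[] args) {
        double[] docA = {8.0};      // hypothetical: matches only x, with a very high termFreq
        double[] docB = {3.0, 3.0}; // hypothetical: matches both x and y

        System.out.println("coord disabled: A=" + combine(docA, 2, true)
                + " B=" + combine(docB, 2, true));
        System.out.println("coord enabled:  A=" + combine(docA, 2, false)
                + " B=" + combine(docB, 2, false));
    }
}
```

With coord disabled, document A (8.0) outranks document B (6.0). With coord enabled, A is halved to 4.0 while B keeps its 6.0, so the document matching both clauses wins.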

Coord was designed with TF/IDF in mind, and other similarity formulas might not suffer from over-saturated terms. BM25, which became the default similarity in Lucene 6.0, handles such over-saturated terms much better, controlled by its k1 parameter. Instead of a score that keeps growing with increasing termFreq, BM25 approaches a limit and stops growing. It gives almost no boost to a document with termFreq=1000 over one with termFreq=5, but it does for termFreq=1 over termFreq=0. Britta Weber has given a talk at Buzzwords about this, where she explains the saturation curve.
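The saturation effect can be sketched numerically. The following plain-Java snippet compares the term-frequency component of the two formulas under simplifying assumptions: sqrt(freq) stands in for the classic TF/IDF tf, the BM25 component is freq * (k1 + 1) / (freq + k1) with document-length normalization ignored, and k1 = 1.2 is assumed as the default. IDF and all other factors are left out, so the numbers are only meant to show the shape of the curves, not real Lucene scores.

```java
public class SaturationSketch {
    // Classic TF/IDF term-frequency component (assumed here as sqrt(freq)); grows without bound.
    static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    // BM25 term-frequency component without length normalization; approaches k1 + 1.
    static double bm25Tf(double freq, double k1) {
        return freq * (k1 + 1.0) / (freq + k1);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // assumed default value of k1
        double[] freqs = {0, 1, 5, 1000};
        for (double f : freqs) {
            System.out.printf("termFreq=%.0f classic=%.3f bm25=%.3f%n",
                    f, classicTf(f), bm25Tf(f, k1));
        }
    }
}
```

Under these assumptions the BM25 component can never exceed k1 + 1 = 2.2, so going from termFreq=5 to termFreq=1000 adds very little, while the classic component keeps climbing.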

That means, for BM25, the coord factor is not necessary anymore and might actually lead to counter-intuitive results. It is already removed from Lucene master and will be gone in 7.0.

If you're using Lucene 6.x with the default BM25 similarity, it's a good idea to always disable coord, as BM25 does not suffer from the problem coord worked around. If you're using TF/IDF (regardless of whether you are on 6.x), disabling coord only gives you more predictable results as long as your term frequencies are evenly distributed (which they practically never are), so leaving disableCoord at false (the default) gives results that are intuitively better.

knutwalker
  • Only two points: 1- Could you give me a reference for "The coord factor is calculated per document as number of should clauses matched/total number of should clauses in query," 2- What is called a match in lucene? Consider a clause having a retrieval score of 10^-7. Is it a match? What is the threshold then? – Shayan Sep 12 '16 at 16:58
  • A match refers to whether or not a document exists in the total result set. It is unrelated to the score; it's a 0/1 decision. The score only measures how good a match is, and there is no threshold unless you implement one at application level. The BooleanWeight delegates its coord calculation to the Similarity class, with TF/IDF being the ClassicSimilarity (implemented as overlap / maxOverlap): https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/search/similarities/ClassicSimilarity.html#coord-int-int- – knutwalker Sep 12 '16 at 19:31
  • The default Similarity just returns 1, which is what BM25 is actually using as well: https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/search/similarities/Similarity.html#coord-int-int- – knutwalker Sep 12 '16 at 19:32
  • But "The BooleanWeight delegates its coord calculation to the Similarity class" differs from "The coord factor is calculated per document as number of should clauses matched/total number of should clauses in query". 1- Which one is correct? 2- Will setting setDisableCoord to true cause the similarity used in all clauses (e.g., TFIDF) also not to use coord? – Shayan Sep 14 '16 at 18:22
  • The BooleanWeight delegates coord to the Similarity. Until Lucene 5.x, the default similarity (tf/idf) uses the `matched / total` factor, so both statements are true. Starting with Lucene 6.x, the default similarity has changed (bm25, no more tf/idf) and will now always return 1. Setting `disableCoord` to true will yield the very same result (always 1) on all Lucene versions and similarities. Since it no longer has any effect, the disableCoord parameter will be gone in 7.0. – knutwalker Sep 14 '16 at 18:51