This is my code for running a PhraseQuery with Lucene. It is clear how to get the score of each matching document in the index, but I do not understand how to extract the total number of matches within a single document. The following code performs the query:

        PhraseQuery.Builder builder = new PhraseQuery.Builder();

        builder.add(new Term("contents", "word1"), 0);
        builder.add(new Term("contents", "word2"), 1);
        builder.add(new Term("contents", "word3"), 2);
        builder.setSlop(3);
        PhraseQuery pq = builder.build();

        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);

        TopDocs docs = searcher.search(pq, hitsPerPage);

        ScoreDoc[] hits = docs.scoreDocs;

        System.out.println("Found " + hits.length + " hits.");

        for(int i=0;i<hits.length;++i)
        {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println(docId + " " + hits[i].score);
        }

Is there a method to extract the total number of matches for each document rather than the score?

Alex Torrisi

1 Answer

Approach A. This might not be the best way, but it gives you quick insight. You can use the explain() method of the IndexSearcher class. It takes the query and the document id and returns an Explanation whose text contains a lot of scoring information, including the phrase frequency in that document. Add this line inside your for loop:

        System.out.println(searcher.explain(pq, docId));
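If you want the number itself rather than the whole printout, one option is to pull it out of the explanation text. The following is only a sketch in plain Java: `PhraseFreqParser` is a hypothetical helper (not part of Lucene), and it assumes the text contains a `phraseFreq=<number>` token, which may differ between Lucene versions and similarities.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: extracts the first
// "phraseFreq=<number>" token from the text of an explanation.
final class PhraseFreqParser {

    // Assumption: the value is printed as a plain decimal, optionally with an exponent.
    private static final Pattern PHRASE_FREQ =
            Pattern.compile("phraseFreq=([0-9]+(?:\\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)");

    static OptionalDouble parse(String explanationText) {
        Matcher m = PHRASE_FREQ.matcher(explanationText);
        if (m.find()) {
            return OptionalDouble.of(Double.parseDouble(m.group(1)));
        }
        return OptionalDouble.empty(); // no phraseFreq token in this text
    }

    public static void main(String[] args) {
        // Made-up fragment for illustration; real explain() output is longer.
        String sample = "... phraseFreq=0.33333334 ...";
        System.out.println(parse(sample));
    }
}
```

Inside the loop you would call it as `PhraseFreqParser.parse(searcher.explain(pq, docId).toString())`. Parsing text is fragile, so treat this as a debugging aid rather than a stable API.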

Approach B. A more systematic way is to do what explain() itself does. To compute the phrase frequency, explain() builds a scorer object for the phrase query and calls freq() on it. Most of the methods and classes involved are private or protected, so I am not sure you can call them directly. However, it may help to read the code of explain() in the PhraseWeight class inside PhraseQuery, and the ExactPhraseScorer class. (Some of these classes are not public, so you need to download the Lucene source code to see them.)

vahid
  • I tried Approach A first and got phraseFreq=0.33333334 from the `explain()` output. I was expecting an int as the total number of matches. – Alex Torrisi Apr 04 '18 at 15:21
  • The value is weighted by how far each match is from an exact phrase (within the allowed slop), so it is not an integer count. As an example, let's say your document is "X Y Z" and you set slop=2. Then the `phraseFreq` of the query "X Y" will be 1, but the `phraseFreq` of the query "X Z" will be 1/2=0.5. – vahid Apr 04 '18 at 23:56
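The arithmetic behind these fractional values can be sketched in plain Java. This assumes that Lucene's default similarities weight each sloppy match as 1/(distance + 1), where distance is how far the terms had to move to form the phrase; the class and method names below are hypothetical and only illustrate the formula.

```java
// Hypothetical sketch of the sloppy phrase frequency arithmetic.
// Assumption: each match contributes 1 / (distance + 1), and
// phraseFreq is the sum over all matches in the document.
final class SloppyFreqSketch {

    // Weight of a single match whose terms are `distance` moves away from the exact phrase.
    static float sloppyWeight(int distance) {
        return 1.0f / (distance + 1);
    }

    // Sum of the weights of all matches found in one document.
    static float phraseFreq(int... matchDistances) {
        float sum = 0f;
        for (int d : matchDistances) {
            sum += sloppyWeight(d);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(phraseFreq(0)); // exact match, like "X Y" in "X Y Z": 1.0
        System.out.println(phraseFreq(1)); // one position off, like "X Z" in "X Y Z": 0.5
        System.out.println(phraseFreq(2)); // 0.33333334
    }
}
```

Under that assumption, the 0.33333334 reported above would be consistent with a single match at distance 2. It also means phraseFreq equals the number of matches only when every match is exact, for example with slop set to 0.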