
How can I get the number of hits per document in Lucene (Java)? I have:

 
   IndexReader reader = IndexReader.open(FSDirectory.open(new File(index)), true); // true = read-only
   Searcher searcher = new IndexSearcher(reader);
   String field = "contents";
   QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, field, analyzer);
   Query query = parser.parse("test");
   TopScoreDocCollector collector = TopScoreDocCollector.create(
                    5 * hitsPerPage, false);
   searcher.search(query, collector);
   ScoreDoc[] hits = collector.topDocs().scoreDocs;
   int numTotalHits = collector.getTotalHits();
   System.out.println(numTotalHits + " total matching documents");

   for (int i = start; i < end; i++) {
       int id = hits[i].doc;
       // Comes back null when the field was indexed without term vectors
       TermFreqVector[] tfv = reader.getTermFreqVectors(id);
   }

`tfv` comes back null. Can someone point me to how to get the number of hits in each document from here?

EDIT:

It works if the field is indexed with `TermVector.YES`.

remo

2 Answers


You can write a custom Similarity implementation. It gives you access to the term frequency, that is, the number of times a given term occurs in a given document.

Jarek Rozanski
  • can you direct me to an example? – remo Jan 06 '11 at 22:14
  • Just extend the Similarity class and implement the tf(float frequency) method so that it stores the frequency. Do not forget to attach your similarity to the index searcher: http://lucene.apache.org/java/3_0_3/api/all/org/apache/lucene/search/Searcher.html#setSimilarity%28org.apache.lucene.search.Similarity%29 – Jarek Rozanski Jan 06 '11 at 22:19

This is a duplicate of "Get search word Hits ( number of occurences) per document in Lucene".

As that answer says, you can use the term frequency vector. Jarek Rozanski's answer is faster, but it requires writing a custom Similarity class, which you may not want to do.
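To make the term frequency vector idea concrete, here is a minimal plain-Java sketch of the data such a vector holds for one document and one field: a sorted array of distinct terms with a parallel array of counts. It deliberately does not use Lucene. The class `TermFreqSketch`, its naive tokenizer, and its `freq` method are hypothetical stand-ins that only loosely mirror `getTerms()`, `getTermFrequencies()` and `indexOf(String)` on Lucene's `TermFreqVector`.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a per-document, per-field term frequency vector (not a Lucene class). */
public class TermFreqSketch {
    private final String[] terms;     // distinct terms, in sorted order
    private final int[] frequencies;  // frequencies[i] = number of occurrences of terms[i]

    /** Builds the vector from raw field text with a naive tokenizer (an assumption; a real analyzer differs). */
    public TermFreqSketch(String fieldText) {
        Map<String, Integer> counts = new TreeMap<>();
        // Lowercase, then split on any run of characters that are neither letters nor digits.
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        terms = new String[counts.size()];
        frequencies = new int[counts.size()];
        int i = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            terms[i] = e.getKey();
            frequencies[i] = e.getValue();
            i++;
        }
    }

    /** Returns how often the term occurs in this document, or 0 if it does not occur. */
    public int freq(String term) {
        int idx = Arrays.binarySearch(terms, term.toLowerCase(Locale.ROOT));
        return idx >= 0 ? frequencies[idx] : 0;
    }

    public static void main(String[] args) {
        TermFreqSketch doc = new TermFreqSketch("Test the index. This test counts every test hit.");
        System.out.println("'test' occurs " + doc.freq("test") + " time(s)");
    }
}
```

In Lucene itself the vector is only available for fields indexed with term vectors enabled, which is why the lookup returns null otherwise.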

Xodarap
  • The link suggests getting the term freq vector from the field, which no longer exists in the Lucene 3.0 release. We can get it from the reader object, though it needs a docNumber. Can you tell me what the document number is? – remo Jan 07 '11 at 15:50
  • @sharma: "docNumber" is just the ID of the doc, i.e. `reader.doc()` and `searcher.doc()` do the same thing. So, using your code, the doc id can be found as `hits[i].doc`. – Xodarap Jan 07 '11 at 15:57
  • @Xodarap: When I use the IndexReader object to get the TermFreqVector, it returns null for some reason. In the 3.0 release, is there any object other than IndexReader that you know of for getting the TermFreqVector? – remo Jan 07 '11 at 16:39
  • @sharma: Everything will be based off the reader. Are you sure you're passing in the correct field name? – Xodarap Jan 07 '11 at 17:10
  • I have updated the question with the code I have; if you have any suggestions, please let me know. I am sure that the field name is right. – remo Jan 07 '11 at 17:41
  • @Xodarap I got the number of hits for a single word like "Hello" using TermDocs. Can you suggest a way to get it for a two-word search like "Hello there"? – remo Jan 07 '11 at 18:37
  • @sharma: could you try explicitly passing the field name in? e.g. `reader.getTermFreqVector(id, field)` – Xodarap Jan 07 '11 at 18:42
  • @Xodarap I have tried that. It still returns null, but it works with reader.termDocs(). The problem is that it matches only a single word like 'Hello' and not 'Hello world'. – remo Jan 07 '11 at 18:48
  • @sharma: yes, a term is a term. If you want to find the frequency of multiple terms, it is much harder. You can check out what the [highlighter does](http://www.docjar.org/html/api/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java.html). – Xodarap Jan 07 '11 at 19:22
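
Counting a two-word query such as "Hello there" needs token positions, not just per-term counts. The sketch below is plain Java, not Lucene code, and it is not a description of what the highlighter does internally. `PhraseCountSketch`, its naive tokenizer, and its method names are hypothetical. It records the positions of every term in one document and counts the places where the phrase terms sit at consecutive positions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of phrase counting from token positions (not a Lucene class). */
public class PhraseCountSketch {

    /** Maps each term to the token positions at which it occurs (naive tokenizer, an assumption). */
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        int pos = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading punctuation
            }
            result.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
        }
        return result;
    }

    /** Counts exact occurrences of the phrase, i.e. its terms at consecutive positions. */
    static int countPhrase(String text, String... phrase) {
        if (phrase.length == 0) {
            return 0;
        }
        Map<String, List<Integer>> index = positions(text);
        List<Integer> starts = index.get(phrase[0].toLowerCase(Locale.ROOT));
        if (starts == null) {
            return 0;
        }
        int count = 0;
        for (int start : starts) {
            boolean match = true;
            // Every following phrase term must occur exactly i positions after the start.
            for (int i = 1; i < phrase.length && match; i++) {
                List<Integer> p = index.get(phrase[i].toLowerCase(Locale.ROOT));
                match = p != null && p.contains(start + i);
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "Hello there! Hello world, there is more. Say hello there again.";
        System.out.println(countPhrase(doc, "hello", "there"));
    }
}
```

With a real index the positions would have to come from Lucene rather than from re-tokenizing the text, for example from term vectors stored with positions; that part is left out here.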