Counting the word frequency in a Lucene index

Question

Can someone help me find the frequency of a word across an entire Lucene index?
For example, if doc A contains word B 3 times and doc C contains it 2 times, I'd like a method that returns 5, the total frequency of word B in the whole index.

what kind of an index size are you looking at? depending on that you might want to think of using Hadoop to do so, or a simple index parser to collect the word frequencies in a map. — anirvan, Nov 12 '10 at 18:23
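To make the requested number concrete, here is a minimal sketch in plain Java with no Lucene involved. It assumes a toy in-memory "index" that maps each term to per-document counts; the class and method names (ToyTermIndex, addDocument, totalFrequency) are hypothetical and only illustrate the map-based idea from the comment above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy stand-in for an index: term -> (document id -> occurrences in that document).
// All names here are hypothetical; none of this is Lucene API.
class ToyTermIndex {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Lowercase the text, split it on non-letter/non-digit runs, and record per-document counts.
    void addDocument(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Total occurrences of the term across all documents (0 if it never occurs).
    long totalFrequency(String term) {
        Map<String, Integer> perDoc =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
        long total = 0;
        for (int freq : perDoc.values()) {
            total += freq;
        }
        return total;
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.addDocument("A", "b x b y b"); // "b" occurs 3 times
        index.addDocument("C", "b z b");     // "b" occurs 2 times
        System.out.println(index.totalFrequency("b")); // 3 + 2
    }
}
```

A real Lucene index already stores these per-document frequencies, so the answers below read them from the index instead of rebuilding them.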

score 9 · Answer 1 · edited May 23 '17 at 11:52

This has been asked multiple times:

answered Nov 12 '10 at 19:47

Xodarap

ffriend · Answer 2 · 2010-11-12T19:58:50.863

Assuming you work with Lucene 3.x:

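// Open a reader on the index stored in dir.
IndexReader ir = IndexReader.open(dir);
// Enumerate every document that contains the term in the given field.
TermDocs termDocs = ir.termDocs(new Term("your_field", "your_word"));
int count = 0;
while (termDocs.next()) {
   count += termDocs.freq(); // occurrences of the term in the current document
}
termDocs.close();
ir.close();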

Some comments:

dir is an instance of the Lucene Directory class. Its creation differs for RAM and filesystem indexes; see the Lucene documentation for details.

"your_field" is the field in which to look for the term. If you have multiple fields, you can run the procedure for each of them or, alternatively, when you index your files, you can create a special field (e.g. "_content") and store in it the concatenated values of all other fields.
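The two options for multiple fields can be compared with a small sketch. It is a toy model in plain Java rather than Lucene code: FieldTotalsSketch, indexDocument, total and totalAcross are hypothetical names, and tokenization is assumed to be a simple whitespace split.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model, not Lucene API: field -> (term -> total occurrences).
class FieldTotalsSketch {
    static final String CATCH_ALL = "_content";
    private final Map<String, Map<String, Long>> totals = new HashMap<>();

    // Count each whitespace-separated token under its own field and under the catch-all field.
    void indexDocument(Map<String, String> fields) {
        for (Map.Entry<String, String> e : fields.entrySet()) {
            for (String token : e.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty field value yields one empty token
                }
                bump(e.getKey(), token);
                bump(CATCH_ALL, token);
            }
        }
    }

    private void bump(String field, String term) {
        totals.computeIfAbsent(field, f -> new HashMap<>()).merge(term, 1L, Long::sum);
    }

    // Total occurrences of the term in one field.
    long total(String field, String term) {
        Map<String, Long> perTerm = totals.getOrDefault(field, Collections.emptyMap());
        return perTerm.getOrDefault(term, 0L);
    }

    // Option 1: repeat the per-field count for every field of interest.
    long totalAcross(String term, String... fields) {
        long sum = 0;
        for (String field : fields) {
            sum += total(field, term);
        }
        return sum;
    }
}
```

With this model, totalAcross("lucene", "title", "body") and total("_content", "lucene") give the same number. The catch-all field trades extra index size for a single lookup, and it must be left out of the per-field loop to avoid counting everything twice.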

Sadly, `TermDocs` is not in Lucene 5.3.1, which I use :( — inverted_index, Nov 24 '16 at 19:02

score 1 · Answer 3 · answered Jul 17 '13 at 11:12

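Using Lucene 3.4:

An easy way to get the count, but you need two arrays :-/ Note that read() returns the number of entries it filled in, not the total frequency, so the frequencies still have to be summed:

int[] docs = new int[1000];
int[] freqs = new int[1000];
TermDocs td = indexReader.termDocs(term);
int count = 0;
for (int n = td.read(docs, freqs); n > 0; n = td.read(docs, freqs))
   for (int i = 0; i < n; i++) count += freqs[i]; // add this batch's frequencies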

Beware: read() advances the enumeration, so you cannot go back over the same entries with next(). If read() has already consumed every entry (for example, the term occurs in fewer documents than the arrays hold and the index has a single segment), you are at the end of the enumeration:

int[] docs = new int[1000];
int[] freqs = new int[1000];
TermDocs td = indexReader.termDocs(term);
int count = td.read(docs, freqs); // number of entries read
while (td.next()){ // false once read() has consumed the whole enumeration
}
