Can someone help me find the frequency of a word across an entire Lucene index?
For example, if doc A contains word (B) 3 times and doc C contains it 2 times, I'd like a method that returns 5, the frequency of word (B) in the whole Lucene index.
hippietrail

Ehsan
What index size are you looking at? Depending on that, you might want to consider using Hadoop, or a simple index parser that collects the word frequencies in a map. – anirvan Nov 12 '10 at 18:23
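To make the "collect the word frequencies in a map" idea concrete, here is a minimal plain-Java sketch. It is only an illustration under assumptions: it uses no Lucene classes, the class name CorpusWordCounts is made up, and the tokenization (lower-casing and splitting on anything that is not a letter or digit) is a guess at what an analyzer might do, not what your index actually does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only: no Lucene classes are involved.
// CorpusWordCounts is a hypothetical name chosen for this sketch.
class CorpusWordCounts {

    // Counts every token across all documents.
    // Assumption: tokens are lower-cased and split on runs of non-letter, non-digit characters.
    static Map<String, Integer> countAll(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    // Add 1 to the running total for this token.
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Doc A contains "b" three times, doc C contains it twice.
        List<String> docs = Arrays.asList("b x b y b", "b z b");
        System.out.println(countAll(docs).get("b")); // prints 5
    }
}
```

For a large index the map can get big, which is presumably why the comment mentions Hadoop as the alternative.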
3 Answers
9
This has been asked multiple times:
3
Assuming you work with Lucene 3.x:
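IndexReader ir = IndexReader.open(dir);
TermDocs termDocs = ir.termDocs(new Term("your_field", "your_word"));
int count = 0;
try {
    while (termDocs.next()) {
        count += termDocs.freq(); // occurrences of the term in the current document
    }
} finally {
    termDocs.close();
    ir.close();
}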
Some comments:
dir is an instance of Lucene's Directory class. Its creation differs for RAM and filesystem indexes; see the Lucene documentation for details.
"your_field" is the field in which to look for the term. If you have multiple fields, you can run the procedure for each of them or, alternatively, create a special field when you index your files (e.g. "_content") and store the concatenated values of all other fields there.

ffriend
1
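Using Lucene 3.4:
An easy way to read the per-document counts in bulk, but you need two arrays :-/ Note that read() returns the number of documents read, not the summed frequency. The per-document frequencies end up in freqs, and a single call fills at most docs.length entries.
int[] docs = new int[1000];
int[] freqs = new int[1000];
int n = indexReader.termDocs(term).read(docs, freqs); // number of (doc, freq) pairs read
// freqs[0..n-1] now hold the term's frequency in each of those documents; sum them for the total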
Beware: if you use read(), you may not be able to use next() afterwards. When the arrays are large enough to hold every matching document, read() leaves you at the end of the enumeration:
int[] docs = new int[1000];
int[] freqs = new int[1000];
TermDocs td = indexReader.termDocs(term);
int n = td.read(docs, freqs);
while (td.next()) { // false once read() has consumed every entry
}
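Since one read() call fills at most docs.length entries, a term that occurs in more documents than the buffer holds needs a loop. Below is a rough sketch of that loop. To keep it runnable without the Lucene jar, BulkPostings and ArrayPostings are hypothetical stand-ins I made up; they only mimic the documented contract of TermDocs.read(int[], int[]) in Lucene 3.x (returns the number of entries read, and zero only when the enumeration is exhausted). With a real index you would pass the TermDocs instead.

```java
// Hypothetical stand-in for the one TermDocs method used here (not a Lucene type).
interface BulkPostings {
    // Fills docs/freqs with up to docs.length entries; returns how many were read, 0 when exhausted.
    int read(int[] docs, int[] freqs);
}

// Hypothetical in-memory postings list, only there so the sketch can run.
class ArrayPostings implements BulkPostings {
    private final int[] docIds;
    private final int[] termFreqs;
    private int pos = 0;

    ArrayPostings(int[] docIds, int[] termFreqs) {
        this.docIds = docIds;
        this.termFreqs = termFreqs;
    }

    @Override
    public int read(int[] docs, int[] freqs) {
        int n = Math.min(docs.length, docIds.length - pos);
        System.arraycopy(docIds, pos, docs, 0, n);
        System.arraycopy(termFreqs, pos, freqs, 0, n);
        pos += n;
        return n;
    }
}

class ChunkedFrequencySum {
    // Sums the per-document frequencies, reading bufferSize entries at a time.
    static long totalFrequency(BulkPostings postings, int bufferSize) {
        int[] docs = new int[bufferSize];
        int[] freqs = new int[bufferSize];
        long total = 0;
        int n;
        while ((n = postings.read(docs, freqs)) > 0) {
            for (int i = 0; i < n; i++) {
                total += freqs[i]; // only the first n slots are valid after each call
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Three documents containing the term 3, 2 and 4 times; buffer deliberately smaller than the list.
        BulkPostings postings = new ArrayPostings(new int[] {0, 1, 2}, new int[] {3, 2, 4});
        System.out.println(totalFrequency(postings, 2)); // prints 9
    }
}
```

The important detail is summing only the first n slots after each call, because the rest of the buffer may still hold values from the previous chunk.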

Oliver