How to get Document ids for Document Term Vector in Lucene

Question

I am new to Lucene and don't have much working knowledge of it. I need to extract a document term vector, and I found the following code online in "How to extract Document Term Vector in Lucene 3.5.0":

/**
 * Sums the term frequency vector of each document into a single term frequency map
 * @param indexReader the index reader, the document numbers are specific to this reader
 * @param docNumbers document numbers to retrieve frequency vectors from
 * @param fieldNames field names to retrieve frequency vectors from
 * @param stopWords terms to ignore
 * @return a map of each term to its frequency
 * @throws IOException
 */
private Map<String,Integer> getTermFrequencyMap(IndexReader indexReader, List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
throws IOException {
    Map<String,Integer> totalTfv = new HashMap<String,Integer>(1024);

    for (Integer docNum : docNumbers) {
        for (String fieldName : fieldNames) {
            TermFreqVector tfv = indexReader.getTermFreqVector(docNum, fieldName);
            if (tfv == null) {
                // ignore empty fields
                continue;
            }

            String terms[] = tfv.getTerms();
            int termCount = terms.length;
            int freqs[] = tfv.getTermFrequencies();

            for (int t=0; t < termCount; t++) {
                String term = terms[t];
                int freq = freqs[t];

                // filter out single-letter words and stop words
                if (StringUtils.length(term) < 2 ||
                    stopWords.contains(term)) {
                    continue; // skip this term
                }

                Integer totalFreq = totalTfv.get(term);
                totalFreq = (totalFreq == null) ? freq : freq + totalFreq;
                totalTfv.put(term, totalFreq);
            }
        }
    }

    return totalTfv;
}

I have created the index which resides in the following directory.

String indexDir = "C:\\Lucene\\Output\\";
Directory dir = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(dir);

My problem is that I do not know how to get the doc ids (the List<Integer> docNumbers argument) required by the function above. I have tried a couple of approaches, such as

TermDocs docs = reader.termDocs();

but it did not work.

btw, how come you know what's a term frequency vector and you don't know anything about lucene document ids? — milan, Jan 20 '12 at 21:03
@milan I read that the Lucene does it automatically but the above code was a bit confusing as the "docNumbers" was passed as an argument. — Ahmad, Jan 30 '12 at 10:22

score 2 · Accepted Answer · answered Jan 20 '12 at 21:03

Lucene assigns document ids starting from zero, and maxDoc() returns one more than the largest possible id, so you can simply loop over all ids and skip deleted documents (Lucene only marks them as deleted when you call deleteDocument):

for (int docNum=0; docNum < reader.maxDoc(); docNum++) {
    if (reader.isDeleted(docNum)) {
        continue;
    }
    TermFreqVector tfv = reader.getTermFreqVector(docNum, "fieldName");
    ...
}

For this to work, term vectors have to be enabled for the field during indexing; see Field.TermVector.
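
To get the List<Integer> docNumbers that the question's function expects, the same loop can collect the ids instead of reading the vectors directly. The following is only a sketch: LiveDocIds and collect are hypothetical names, not part of Lucene, and the deletion check is passed in as a predicate so that the example runs without an index. It assumes Java 8 or later for the lambda.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper for illustration; not part of Lucene.
class LiveDocIds {

    /**
     * Collects every document number in [0, maxDoc) that is not marked as deleted.
     *
     * @param maxDoc    exclusive upper bound on document numbers
     * @param isDeleted returns true for document numbers that are marked as deleted
     */
    static List<Integer> collect(int maxDoc, IntPredicate isDeleted) {
        List<Integer> docNumbers = new ArrayList<Integer>();
        for (int docNum = 0; docNum < maxDoc; docNum++) {
            if (isDeleted.test(docNum)) {
                continue; // skip documents marked as deleted
            }
            docNumbers.add(docNum);
        }
        return docNumbers;
    }

    public static void main(String[] args) {
        // Stand-in for an index with five slots in which document 2 was deleted.
        List<Integer> ids = collect(5, docNum -> docNum == 2);
        System.out.println(ids); // prints [0, 1, 3, 4]
    }
}
```

With a Lucene 3.x reader, the call would presumably be collect(reader.maxDoc(), reader::isDeleted), and the resulting list can be passed as docNumbers to getTermFrequencyMap.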

milan


Thank you for the reply. I had already tried it, but it did not work. In fact, the problem was with the following line when creating the index: doc.add(new Field("contents", new FileReader(f))); I replaced it with the following line and it worked: doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES)); – Ahmad Jan 30 '12 at 10:24
