
I have indexed a set of documents using Lucene, and I have stored a term vector for each document's content. I wrote a program that retrieves the term frequency vector for each document, but how can I get the tf-idf vector of each document?

Here is my code that outputs term frequencies in each document:

    Directory dir = FSDirectory.open(new File(indexDir));
    IndexReader ir = IndexReader.open(dir);
    for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
        System.out.println(ir.document(docNum).getField("filename").stringValue());
        TermFreqVector tfv = ir.getTermFreqVector(docNum, "contents");
        if (tfv == null) {
            // ignore empty fields
            continue;
        }
        String[] terms = tfv.getTerms();
        int termCount = terms.length;
        int[] freqs = tfv.getTermFrequencies();

        for (int t = 0; t < termCount; t++) {
            System.out.println(terms[t] + " " + freqs[t]);
        }
    }

Is there any built-in function in Lucene for me to do that?


Nobody helped, so I did it myself:

    Directory dir = FSDirectory.open(new File(indexDir));
    IndexReader ir = IndexReader.open(dir);

    for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
        TermFreqVector tfv = ir.getTermFreqVector(docNum, "title");
        if (tfv == null) {
            // ignore empty fields
            continue;
        }
        String[] tterms = tfv.getTerms();
        int termCount = tterms.length;
        int[] freqs = tfv.getTermFrequencies();

        for (int t = 0; t < termCount; t++) {
            // cast to double, otherwise the int/int division truncates the ratio
            double idf = (double) ir.numDocs() / ir.docFreq(new Term("title", tterms[t]));
            System.out.println(tterms[t] + " " + freqs[t] * Math.log(idf));
        }
    }
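One pitfall worth calling out here: in Java, dividing two `int` values truncates, so the `numDocs / docFreq` ratio must be computed as doubles before taking the logarithm. The core `tf * log(N/df)` computation, extracted into a small self-contained helper (`tfIdf` is a hypothetical name for illustration; no Lucene needed), looks roughly like:

```java
public class TfIdf {
    // tf-idf for a single term. Cast to double before dividing:
    // an int/int ratio like 10/3 silently truncates to 3.
    static double tfIdf(int termFreq, int numDocs, int docFreq) {
        double idf = Math.log((double) numDocs / docFreq);
        return termFreq * idf;
    }

    public static void main(String[] args) {
        System.out.println(tfIdf(3, 10, 2)); // 3 * ln(5)
    }
}
```

A term that appears in every document gets idf = ln(1) = 0, so its tf-idf weight vanishes regardless of its frequency.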

Is there any way to find the ID number of each term?


Nobody helped, so I did it myself again:

    // ArrayList rather than LinkedList: Collections.binarySearch needs random access
    List<String> list = new ArrayList<String>();
    TermEnum terms = null;
    try {
        terms = ir.terms(new Term("title", ""));
        while ("title".equals(terms.term().field())) {
            list.add(terms.term().text());
            if (!terms.next())
                break;
        }
    } finally {
        if (terms != null)
            terms.close();
    }

    for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
        TermFreqVector tfv = ir.getTermFreqVector(docNum, "title");
        if (tfv == null) {
            // ignore empty fields
            continue;
        }
        String[] tterms = tfv.getTerms();
        int termCount = tterms.length;
        int[] freqs = tfv.getTermFrequencies();

        for (int t = 0; t < termCount; t++) {
            // cast to double, otherwise the int/int division truncates the ratio
            double idf = (double) ir.numDocs() / ir.docFreq(new Term("title", tterms[t]));
            System.out.println(Collections.binarySearch(list, tterms[t]) + " " + tterms[t]
                    + " " + freqs[t] * Math.log(idf));
        }
    }
orezvani

1 Answer


You probably won't find a ready-made tf-idf vector. But as you've already done, you can calculate the IDF by hand. It is probably better to let DefaultSimilarity (or whatever Similarity implementation you are using) calculate it for you.
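For reference, Lucene 3.x's DefaultSimilarity computes idf as (roughly) `log(numDocs / (docFreq + 1)) + 1`, which avoids division by zero for unseen terms. A minimal plain-Java sketch of that formula (the standalone `idf` helper below is an illustration, not the Lucene method itself):

```java
public class LuceneIdf {
    // Same shape as Lucene 3.x DefaultSimilarity.idf(docFreq, numDocs):
    // the +1 in the denominator guards against docFreq == 0,
    // and the +1 outside keeps the weight positive.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        System.out.println(idf(9, 100)); // ln(10) + 1
    }
}
```

Note this differs slightly from the bare `log(N/df)` used in the question's code, so the two approaches produce different (though similarly ranked) weights.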

Regarding term IDs, I think you currently can't. At least not until Lucene 4.0; see this.

condit
Felipe Hummel
  • But all the terms are sorted and have a unique number in the index (their order)! How can I access that number for each term? – orezvani Feb 08 '12 at 14:11
  • If your index is static (you don't add more documents after the initial batch index) you could use this sorted order as the term ID: first term, ID 0; second term, ID 1; and so on. If the need for term IDs is external to Lucene, you could also create these IDs outside of it: iterate the terms and store them separately from Lucene with their corresponding (assigned-by-you) IDs. – Felipe Hummel Feb 09 '12 at 14:47
  • Yes, but the problem is that this method is really slow, which becomes a serious problem for over 10^6 documents. Do you have any idea? – orezvani Feb 29 '12 at 19:29
  • For every document it takes more than one second; it's not practical (for over 1 million documents). – orezvani Mar 02 '12 at 20:45
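On the speed issue raised in the comments: `Collections.binarySearch` on a `LinkedList` cannot do random access, so each lookup degrades to a linear scan, and even on an `ArrayList` it costs O(log n) per term. Building a term-to-ID `HashMap` once, during the same pass that enumerates the terms, makes every later lookup O(1). A minimal sketch with a plain `String` array standing in for the enumerated terms (`buildTermIds` is a hypothetical helper):

```java
import java.util.HashMap;
import java.util.Map;

public class TermIds {
    // Assign IDs in the order the terms are enumerated (Lucene yields
    // them sorted, so the ID matches the term's sorted position).
    static Map<String, Integer> buildTermIds(String[] sortedTerms) {
        Map<String, Integer> ids = new HashMap<String, Integer>();
        for (int i = 0; i < sortedTerms.length; i++) {
            ids.put(sortedTerms[i], i);
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> ids =
                buildTermIds(new String[] {"apple", "banana", "cherry"});
        System.out.println(ids.get("banana")); // prints 1
    }
}
```

In the code above, replacing `Collections.binarySearch(list, tterms[t])` with a single `ids.get(tterms[t])` per term should remove most of the per-document cost.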