I have a Lucene index and need to access some statistics, such as a term's collection frequency. The BasicStats class holds this information, but I could not work out whether the class is accessible.

Is it possible to access the BasicStats class in Lucene 4?

Elad Kravi

1 Answer

BasicStats on its own won't do much for you. All it does is hold values; it has none of the logic needed to acquire that information.

BasicStats is intended to be used by the Similarity implementation, which generates all the information to put into it. The methods it uses to do this in the SimilarityBase are protected, but we can make use of the code there. To populate the BasicStats, you'll also need a CollectionStatistics and a TermStatistics, but really all you'll need to get those is the Term you are interested in, and an IndexReader:

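import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.BasicStats;

public static BasicStats getBasicStats(IndexReader indexReader, Term myTerm, float queryBoost) throws IOException {
    String fieldName = myTerm.field();

    // Field-level statistics; pass the term's actual field, not a literal "field".
    CollectionStatistics collectionStats = new CollectionStatistics(
            fieldName,
            indexReader.maxDoc(),
            indexReader.getDocCount(fieldName),
            indexReader.getSumTotalTermFreq(fieldName),
            indexReader.getSumDocFreq(fieldName)
            );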

    TermStatistics termStats = new TermStatistics(
            myTerm.bytes(),
            indexReader.docFreq(myTerm),
            indexReader.totalTermFreq(myTerm)
            );

    BasicStats myStats = new BasicStats(fieldName, queryBoost);
    assert collectionStats.sumTotalTermFreq() == -1 || collectionStats.sumTotalTermFreq() >= termStats.totalTermFreq();
    long numberOfDocuments = collectionStats.maxDoc();

    long docFreq = termStats.docFreq();
    long totalTermFreq = termStats.totalTermFreq();

    if (totalTermFreq == -1) {
        totalTermFreq = docFreq;
    }

    final long numberOfFieldTokens;
    final float avgFieldLength;

    long sumTotalTermFreq = collectionStats.sumTotalTermFreq();

    if (sumTotalTermFreq <= 0) {
        numberOfFieldTokens = docFreq;
        avgFieldLength = 1;
    } else {
        numberOfFieldTokens = sumTotalTermFreq;
        avgFieldLength = (float)numberOfFieldTokens / numberOfDocuments;
    }

    myStats.setNumberOfDocuments(numberOfDocuments);
    myStats.setNumberOfFieldTokens(numberOfFieldTokens);
    myStats.setAvgFieldLength(avgFieldLength);
    myStats.setDocFreq(docFreq);
    myStats.setTotalTermFreq(totalTermFreq);

    return myStats;
}
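
The two fallback rules in the method above (for a missing totalTermFreq and a missing sumTotalTermFreq) are easier to follow with concrete numbers. Here is a minimal sketch in plain Java with no Lucene dependency. The names FieldLengthFallback, FieldStats, effectiveTotalTermFreq and derive are hypothetical, made up for illustration, and the input figures are invented:

```java
// Plain-Java sketch of the fallback rules used above; no Lucene classes involved.
// All names here are hypothetical and exist only for illustration.
final class FieldLengthFallback {

    /** Holder for the two values derived from the collection statistics. */
    static final class FieldStats {
        final long numberOfFieldTokens;
        final float avgFieldLength;

        FieldStats(long numberOfFieldTokens, float avgFieldLength) {
            this.numberOfFieldTokens = numberOfFieldTokens;
            this.avgFieldLength = avgFieldLength;
        }
    }

    /** A totalTermFreq of -1 means "not recorded"; fall back to docFreq. */
    static long effectiveTotalTermFreq(long docFreq, long totalTermFreq) {
        return totalTermFreq == -1 ? docFreq : totalTermFreq;
    }

    /** Same branching as the method above: no usable sum means length 1. */
    static FieldStats derive(long maxDoc, long docFreq, long sumTotalTermFreq) {
        if (sumTotalTermFreq <= 0) {
            return new FieldStats(docFreq, 1f);
        }
        return new FieldStats(sumTotalTermFreq, (float) sumTotalTermFreq / maxDoc);
    }

    public static void main(String[] args) {
        // Frequencies recorded: 5000 tokens over 1000 documents.
        FieldStats recorded = derive(1000, 40, 5000);
        System.out.println(recorded.numberOfFieldTokens + " " + recorded.avgFieldLength); // 5000 5.0

        // Frequencies omitted: the sum is reported as -1.
        FieldStats omitted = derive(1000, 40, -1);
        System.out.println(omitted.numberOfFieldTokens + " " + omitted.avgFieldLength); // 40 1.0
    }
}
```

So when a field is indexed without term frequencies, the resulting stats treat every document as if it had length 1 and count each term once per document that contains it.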

If all you are after is one or two specific figures (that is, a call or two to IndexReader, such as indexReader.totalTermFreq(myTerm) for the collection frequency), this is probably overkill, but there it is.

femtoRgon
  • Thanks! What should the value of the `queryBoost` parameter be, assuming I am interested in the statistics of a single term? – Elad Kravi Jul 10 '15 at 07:29
  • @EladKravi - That's the boost applied to the term in the query. Remember, the purpose of the `BasicStats` class is to track data for the Similarity impl, so it's using these statistics to calculate scores. Hmm, should have made that a float, actually... – femtoRgon Jul 10 '15 at 07:34