Apache Solr topTerms (LukeRequestHandler) not giving correct token count

Question

I am using a Solr 4 trunk build that is a couple of days old.

According to the wiki page for the LukeRequestHandler (first example output), we're supposed to get a count of the tokens for every field, or for any specified field. I want to use this to count how many times each word appears across all my documents. For example, if the word 'is' appears in two MS Word documents, twice in the first and three times in the second, I would expect output like this:

<lst name="text">
  <str name="type">text</str>
  <str name="schema">IT-M---------</str>
  <str name="index">(unstored field)</str>
  <int name="docs">2</int>
  <int name="distinct">42</int>
  <lst name="topTerms">
    <int name="is">5</int>

That's because the word "is" occurs a total of five times across the two documents. However, what I actually get is <int name="is">2</int>. I presume this is because the count is the number of distinct documents that contain the word, which is two.

But again, according to the Wiki, we're supposed to get a total count, summed across all the documents, which is what I actually want.

How can I get the total number of times each word appears across all indexed documents?

Reference:

http://wiki.apache.org/solr/LukeRequestHandler

score 1 · Accepted Answer · answered Nov 12 '11 at 17:22

Doc frequencies returned by TermsComponent are the number of unique documents that match the term, including any documents that have been marked for deletion but not yet removed from the index.

TermVectorComponent (TVC) returns the per-document information that is stored when termVectors="true" is set on a field in the schema.
For each document in the response, TVC can return the term vector, the term frequency, the inverse document frequency, and position and offset information.

tv.tf - Return the term frequency of each term within each document.

<lst name="termVectors">
  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cable">
        <int name="tf">1</int>
      </lst>
      <lst name="earbud">
        <int name="tf">5</int>
      </lst>
      <lst name="headphones">
        <int name="tf">1</int>
      </lst>
      <lst name="usb">
        <int name="tf">1</int>
      </lst>
    </lst>
  </lst>
  ...............
</lst>
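The per-document tf values can be summed on the client to get a total count per term. The following is a minimal sketch, assuming the response has the shape shown above (one lst per document, containing one lst per field, containing one lst per term with an int named tf). The function name total_term_frequencies, the SAMPLE string, and the second document (doc-6) are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical response: doc-5 mirrors the example above, doc-6 is made up.
SAMPLE = """
<response>
  <lst name="termVectors">
    <lst name="doc-5">
      <str name="uniqueKey">MA147LL/A</str>
      <lst name="includes">
        <lst name="cable"><int name="tf">1</int></lst>
        <lst name="earbud"><int name="tf">5</int></lst>
        <lst name="headphones"><int name="tf">1</int></lst>
        <lst name="usb"><int name="tf">1</int></lst>
      </lst>
    </lst>
    <lst name="doc-6">
      <str name="uniqueKey">HYPOTHETICAL-ID</str>
      <lst name="includes">
        <lst name="earbud"><int name="tf">2</int></lst>
        <lst name="usb"><int name="tf">1</int></lst>
      </lst>
    </lst>
  </lst>
</response>
"""


def total_term_frequencies(response_xml, field=None):
    """Sum the per-document tf values for every term in a TVC response.

    If `field` is given, only term vectors of that field are counted.
    """
    root = ET.fromstring(response_xml)
    term_vectors = root.find(".//lst[@name='termVectors']")
    totals = Counter()
    if term_vectors is None:
        return totals
    # Only <lst> children are documents; <str> entries are skipped.
    for doc in term_vectors.findall("lst"):
        for fld in doc.findall("lst"):
            if field is not None and fld.get("name") != field:
                continue
            for term in fld.findall("lst"):
                tf = term.find("int[@name='tf']")
                if tf is not None:
                    totals[term.get("name")] += int(tf.text)
    return totals


totals = total_term_frequencies(SAMPLE)
print(totals["earbud"])  # 7 (5 in doc-5 plus 2 in doc-6)
```

This only sums over the documents present in the response, so covering the whole index would mean requesting term vectors for every document (for example by paging through a match-all query), which can be expensive on a large index.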

That's great, I'm finally getting total word counts, but only per document. Is there a way to get the total count of every word across all the documents under one XML key? Otherwise I can of course combine them programmatically, but I would imagine that if Solr can do this with a specially formed query it will be cheaper. Thanks. - deed02392, Nov 12 '11 at 18:54
