
I have a Lucene index with the following documents:

doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }

so these 5 documents use 14 different terms:

[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]

the frequency of each term:

[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]

for easy reading:

[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]

What I want to know now is: how do I obtain the term frequency vector for a set of documents?

for example:

Set<Documents> docs := [ doc2, doc3 ]

termFrequencies = magicFunction(docs);

System.out.print( termFrequencies );

would result in the output:

[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]

remove all zeros:

[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]

Notice that the result vector contains only the term frequencies of the given set of documents, NOT the overall frequencies of the whole index! The term 'planet' appears 4 times in the whole index, but the source set of documents contains it only 2 times.

A naive implementation would be to just iterate over all documents in the docs set, create a map, and count each term. But I need a solution that would also work with a document set size of 100,000 or 500,000.
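The naive version I mean would look roughly like this (a sketch only, with documents modeled as plain token lists rather than real Lucene documents):

```java
import java.util.*;

public class TermCounter {

    // Naive aggregation: walk every document in the set and count each term.
    public static Map<String, Integer> termFrequencies(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc2 = Arrays.asList("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = Arrays.asList("amarr", "laser", "armor", "planet");
        // Contains planet:2, armor:2, gallente:1, dodixie:1, amarr:1, laser:1
        // (HashMap iteration order is unspecified).
        System.out.println(termFrequencies(Arrays.asList(doc2, doc3)));
    }
}
```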

Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that one could build at index time to obtain such a term vector easily and quickly?

I'm not much of a Lucene expert, so I'm sorry if the solution is obvious or trivial.

Maybe worth mentioning: the solution should be fast enough for a web application, applied to client search queries.

ManBugra
    So you have 500K documents, how big is your term list? – Justin May 27 '10 at 19:33
  • I know exactly what you're trying to accomplish, too bad I don't have an answer to your question :) – Esko May 27 '10 at 19:34
  • @Justin: I have around 2,000 different terms; the absolute max in a few years is maybe 10,000, but certainly not more. – ManBugra May 27 '10 at 19:46
  • hi ManBugra, I also have a similar requirement. Did you find any way to solve the problem of getting the count on a set of documents? – Siva Feb 02 '11 at 11:03
  • @ManBugra: could you please share how to count term frequency? – Emma Jul 11 '13 at 12:33

2 Answers


Go to http://lucene.apache.org/java/3_0_1/api/core/index.html and check this method:

org.apache.lucene.index.IndexReader.getTermFreqVectors(int docno);

You will have to know the document id. This is an internal Lucene id, and it usually changes on every index update (one that has deletes :-)).

I believe there is a similar method in Lucene 2.x.x.
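The per-document vectors expose parallel arrays of terms and frequencies, so summing them over a document set could look roughly like this (a sketch against the Lucene 3.x API, untested; it assumes the field was indexed with term vectors enabled, e.g. `Field.TermVector.YES`):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorSum {

    // Sum the stored term vectors of the given internal doc ids for one field.
    public static Map<String, Integer> sumTermVectors(IndexReader reader,
                                                      String field,
                                                      int[] docIds) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int docId : docIds) {
            TermFreqVector tfv = reader.getTermFreqVector(docId, field);
            if (tfv == null) {
                continue; // no term vector stored for this doc/field
            }
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                Integer old = counts.get(terms[i]);
                counts.put(terms[i], old == null ? freqs[i] : old + freqs[i]);
            }
        }
        return counts;
    }
}
```

Since the ids are internal and unstable across updates, you would typically collect them fresh from a search (e.g. via a `Collector`) rather than store them.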

Mihai Toader

I don't know Lucene; however, your naive implementation will scale, provided you don't read an entire document into memory at one time (i.e. use an on-line parser). English text is about 83% redundant, so your biggest document will have a map with about 85,000 entries in it. Use one map per thread (and one thread per file, pooled obviously) and you will scale just fine.

Update: If your term list does not change frequently, you might try building a search tree out of the characters in your term list, or building a perfect hash function (http://www.gnu.org/software/gperf/) to speed up file parsing (mapping from search terms to target strings). Probably a plain big HashMap would perform about as well.
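With a fixed vocabulary (around 2,000 terms per the comment above), restricting the count to known terms with a pre-sized HashMap is one way to sketch this; the vocabulary and token stream here are illustrative stand-ins:

```java
import java.util.*;

public class FixedTermCounter {

    // Count only tokens that belong to a known, fixed vocabulary;
    // all other tokens are skipped.
    public static Map<String, Integer> count(Set<String> vocabulary,
                                             Iterable<String> tokens) {
        // Pre-size the map to avoid rehashing during counting.
        Map<String, Integer> counts = new HashMap<>(vocabulary.size() * 2);
        for (String token : tokens) {
            if (vocabulary.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```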

Justin