
I have some 100,000+ text documents. I'd like to find a way to answer this (somewhat ambiguous) question:

For a given subset of documents, what are the n most frequent words, relative to the full set of documents?

I'd like to present trends, e.g. a word cloud showing something like "these are the topics that are especially hot in the given date range". (Yes, I know that this is an oversimplification: words != topics, etc.)

It seems that I could possibly calculate something like tf-idf values for all words in all documents, and then do some number crunching, but I don't want to reinvent any wheels here.
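To make the tf-idf idea concrete, here is a minimal sketch in plain Python, not tied to Lucene or Solr. It counts each word inside the subset and weights that count by an inverse document frequency computed over the full set, so words that are common everywhere score low. The names `tokenize` and `top_terms`, the sample documents, and the `log(N / df)` weighting are all assumptions made for illustration; other tf-idf variants would give a different ranking.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(all_docs, subset_ids, n=10):
    """Rank subset words by subset frequency times corpus-wide idf."""
    # Document frequency: number of documents in the FULL set containing each word.
    doc_freq = Counter()
    for text in all_docs.values():
        doc_freq.update(set(tokenize(text)))

    # Term frequency: total occurrences of each word inside the subset only.
    subset_tf = Counter()
    for doc_id in subset_ids:
        subset_tf.update(tokenize(all_docs[doc_id]))

    total = len(all_docs)
    # Hypothetical weighting: tf * log(N / df); a word found in every document scores 0.
    scores = {
        word: count * math.log(total / doc_freq[word])
        for word, count in subset_tf.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample data: documents 1 and 2 play the role of "the given date range".
docs = {
    1: "the election results are in",
    2: "the election campaign continues",
    3: "the weather is mild",
    4: "the harvest is late",
    5: "the road is closed",
    6: "the match was cancelled",
}
ranked = top_terms(docs, subset_ids=[1, 2], n=3)
```

In this toy data "the" is the most frequent word in the subset, but it occurs in every document, so its weight drops to zero and "election" ranks first.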

I'm planning on possibly using Lucene or Solr for indexing the documents. Would they help me with this question, and if so, how? Or would you recommend some other tools in addition or instead?

tuomassalo

1 Answer

This should work: http://lucene.apache.org/java/3_1_0/api/contrib-misc/org/apache/lucene/misc/HighFreqTerms.html

This Stack Overflow question also covers term frequencies in general with Lucene.

If you were not using Lucene already, the operation you are talking about is a classic introductory problem for Hadoop (the "word count" problem).

halfer
Ray Toal
  • But can `HighFreqTerms` return stats for a subset of the whole index? (The same question goes for the Hadoop part.) – tuomassalo Sep 12 '11 at 17:02
  • For Hadoop, yes: in your mapper you write filtering code that simply skips certain documents. For Lucene, pass an instance of `org.apache.lucene.index.FilterIndexReader`. – Ray Toal Sep 12 '11 at 17:07
  • I'm interested to try Lucene and `HighFreqTerms`, but I couldn't find any examples of actually *using* `FilterIndexReader` to filter the dataset. Any pointers? – tuomassalo Sep 27 '11 at 06:16
  • I also didn't find any obvious examples on blogs, but [Google Code Search](http://www.google.com/codesearch#search/&q=FilterIndexReader%20lang:%5Ejava$&type=cs) (beyond page 1) had some, and looking up Lucene in Action on amazon.com and using "search inside this book" also gave results. With Lucene, books are pretty helpful, IMHO. – Ray Toal Sep 27 '11 at 13:50
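
The "skip documents in the mapper" idea from the comments can be sketched in plain Python, without Hadoop itself. This only illustrates the map and reduce steps of the word-count pattern; it is not Hadoop's API. The names `map_document`, `reduce_counts`, and `in_september`, the document layout, and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import date


def map_document(doc, keep):
    """Mapper: emit (word, 1) pairs, or nothing if the document is filtered out."""
    if not keep(doc):
        return  # skip documents outside the chosen subset
    for word in doc["text"].lower().split():
        yield word, 1


def reduce_counts(pairs):
    """Reducer: sum the emitted counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


# Made-up documents; the "date" and "text" fields are assumed for illustration.
docs = [
    {"date": date(2011, 9, 1), "text": "solr release notes"},
    {"date": date(2011, 9, 12), "text": "lucene index lucene search"},
    {"date": date(2011, 10, 3), "text": "hadoop cluster setup"},
]


def in_september(doc):
    """Filter predicate: keep only documents dated September 2011."""
    return date(2011, 9, 1) <= doc["date"] <= date(2011, 9, 30)


# Run every document through the mapper, then aggregate in the reducer.
pairs = (pair for doc in docs for pair in map_document(doc, in_september))
counts = reduce_counts(pairs)
top = counts.most_common(1)
```

Running the same pipeline with a predicate that keeps everything gives the counts for the full set, which is the baseline the subset counts would be compared against.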