I have many documents, over ten thousand (maybe more). I'd like to extract some keywords from each document, say 5 keywords per document, using Hadoop. Each document may talk about a unique topic.

My current approach is to use Latent Dirichlet Allocation (LDA) as implemented in Mahout. However, since each document talks about a different topic, the number of extracted topics should equal the number of documents, which is very large. Because LDA becomes very inefficient when the number of topics grows large, my approach is to randomly group the documents into small groups of 100 documents each and then use Mahout LDA to extract 100 topics from each group. This works, but may not be very efficient, because each time I run Hadoop on only a small set of documents. Does anyone have a better (more efficient) idea for this?
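For reference, a minimal sketch of the random-grouping step described above (the file paths, output layout, and group size are placeholders, and the per-group Mahout LDA run itself is not shown):

```python
import os
import random
import shutil

def make_groups(doc_paths, out_dir, group_size=100, seed=0):
    """Shuffle document paths and copy them into fixed-size group directories."""
    random.seed(seed)
    paths = list(doc_paths)
    random.shuffle(paths)
    for i in range(0, len(paths), group_size):
        group_dir = os.path.join(out_dir, "group-%05d" % (i // group_size))
        os.makedirs(group_dir, exist_ok=True)
        for p in paths[i:i + group_size]:
            shutil.copy(p, group_dir)
        # Each group_dir would then be fed to a separate Mahout LDA job.

# Hypothetical usage: documents live under ./docs, groups are written to ./groups.
# make_groups((os.path.join("docs", f) for f in os.listdir("docs")), "groups")
```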
- While I think this question is better suited to [Cross Validated](http://stats.stackexchange.com), as it concerns search algorithms, I'd look into term frequency–inverse document frequency (TF-IDF), which can help determine the importance of keywords in a document, controlling for the length of the document. A Mahout or MapReduce job can calculate TF-IDF across lots of documents and return the top values per document (see the sketch after these comments). – economy Apr 14 '15 at 23:44
- Are you suggesting using only TF-IDF and skipping the LDA algorithm?! But this will not be accurate. – HHH Apr 14 '15 at 23:51
- It depends on your goal. TF-IDF is implemented in search engine ranking algorithms across huge datasets, so I wouldn't be quick to label it inaccurate. Again, this is a question for another forum. – economy Apr 14 '15 at 23:54
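Following up on the TF-IDF suggestion above, here is a small, self-contained sketch of the scoring idea: term frequency weighted by inverse document frequency, keeping the top 5 terms per document. This is plain Python rather than a Mahout/MapReduce job, and the corpus and whitespace tokenization are toy placeholders:

```python
import math
from collections import Counter

def top_keywords(docs, k=5):
    """Score terms with TF-IDF and return the top-k terms for each document."""
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # TF-IDF score; terms appearing in every document get weight 0.
        scores = {t: (c / len(tokens)) * math.log(n_docs / df[t])
                  for t, c in tf.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return results

# Toy usage; the real input would be the ten-thousand-plus documents on HDFS.
docs = ["hadoop mapreduce job scheduling on a cluster",
        "dirichlet allocation for topic modeling of text",
        "keyword extraction with tf idf weighting"]
for doc_id, keywords in enumerate(top_keywords(docs)):
    print(doc_id, keywords)
```

The same per-term counting and top-k selection can be expressed as a pair of MapReduce passes (one for document frequencies, one for per-document scoring), which is what the comment above is pointing at.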