I am running Nutch integrated with Solr as a search engine; the Nutch crawl job runs on Hadoop. My next requirement is to run a content-categorisation job over the crawled content. How can I access the text content stored in HDFS for this tagging job? I plan to write the tagging job in Java, so how can I read this content from Java?
2 Answers
The crawled content is stored in the data file inside the segments directory, for example:

segments\2014...\content\part-00000\data

The file is a Hadoop SequenceFile. To read it you can use the code from the Hadoop book or from this answer.
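A minimal sketch of such a reader, assuming the classic Hadoop 1.x-era `SequenceFile.Reader` API that Nutch segments were written with (the path argument and the key/value types shown in the comments are assumptions based on Nutch's usual layout, where keys are page URLs and values are `org.apache.nutch.protocol.Content` records):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SegmentReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // e.g. segments/<timestamp>/content/part-00000/data
        Path path = new Path(args[0]);
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // Instantiate the key/value classes recorded in the file header.
            Writable key = (Writable) ReflectionUtils.newInstance(
                    reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(
                    reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                // key: page URL (Text); value: the fetched Content record
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}
```

Your tagging job can iterate the records this way and feed each page's text to the categoriser instead of printing it.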
Why don't you use Solr for categorisation?

Just write your own Nutch plugin, categorise pages before they are sent to Solr, and store the category value in Solr.
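A hedged sketch of that approach, assuming the Nutch 1.x `IndexingFilter` extension point (which adds fields to documents before they are indexed into Solr); the `categorize` helper here is a hypothetical stub standing in for your real classifier:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class CategoryIndexingFilter implements IndexingFilter {
    private Configuration conf;

    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                CrawlDatum datum, Inlinks inlinks) {
        String text = parse.getText();          // extracted page text
        doc.add("category", categorize(text));  // extra field sent to Solr
        return doc;
    }

    // Hypothetical classifier stub; replace with your tagging logic.
    private String categorize(String text) {
        return text.toLowerCase().contains("sport") ? "sports" : "general";
    }

    @Override public void setConf(Configuration conf) { this.conf = conf; }
    @Override public Configuration getConf() { return conf; }
}
```

You would register the plugin in Nutch's `plugin.includes` property and add a matching `category` field to the Solr schema, so the value is stored alongside the indexed page.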

Mohsen ZareZardeyni