
I am running Nutch integrated with Solr as a search engine; the Nutch crawl job runs on Hadoop. My next requirement is to run a content-categorisation job over the crawled content. Since I plan to implement this tagging job in Java, how can I access the crawled text content that is stored in HDFS from Java?

2 Answers


The crawled content is stored in the data file under the segments directory, for example:

segments/2014.../content/part-00000/data

That file is a Hadoop SequenceFile. To read it, you can use code from the Hadoop book or from this answer.
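
For illustration, here is a minimal sketch of reading such a file with the plain Hadoop SequenceFile API. It assumes a Nutch 1.x segment, where the keys are URLs (Text) and the values are org.apache.nutch.protocol.Content objects (so the Nutch jar must be on the classpath); the segment path below is a placeholder you have to replace with your real segment name.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SegmentReader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Placeholder path: substitute your actual segment timestamp
            Path data = new Path("segments/2014XXXXXXXXXX/content/part-00000/data");

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
            try {
                // Key/value classes are recorded in the file header; for a
                // Nutch content file this is Text -> org.apache.nutch.protocol.Content
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                while (reader.next(key, value)) {
                    System.out.println(key); // the page URL
                    // To get the raw page bytes, cast the value:
                    // byte[] raw = ((org.apache.nutch.protocol.Content) value).getContent();
                }
            } finally {
                reader.close();
            }
        }
    }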


Why don't you do the categorization as part of indexing to Solr?

Just write your own Nutch plugin that categorizes each page before it is sent to Solr, and store the category value in a Solr field!
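
If you go this route, a Nutch 1.x IndexingFilter is the usual extension point. Below is a minimal, hypothetical sketch: the categorize() helper and the "category" field name are placeholders (the field must also exist in your Solr schema), and the plugin still has to be declared in a plugin.xml and enabled via the plugin.includes property.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    // Sketch of an indexing filter that tags each page with a category
    // before the document is sent to Solr.
    public class CategoryIndexingFilter implements IndexingFilter {
        private Configuration conf;

        @Override
        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                    CrawlDatum datum, Inlinks inlinks)
                throws IndexingException {
            String text = parse.getText();          // extracted page text
            doc.add("category", categorize(text));  // stored as a Solr field
            return doc;
        }

        // Hypothetical placeholder: plug in your real classification logic here.
        private String categorize(String text) {
            return text.toLowerCase().contains("sport") ? "sports" : "general";
        }

        @Override
        public void setConf(Configuration conf) { this.conf = conf; }

        @Override
        public Configuration getConf() { return conf; }
    }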