
I'm looking to categorize lots of websites (millions). I can use Nutch to crawl them and get their content, but I'm looking for the best (and cheapest or free) tool to categorize them.

One option is to create regular expressions that look for certain keywords and categorize the sites, but there are also high-end LSI-type tools like Autonomy. Are there any open source or cheaper tools that will take the text from a webpage/site and categorize it for me? I need some customization of the categories used. As part of the categorization I would like to be able to recognize "fake" sites that are really just parked pages, or domainers putting ads on the pages, as well as plain old categories like news, sports, science, health, food, entertainment, etc.
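For illustration, here is a minimal Python sketch of the regex/keyword option described above. The category names and keyword lists are purely hypothetical and would need to be far larger and tuned against real crawled content:

```python
import re

# Hypothetical keyword patterns per category; real lists would be much
# larger and tuned against actual crawled pages.
CATEGORY_PATTERNS = {
    "news":   re.compile(r"\b(breaking|headline|reporter|editorial)\b", re.I),
    "sports": re.compile(r"\b(football|baseball|league|playoff|score)\b", re.I),
    "health": re.compile(r"\b(symptom|diagnosis|nutrition|wellness)\b", re.I),
    # Parked/"fake" pages often carry domain-sale or ad-feed boilerplate.
    "parked": re.compile(r"\b(this domain is for sale|related searches|sponsored listings)\b", re.I),
}

def categorize(page_text: str) -> str:
    """Return the category whose keywords match most often, or 'unknown'."""
    scores = {cat: len(pat.findall(page_text))
              for cat, pat in CATEGORY_PATTERNS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(categorize("Playoff results and league standings"))      # -> sports
print(categorize("This domain is for sale. Related searches")) # -> parked
```

An LSI-type tool would replace this keyword scoring with a trained model over the extracted text, but the surrounding plumbing (crawl, extract text, classify) stays the same.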

Joelio
  • Did you manage to succeed with this project? Did you manage to classify the "fake" sites? – Leeor Sep 03 '13 at 13:26
  • 1
    for that project, we ended up just using regular expressions, but I would still like to find something like what I was looking for. – Joelio Sep 03 '13 at 14:08
  • Is there a fast way to use Nutch for text extraction, and can we use Nutch for categorization or any other purpose? – Divyang Shah Apr 10 '15 at 08:19

0 Answers