I have set up Nutch 1.17 to crawl a few websites. At a high level, there are two types of web pages. The first type is category pages or home pages, which do not contain the details of any specific story but provide links to, and short excerpts of, multiple pages. The second type is pages that contain a complete story in detail, i.e., articles.
My issue is this: how can I identify whether a given page is an actual article page or a category page? I would also like to index only the article (story) pages.
I don't think Nutch provides anything for this by default. How could I achieve this behavior?