Reject URLs after fetching based on a condition in Nutch

Question

I want to know whether it's possible to filter the URLs that have been fetched, based on a condition (for example, published date or time). I know that we can use regex-urlfilter to filter URLs before fetching.

In my case I don't want to index old documents. So if a document was published before 2017, it has to be rejected. Do I need a date filter plugin, or is one already available?

Any help will be appreciated. Thanks in advance.

score 1 · Accepted Answer · answered Sep 26 '17 at 10:50

If you only want to avoid indexing old documents, you could write your own IndexingFilter that checks your condition and skips the documents that fail it. You don't mention your Nutch version, but assuming that you're using v1, we have a new PR (it will be ready for the next release) that offers this feature out of the box, using JEXL expressions to allow or prevent documents from being indexed.

If you can grab the PR, test it, and provide some feedback, that would be amazing!

If you'd rather write your own custom plugin, you can check the mimetype-filter for something similar to what you want (in that case we apply the filtering based on the MIME type).
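To make the idea concrete, here is a minimal sketch of the date check such a custom plugin could run. It is only an illustration under stated assumptions: the class `PublishedDateCutoff` and the method `shouldIndex` are hypothetical names, not part of Nutch, and the sketch assumes your parse plugin already stores the published date as an ISO-8601 instant string. In a Nutch 1.x IndexingFilter, returning null from `filter(...)` is how a document gets dropped, so you would call a check like this there and return null when it says no.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

/** Hypothetical helper, not part of Nutch: decides whether a document is recent enough to index. */
final class PublishedDateCutoff {

    // Assumed cutoff taken from the question: reject anything published before 2017 (UTC).
    static final Instant CUTOFF = Instant.parse("2017-01-01T00:00:00Z");

    /**
     * Returns true when the document should be indexed.
     * Assumption: the date arrives as an ISO-8601 instant such as "2017-03-01T10:00:00Z".
     * Documents with a missing or unparseable date are kept; that is a policy choice
     * you may want to invert.
     */
    static boolean shouldIndex(String publishedIso) {
        if (publishedIso == null || publishedIso.isEmpty()) {
            return true;
        }
        try {
            return !Instant.parse(publishedIso).isBefore(CUTOFF);
        } catch (DateTimeParseException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("2016-12-31T23:59:59Z")); // published before the cutoff
        System.out.println(shouldIndex("2017-03-01T10:00:00Z")); // published after the cutoff
    }
}
```

Keeping the check in a small class of its own means you can unit test it without starting a crawl.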

Also, a word of warning: at the moment the fetchTime and modifiedTime values that Nutch uses come from the headers that the web server sends when the resource is fetched. Keep in mind that these values should not be trusted (unless you are 100% sure), because in most cases you'll get wrong dates. NUTCH-1414 proposes a better approach, extracting the publication date from the content of the page, or you can implement your own parser.

Keep in mind that with this approach you still fetch and parse the old documents; you just skip the indexing step.
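One way to act on that warning is to prefer a date extracted from the page content and fall back to the header value only when nothing better is available. The sketch below is an assumption-laden illustration, not how Nutch or NUTCH-1414 does it: `PublishedDateResolver` and its methods are hypothetical names, the content date is assumed to be an ISO-8601 instant string, and the header is assumed to be in the usual HTTP date format (RFC 1123), as a Last-Modified header would be.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper, not part of Nutch: picks the most trustworthy date available. */
final class PublishedDateResolver {

    /** Parses an ISO-8601 instant (assumed format of the date extracted from the content). */
    static Optional<Instant> parseIso(String value) {
        if (value == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Instant.parse(value.trim()));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    /** Parses an HTTP date such as "Tue, 26 Sep 2017 10:50:00 GMT". */
    static Optional<Instant> parseHttpDate(String value) {
        if (value == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(
                ZonedDateTime.parse(value.trim(), DateTimeFormatter.RFC_1123_DATE_TIME).toInstant());
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    /** Prefers the content date; uses the header date only as a fallback. */
    static Optional<Instant> resolve(String contentDateIso, String lastModifiedHeader) {
        Optional<Instant> fromContent = parseIso(contentDateIso);
        return fromContent.isPresent() ? fromContent : parseHttpDate(lastModifiedHeader);
    }
}
```

If neither source yields a date, `resolve` returns an empty Optional, and it is up to your filter to decide whether undated documents are indexed or dropped.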

Thank you for your response. I already have a custom index filter plugin, and I have now added a date filter option to it that skips indexing of old documents. I also have a parse plugin that extracts all the relevant details from the site, so I take the document's published date from that plugin and filter on this value in my indexer plugin. Are there any other options available? Right now I am passing this document (which isn't actually needed, since it's old) through the parsing and indexing stages. I want to skip it right after the fetching stage. - Abhishek Ramachandran, Sep 27 '17 at 05:27
The issue is that the responsibility of the fetcher is only to *fetch* documents; it doesn't decide what happens to them afterward. If you want to stay within the default Nutch behavior, you still need to parse the document in order to get the useful info (the date), and then you can decide what to do with the document. You could write your own fetcher, but that is not so easy to maintain. Keep in mind that after parsing an old document you could still find valid outlinks to more recent documents. - Jorge Luis, Sep 27 '17 at 08:39
