If you only want to avoid indexing old documents, you could write your own IndexingFilter that checks your condition and skips indexing for the documents that fail it. You don't mention your Nutch version, but assuming you're on 1.x, there is a new PR (it should be ready for the next release) that offers this feature out of the box, using JEXL expressions to allow or prevent documents from being indexed.
If you can grab the PR, test it, and provide some feedback, that would be amazing!
You could also write your own custom plugin if you want; check the mimetype-filter plugin for something similar to what you need (in that case the filtering is based on the MIME type).
Also, a word of warning: at the moment the fetchTime or modifiedTime that Nutch uses comes from the headers that the web server sends when the resource is fetched. Keep in mind that these values should not be trusted (unless you are 100% sure of them), because in most cases you'll get wrong dates. NUTCH-1414 proposes a better approach, extracting the publication date from the content of the page, or you can implement your own parser.
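To give an idea of what "your own parser" could look like, here is a minimal sketch that pulls a publication date out of the page content instead of the HTTP headers. It is not what NUTCH-1414 does; it simply assumes the page carries an Open Graph style article:published_time meta tag with an ISO-8601 value, and the class name PublishedDateExtractor is made up for this example. Many pages won't have this tag, so treat it as a starting point only.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads a publication date from the HTML itself. */
final class PublishedDateExtractor {

    // Assumption: the "property" attribute comes before "content".
    // A real implementation should use an HTML parser instead of a regex.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+property=[\"']article:published_time[\"']\\s+content=[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Returns the publication instant, or empty if the tag is missing or unparseable. */
    static Optional<Instant> extract(String html) {
        Matcher m = META.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            // Expects an ISO-8601 date-time with offset, e.g. 2017-05-01T12:00:00+02:00
            return Optional.of(OffsetDateTime.parse(m.group(1)).toInstant());
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

If the tag is absent you get an empty Optional back, and you can then decide whether to fall back to the header-based dates or to treat the date as unknown.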
Keep in mind that with this approach you still fetch and parse the old documents; you just skip the indexing step.
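The check inside a custom IndexingFilter could be as simple as the sketch below. To keep it runnable without Nutch on the classpath, it only shows the decision logic; the class name AgeCutoff, the 30-day limit in the usage note, and the choice to keep documents with an unknown date are all assumptions for illustration, not anything Nutch ships.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: decides whether a document is recent enough to index. */
final class AgeCutoff {
    private final Duration maxAge;

    AgeCutoff(Duration maxAge) {
        this.maxAge = maxAge;
    }

    /**
     * @param modifiedMillis document date as epoch milliseconds, or <= 0 if unknown
     * @param now            reference time used to compute the cutoff
     * @return true if the document should be indexed
     */
    boolean shouldIndex(long modifiedMillis, Instant now) {
        if (modifiedMillis <= 0) {
            // Assumption: a document without a usable date is kept.
            return true;
        }
        Instant cutoff = now.minus(maxAge);
        // Index only documents dated at or after the cutoff.
        return !Instant.ofEpochMilli(modifiedMillis).isBefore(cutoff);
    }
}
```

In a Nutch 1.x IndexingFilter, as far as I recall, you would call something like new AgeCutoff(Duration.ofDays(30)).shouldIndex(datum.getModifiedTime(), Instant.now()) from the filter(...) method and return null instead of the document when it comes back false, which drops the document from indexing. Double check that against the IndexingFilter interface of your Nutch version.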