The developer documentation for Heritrix 3.x is largely out of date: most of it pertains to Heritrix 1.x, and most of the classes have since been changed or significantly rewritten/refactored. Could anyone point me to the class (or classes) in the system that deal with the actual web page content extraction?
What I want to do is obtain the content of a web page Heritrix is about to crawl and then apply a classifier to that content (analyze structural features, etc.). I think this functionality may be distributed among the ContentExtractor class and its many subclasses, but what I'm trying to locate is the point where I have the web page content either in its entirety or as a readable/parseable stream. Where is the content (the HTML) that Heritrix applies regular expressions to (in order to find links, certain file types, etc.)?