I want to remove specific elements from the page response before it is handed to Nutch. Specifically, I want to mark parts of my pages like this:

 <div class="noindex">I shall not be indexed</div>

I want to remove them before Nutch parses the page, so that "I shall not be indexed" is not present in the NutchDocument afterwards. I plan to wrap my navigation, header, and footer content this way, because right now they are present in every document in the index.

Thanks, Paul

Paul Schyska

1 Answer

You have some alternatives for doing that:
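One possible direction, offered as a sketch under assumptions and not taken from the answer itself: if the page is available as a DOM tree at some point before its text is extracted, the marked elements could be pruned from that tree. The class `NoIndexStripper` and its methods are hypothetical names, the code uses only the JDK's `org.w3c.dom` API, and where exactly to hook this into Nutch's parsing is left open.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: prunes elements marked class="noindex".
public class NoIndexStripper {

    // Removes every descendant element whose class attribute contains the token "noindex".
    public static void stripNoIndex(Node root) {
        List<Node> toRemove = new ArrayList<>();
        collect(root, toRemove);
        // Remove after traversal so the tree is not modified while walking it.
        for (Node n : toRemove) {
            Node parent = n.getParentNode();
            if (parent != null) {
                parent.removeChild(n);
            }
        }
    }

    private static void collect(Node node, List<Node> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // getAttribute returns "" when the attribute is absent.
            String cls = ((Element) node).getAttribute("class");
            if (Arrays.asList(cls.trim().split("\\s+")).contains("noindex")) {
                out.add(node);
                return; // the whole subtree goes away, no need to descend
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Demo only: parses well-formed XHTML, strips marked elements, returns the remaining text.
    public static String demo(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        stripNoIndex(doc);
        return doc.getDocumentElement().getTextContent();
    }
}
```

The demo method assumes well-formed XHTML input only so that it can run with the JDK's XML parser; real pages would need an HTML-tolerant parser to build the DOM first.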

josegil