0

I'm using Nutch 1.6 and Solr 4.3 on Ubuntu Server 12.04. I would like to switch content indexing on and off within a page. Is there a way to specify this behaviour in my HTML pages so that Solr behaves accordingly?

As an example, when using Google Search Appliance I would put "googleoff" / "googleon" tags around the content on the page that I don't want indexed (headers, footers, copyright strings, etc.).

Thank you.

MarioCannistra
  • 275
  • 3
  • 12

2 Answers

3

You will need to create a custom plugin for Nutch to accomplish this behavior. Below are some relevant links with examples.
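To give a rough idea of the core logic such a plugin would need, here is a minimal sketch in plain Java. It only shows the stripping step: removing everything between the marker comments before the text is handed on for indexing. The class name `IndexMarkerStripper` and its `strip` method are hypothetical and not part of Nutch, and the marker syntax (`<!--googleoff: index-->` ... `<!--googleon: index-->`) is assumed to follow the Google Search Appliance convention mentioned in the question. Wiring this into Nutch (presumably through a parse filter extension in your plugin) is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes regions fenced by marker comments.
final class IndexMarkerStripper {

    // Assumed marker syntax, modelled on the GSA convention:
    //   <!--googleoff: index--> ...excluded content... <!--googleon: index-->
    // An "off" marker without a matching "on" marker excludes the rest of the input.
    private static final Pattern EXCLUDED = Pattern.compile(
            "<!--\\s*googleoff:\\s*index\\s*-->.*?(<!--\\s*googleon:\\s*index\\s*-->|\\z)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the HTML with every excluded region (markers included) removed.
    static String strip(String html) {
        return EXCLUDED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>Keep me</p>"
                + "<!--googleoff: index--><div>Footer</div><!--googleon: index-->"
                + "<p>Keep me too</p>";
        // prints: <p>Keep me</p><p>Keep me too</p>
        System.out.println(strip(page));
    }
}
```

A regular expression is enough here because the markers are HTML comments rather than elements, so they cannot be nested in a meaningful way. If your template system injects real custom tags instead, walking the parsed DOM inside the plugin is probably the safer route.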

Paige Cook
  • 22,415
  • 3
  • 57
  • 68
  • 1
    The second link is very clear in what needs to happen. I have an implementation just like it to target custom tags injected by our template system so I imagine writing a similar plugin will do the trick for you, Zander. – Butifarra May 17 '13 at 18:45
  • Thank you Paige and Claude. Will try this approach. – MarioCannistra May 20 '13 at 06:54
0

There is a text file, "robots.txt", that tells search engines which HTML pages they are allowed to crawl for content and which they are not. In the linked "FAQ robots.txt: How to stop indexing" you will find all the information.

alfeliz
  • 1
  • 2
  • That file controls crawlers' activity in the web folder where it is placed. Instead, I'm referring to a way to control indexing inside a page with tags (please google the tags googleoff / googleon for more details) – MarioCannistra May 17 '13 at 10:27