5

I have installed Nutch and Solr to crawl a website and search it. As you know, Nutch's parse meta tags plugin can index the meta tags of web pages into Solr (http://wiki.apache.org/nutch/IndexMetatags). Is there any way (a plugin or otherwise) to index another HTML tag that isn't a meta tag into Solr? For example:

<div id=something>
      me specific tag
</div>

That is, I want to add a field to Solr (something) whose value for this page is "me specific tag".

Any ideas?

Amir

4 Answers

3

I made my own plugin for something similar to what you want. The config file that maps NutchDocument to SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml; you can add your own fields there. You still have to populate those fields somewhere, though.

Here are some tips for the plugin:

  • Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to write a plugin in a simple way.
  • In your plugin, implement ParseFilter and IndexingFilter.
  • In YourParseFilter, you can use NodeWalker to find your specific div.
  • Put the parsed information into the page metadata like this:

    page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

  • In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

    doc.add("your_specific_tag", value);

  • Most important: declare your_specific_tag as a field in each of the following:

    • Solr config file schema.xml (and restart Solr)

    <field name="your_specific_tag" type="string" stored="true" indexed="true"/>

    • Nutch config file schema.xml (I don't know whether this is really necessary)
    • Nutch config file solrindex-mapping.xml

    <field dest="your_specific_tag" source="your_specific_tag"/>
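
To illustrate the "find your specific div" step, here is a minimal sketch of the DOM walk on its own. It is not Nutch code: it assumes the page is available as a W3C DOM tree (Nutch parse filters receive one) and, to stay self-contained, builds that tree with the JDK XML parser from well-formed markup. The class and method names (DivTextExtractor, findDivText, extractFromMarkup) are hypothetical, and inside a real plugin you would apply the same walk to the DOM that Nutch hands to your filter rather than parsing a string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: finds the text of <div id="...">.
final class DivTextExtractor {

    // Depth-first search for the first <div> whose id attribute equals divId.
    // Returns the trimmed text content, or null when no such div exists.
    static String findDivText(Node node, String divId) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers may report upper-case tag names, so compare case-insensitively.
            if ("div".equalsIgnoreCase(el.getNodeName())
                    && divId.equals(el.getAttribute("id"))) {
                return el.getTextContent().trim();
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = findDivText(children.item(i), divId);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Assumption for this sketch only: the markup is well-formed XML/XHTML.
    static String extractFromMarkup(String xhtml, String divId) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return findDivText(root, divId);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\n"
                + "      me specific tag\n"
                + "</div></body></html>";
        System.out.println(extractFromMarkup(page, "something"));
    }
}
```

The string returned by findDivText is the value you would store in the page metadata in the parse filter and later copy into the NutchDocument in the indexing filter.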

Babu
  • I've done this too, but somehow some metadata gets lost in the process. When I look for it in the IndexingFilter, getMetadata().get("my_tag") returns null. – Nivaldo Bondança Feb 04 '15 at 16:03
2

Just try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial uses the img tag as its example and lists all the steps needed to extract and index it.

Arul Pandian
1

You can use one of these custom plugins to parse xml files based on xpath (or css selectors):

tahagh
0

You may want to check Nutch Plugin, which should allow you to extract an element from a web page.

javanna
Jayendra