Out of the box, the Nutch index writer for Elasticsearch creates an Elasticsearch index whose name is taken from the elastic.index property in nutch-site.xml (or nutch-default.xml):

   <property> 
     <name>elastic.index</name>
     <value>nutch</value> 
     <description>Default index to send documents to.</description>
   </property>

The mappings section that Elasticsearch reports for such an automatically generated index always has the following structure:

   {
       "nutch": {
           "mappings": {
               "doc": {
                   "properties": {
                       "anchor": {
                           "type": "string"
                       },
                       "boost": {
                           "type": "string"
                       },
                       "cache": {
                           "type": "string"
                       },
                       "content": {
                           "type": "string"
                       },
                       "contentLength": {
                           "type": "string"
                       },
                       "date": {
                           "type": "date",
                           "format": "dateOptionalTime"
                       },
                       "digest": {
                           "type": "string"
                       },
                       "host": {
                           "type": "string"
                       },
                       "id": {
                           "type": "string"
                       },
                       "lang": {
                           "type": "string"
                       },
                       "lastModified": {
                           "type": "date",
                           "format": "dateOptionalTime"
                       },
                       "segment": {
                           "type": "string"
                       },
                       "title": {
                           "type": "string"
                       },
                       "tstamp": {
                           "type": "date",
                           "format": "dateOptionalTime"
                       },
                       "type": {
                           "type": "string"
                       },
                       "url": {
                           "type": "string"
                       }
                   }
               }
           }
       }
   }
  1. Where is the template for this?
  2. Can it be changed?
  3. If yes, which fields are mandatory and which are optional?
  4. Where can I find more information on this?

Any help appreciated! Thanks, Wolfram

wbartussek

1 Answer

Welcome to StackOverflow!

Here's my take on your questions:

  1. It doesn't look like Nutch creates any template. If you look at the source code of ElasticIndexWriter, there's no reference to a template anywhere.

  2. Since Nutch doesn't create any index template, you can't change it... but you can definitely create one yourself directly in your ES cluster, if you want/need to control the mapping of certain fields.

You can start from the default mapping created by Nutch (i.e. the one you've pasted in your question) and iterate on that. Creating a template out of it is trivial: you just add the "template": "nutch*" property (the first property in the body below) and you're good to go. The Elasticsearch documentation has more information on how to change mappings.

curl -XPUT localhost:9200/_template/nutch_template -d '{
  "template": "nutch*",
  "mappings": {
    "doc": {
      "properties": {
        "anchor": {
          "type": "string"
        },
        "boost": {
          "type": "string"
        },
        "cache": {
          "type": "string"
        },
        "content": {
          "type": "string"
        },
        "contentLength": {
          "type": "string"
        },
        "date": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "digest": {
          "type": "string"
        },
        "host": {
          "type": "string"
        },
        "id": {
          "type": "string"
        },
        "lang": {
          "type": "string"
        },
        "lastModified": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "segment": {
          "type": "string"
        },
        "title": {
          "type": "string"
        },
        "tstamp": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "type": {
          "type": "string"
        },
        "url": {
          "type": "string"
        }
      }
    }
  }
}'
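
If you would rather derive the template body from whatever mapping your index currently has instead of copying it by hand, something along these lines could work. This is only a minimal sketch: the function name mapping_to_template is hypothetical, and it assumes the input has the same shape as the mapping output pasted in the question (index name at the top level, then "mappings").

```python
import json


def mapping_to_template(mapping_response, index_name="nutch", pattern="nutch*"):
    """Build a template body from a mapping response shaped like the one in the question.

    mapping_to_template is a hypothetical helper, not part of Nutch or Elasticsearch.
    """
    # Keep the existing mappings untouched and only add the "template" pattern.
    mappings = mapping_response[index_name]["mappings"]
    return {"template": pattern, "mappings": mappings}


if __name__ == "__main__":
    # Shortened sample of the mapping from the question (two fields only).
    sample = {
        "nutch": {
            "mappings": {
                "doc": {
                    "properties": {
                        "url": {"type": "string"},
                        "tstamp": {"type": "date", "format": "dateOptionalTime"},
                    }
                }
            }
        }
    }
    # The printed JSON can be used as the body of the PUT request above.
    print(json.dumps(mapping_to_template(sample), indent=2))
```

You can then edit individual field definitions in the resulting body before sending it to your cluster.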

3-4. There is a description of all the fields indexed/stored by Nutch in their wiki, so you can modify the mapping above in order to store/index certain fields differently to match your exact needs.

Note: make sure to wipe your current nutch index first, then create your template (point 2 above). When Nutch then indexes its first document, the index will be created automatically, and because its name matches the nutch* pattern, the template's mappings will be applied to it.
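
The order of operations in the note can be sketched as follows. This is an illustration under assumptions, not a tested procedure: the helper names reset_plan and run are hypothetical, the cluster is assumed to listen on localhost:9200 as in the curl example, and deleting the index destroys the data in it.

```python
import json
import urllib.request

# Assumption: same host and port as in the curl example above.
ES_URL = "http://localhost:9200"


def reset_plan(template_body, index="nutch", template_name="nutch_template"):
    """Return (method, url, body) tuples in the order the note describes."""
    return [
        # 1. Wipe the current index (this deletes its documents).
        ("DELETE", f"{ES_URL}/{index}", None),
        # 2. Create the template; Nutch's next indexing run recreates the index.
        ("PUT", f"{ES_URL}/_template/{template_name}", template_body),
    ]


def run(plan):
    """Send each request in order; urlopen raises HTTPError on a 4xx/5xx reply."""
    for method, url, body in plan:
        data = json.dumps(body).encode("utf-8") if body is not None else None
        req = urllib.request.Request(
            url,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(method, url, resp.status)


if __name__ == "__main__":
    # Print the plan only; call run(plan) yourself once you are sure about the delete.
    for step in reset_plan({"template": "nutch*", "mappings": {}}):
        print(step[0], step[1])
```

Note that the DELETE step will raise an error if the index does not exist yet, in which case you can simply skip it.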

You might also be interested in looking into the issue FLUME-2787, as someone else seems to have gone through template creation there. You might find some nuggets in it.

Val
  • In fact, I went through the source code of the index writer first, and as you said, there is no reference to a template. What I was missing was the list of fields indexed/stored by Nutch that you mentioned (in their wiki). So, depending on what is enabled in the plugin list in nutch-site.xml, the Nutch index writer sends the corresponding fields and Elasticsearch mappings are generated from them. By inspecting the generated mappings, one could then also tell which plugins were successfully enabled (or not). The resulting mappings can in turn be found in Elasticsearch, as I did; right? – wbartussek Dec 03 '15 at 17:43