Can Elasticsearch's edge_ngram filter be set up so that it builds multi-word phrases as ES indexes crawled data?

I'd like to use those multi-word phrases as search suggestions for a small search app that I'm building.

I'm using Nutch to crawl some sites and using ES to index the crawled data.

I figured that since ES can split on whitespace, this shouldn't be that hard. However, I'm not getting the results I expected, so now I'm asking whether this is even possible.

My ES index is set up like this:

    PUT /_template/autocomplete_1
    {
      "template": "auto*",
      "settings": {
        "index": {
          "number_of_shards": 1,
          "number_of_replicas": 1
        },
        "analysis": {
          "filter": {
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": "1",
              "max_gram": "30",
              "token_chars": ["letter", "digit", "whitespace"]
            }
          },
          "analyzer": {
            "autocomplete_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "autocomplete_filter"
              ]
            }
          }
        }
      },
      "mappings": {
        "doc": {
          "_all": {
            "enabled": false
          },
          "properties": {
            "anchor": {
              "type": "string"
            },
            "boost": {
              "type": "string"
            },
            "content": {
              "type": "string",
              "index_analyzer": "autocomplete_analyzer",
              "search_analyzer": "standard"
            },...

"content" is the HTML body field that Nutch produces. I'm using "content" because I figured it would generate the most phrases.

user3125823
  • For creating multi-word phrases you need [shingles](https://www.elastic.co/blog/searching-with-shingles), but I'm not sure what kind of autocomplete you need. Do you have a sample document and a sample search text? – Andrei Stefan May 11 '16 at 18:39
  • @AndreiStefan, something to the effect of searching for a movie title like "the fast and the furious" or "fast 5" or "fast five", where the search query would be "f". I'm reading up on shingles now. – user3125823 May 12 '16 at 14:37
  • @AndreiStefan, I think this is exactly what I've been looking for! Put the info into an answer and I will accept it, thanks very much. – user3125823 May 12 '16 at 14:56

1 Answer

For creating multi-word phrases you need shingles. More specifically, you need the shingle token filter, which combines adjacent tokens into multi-word tokens.
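
As an illustration, here is a minimal sketch of analysis settings that use a shingle filter. The index name `autocomplete_shingles`, the filter name `phrase_shingle_filter`, the analyzer name `phrase_shingle_analyzer`, and the shingle sizes are assumptions chosen for the example, not values from the question, and the exact syntax may need adjusting for your Elasticsearch version:

    PUT /autocomplete_shingles
    {
      "settings": {
        "analysis": {
          "filter": {
            "phrase_shingle_filter": {
              "type": "shingle",
              "min_shingle_size": 2,
              "max_shingle_size": 3,
              "output_unigrams": true
            }
          },
          "analyzer": {
            "phrase_shingle_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "phrase_shingle_filter"
              ]
            }
          }
        }
      }
    }

With these settings, a title such as "the fast and the furious" should yield single words plus two- and three-word phrases like "the fast", "the fast and", and "fast and the". The `_analyze` API is a convenient way to check the tokens that are actually produced before wiring the analyzer into a mapping.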

Andrei Stefan