Elasticsearch autocomplete on large text field

Question

I am using Elasticsearch 7.x, and the requirement is to provide autosuggest / type-ahead on a large text field that holds file/document content.

I have explored multiple approaches, and all of them return the entire source document, or a specific field if I restrict the response using _source. I have tried the edge n-gram and n-gram tokenizers, the prefix query, and the completion suggester.

Below is a sample document (the content field might have thousands of sentences):

{
    "content":"Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. 
Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic). 
Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack (after Elasticsearch, Logstash, and Kibana), 
the Elastic Stack now includes a rich collection of lightweight shipping agents known as Beats for sending data to Elasticsearch."
}

Below is expected output:

Search query: el

Output: ["elasticsearch","elastic","elk"]

Search query: analytics e

Output: ["analytics engine"]

Currently I am not able to achieve the above output using the out-of-the-box (OOTB) functionality, so I have used Elasticsearch's highlighting functionality, applied a regex to the result, and built a unique list of suggestions in Java.

Below is my current implementation using the highlight functionality.

Index Mapping:

PUT index
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "ngram_filter": {
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ],
          "min_gram": "1",
          "type": "edge_ngram",
          "max_gram": "12"
        }
      },
      "analyzer": {
        "text_english": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase",
            "stop_filter"
          ]
        },
        "whitespace_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        },
        "ngram_analyzer": {
          "filter": [
            "lowercase",
            "stop_filter",
            "ngram_filter"
          ],
          "type": "custom",
          "tokenizer": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "autocorrect": {
            "type": "text",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "whitespace_analyzer"
          }
        },
        "analyzer": "text_english"
      }
    }
  }
}

Below is the Elasticsearch query, which is executed from Java:

POST autosuggest/_search
{
  "_source": "content.autocorrect",
  "query": {
    "match_phrase": {
      "content.autocorrect": "analytics e"
    }
  },
  "highlight": {
    "fields": {
      "content.autocorrect": {
        "fragment_size": 500,
        "number_of_fragments": 1
      }
    }
  }
}

We then apply a regex pattern to the result of the above query.

Please let me know if there is any way to achieve this without the above workaround.
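
The regex step itself is not shown in the question. A minimal sketch of what it might look like, assuming the highlighter's default `<em>`/`</em>` tags and that adjacent highlighted terms should be merged into one suggestion; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HighlightSuggestions {
    // One or more highlighted terms separated only by whitespace form one suggestion.
    private static final Pattern RUN =
            Pattern.compile("<em>[^<]*</em>(?:\\s+<em>[^<]*</em>)*");
    private static final Pattern TAG = Pattern.compile("</?em>");

    // Turns highlight fragments into a de-duplicated, lowercased suggestion list,
    // preserving the order in which suggestions were first seen.
    static List<String> extract(List<String> fragments) {
        Set<String> unique = new LinkedHashSet<>();
        for (String fragment : fragments) {
            Matcher m = RUN.matcher(fragment);
            while (m.find()) {
                String suggestion = TAG.matcher(m.group()).replaceAll("")
                        .replaceAll("\\s+", " ")
                        .trim()
                        .toLowerCase(Locale.ROOT);
                if (!suggestion.isEmpty()) {
                    unique.add(suggestion);
                }
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Sample fragments are made up to resemble highlighter output.
        List<String> fragments = List.of(
                "open search and <em>analytics</em> <em>engine</em> for all types of data",
                "search and <em>Analytics</em> <em>engine</em>");
        System.out.println(extract(fragments)); // prints [analytics engine]
    }
}
```

Using a `LinkedHashSet` keeps the first-seen order, so suggestions from higher-scoring hits stay at the top if fragments are passed in hit order.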

The completion suggester is not the right tool for your use case. Have you tried the `match_phrase_prefix` query which matches full tokens next to one another and the last one as a prefix? — Val, Feb 04 '22 at 12:31

score 2 · Answer 1 · answered Feb 04 '22 at 12:34

The completion suggester is not the right tool for your use case. Likewise, n-grams and edge n-grams might be overkill, depending on how much content you have.

Have you tried the match_phrase_prefix query which matches full tokens next to one another and the last one as a prefix?

The query below is very simple and should work the way you expect.

POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "analytics e"
    }
  }
}
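
As the comments below point out, this query still needs a highlight section to produce suggestions rather than whole documents. A sketch of how the combined request might be built from Java with the JDK's `java.net.http` API (Java 11+), assuming a `content` field as in the question; the class name, the hand-rolled JSON escaping, and the fragment settings (`fragment_size` 50, three fragments) are illustrative choices, not values from the answer:

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class PhrasePrefixRequest {
    // Escapes only what would break a JSON string literal: quotes, backslashes, control chars.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Search body: match_phrase_prefix on "content" plus a highlight section
    // with small fragments; _source is disabled so the large field is not returned.
    static String body(String userInput, int fragmentSize) {
        return "{"
                + "\"_source\":false,"
                + "\"query\":{\"match_phrase_prefix\":{\"content\":\""
                + jsonEscape(userInput) + "\"}},"
                + "\"highlight\":{\"fields\":{\"content\":{"
                + "\"fragment_size\":" + fragmentSize + ",\"number_of_fragments\":3}}}"
                + "}";
    }

    // Builds (but does not send) the POST <index>/_search request.
    static HttpRequest request(String baseUrl, String index, String userInput) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(userInput, 50)))
                .build();
    }

    public static void main(String[] args) {
        // Prints the JSON body that would be posted to the (assumed) "test" index.
        System.out.println(body("analytics e", 50));
    }
}
```

Sending the request would be a matter of passing it to an `HttpClient`; the highlighted fragments in the response are what a suggestion list would be built from.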

Val

207,596
13
358
360

Will this return the output I am expecting? That is, will the query you have written avoid returning the entire content field and give "analytics engine" as the suggestion? – Sagar Patel Feb 04 '22 at 14:51
You still need the highlight section, for sure, I'm just focusing on the query, to make sure that it will match exactly what you want. But you don't really need the ngram analyzer and all that stuff IMHO, you can just match the content field as is – Val Feb 04 '22 at 14:52
Got it. Is there something that can be achieved using aggregations? – Sagar Patel Feb 04 '22 at 15:40
This is not an aggregation problem, but a search problem. The highlighter with a smaller fragment size should provide pretty much what you need – Val Feb 04 '22 at 15:41
`match_phrase_prefix` works with highlighting and without n-grams, which will help reduce index size. – Sagar Patel Feb 07 '22 at 06:20
I am asking about aggregations because I wonder whether we could use shingles and then a terms aggregation, and use the aggregation output for autocomplete. Something similar is mentioned in the answer to this [post](https://stackoverflow.com/questions/43087205/autocomplete-suggestions-from-article-content), but it is not clear to me. – Sagar Patel Feb 07 '22 at 06:23
I have tried this and it is working, but now the problem is: when I type `hello w`, it returns `hello world` as highlighted and I can show it as a suggestion. But when I type `hello `, it does not return `hello world` as highlighted. Can you please let me know how I can achieve this? – Sagar Patel Mar 23 '22 at 11:54
Indeed, because there's no prefix of the second term to match... In this case, maybe you could simply tokenize your content with an [edge-ngram tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html) (min 2 / max 20) that would index all prefixes of your content, then a simple `match` query on the input value would do the job – Val Mar 23 '22 at 12:01
But this also does not highlight the next term. Basically, I want to highlight the next term, so if the user enters `hello `, it should highlight `hello world`. Also, the field value will be a large text, so it should highlight only the next term. – Sagar Patel Mar 23 '22 at 12:16
I'm not sure how you want to highlight a term that the user has not entered yet :-/ – Val Mar 23 '22 at 12:31
I think it is not possible. The reason I want this is to suggest the next term to the user, since we are using the highlighted term for autosuggestion. I think I should post a separate question for this. – Sagar Patel Mar 23 '22 at 12:38
Yes, it's a slightly different issue – Val Mar 23 '22 at 13:08
You might also want to test with the `search_as_you_type` field type and a `match/bool_prefix` query. The only issue is that it's also doing infix searches, but you should try it anyway – Val Mar 24 '22 at 09:46
Is this going to work with a large text field like content? – Sagar Patel Mar 24 '22 at 10:43
It might, but it's also going to produce a bigger index. Besides, it's not really an "autocomplete" use case, because autocomplete usually works from the beginning of a string and people are usually only typing a few characters to find something. Autocompletion doesn't make sense on a large body of text – Val Mar 24 '22 at 10:55
We don't want to show the entire content of the large article text, but at least if someone types one or two words, it should suggest the next word as an autosuggestion, as I mentioned in the original question. – Sagar Patel Mar 24 '22 at 10:58
Hi! I know the post is old, but I would like to know which approach you chose. I use both the completion suggester and aggregation/shingle. With aggregation/shingle, the use of a regex ("include":"term.*") carries a cost that I am accepting, and I know it's terrible. Storage also increases, because I index title, description, tags and 3 other fields. I'll try match_phrase and highlight. – rabbitbr May 31 '22 at 13:18
Explaining better, I copy the title, description, tags and other 3 fields to a single field (through copy_to), and in this field I index using the shingle and apply the query bool prefix and the aggregation. – rabbitbr May 31 '22 at 13:25
@rabbitbr Did you enable fielddata in order to aggregate on the single field that is created using copy_to? – Sagar Patel Aug 24 '22 at 11:48
Yes, I enabled fielddata on the field. – rabbitbr Aug 24 '22 at 12:58
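
The shingle-plus-aggregation idea raised in the comments can be illustrated outside Elasticsearch. The sketch below is plain in-memory Java, not an Elasticsearch client call: it builds word shingles from some text, counts them, and returns those that start with the typed input, which is roughly what a `shingle` token filter combined with a `terms` aggregation and an `include` pattern would compute server-side. All names and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

final class ShingleSuggester {
    // Lowercases the text and splits it on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Counts every shingle of 1..maxSize consecutive words.
    static Map<String, Integer> shingleCounts(String text, int maxSize) {
        List<String> toks = tokens(text);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < toks.size(); i++) {
            for (int n = 1; n <= maxSize && i + n <= toks.size(); n++) {
                counts.merge(String.join(" ", toks.subList(i, i + n)), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns shingles starting with the input, most frequent first, ties alphabetical.
    static List<String> suggest(Map<String, Integer> counts, String input, int limit) {
        // Keep a trailing space so that "hello " only matches multi-word shingles.
        String prefix = input.toLowerCase(Locale.ROOT).replaceAll("^\\s+", "");
        if (prefix.isEmpty()) {
            return new ArrayList<>();
        }
        return counts.entrySet().stream()
                .filter(e -> e.getKey().startsWith(prefix))
                .sorted((a, b) -> {
                    int byCount = Integer.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                shingleCounts("Hello world, hello there. Hello world again", 2);
        System.out.println(suggest(counts, "hello ", 5)); // prints [hello world, hello there]
    }
}
```

Because a trailing space is kept in the prefix, `hello ` matches the two-word shingle `hello world`, which is the next-word case that highlighting could not cover. In Elasticsearch itself this approach comes with the costs mentioned above: a larger index and fielddata enabled on the shingled text field.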
