I am using Elasticsearch 7.x, and the requirement is to provide autosuggest / type-ahead on a large text field that holds file/document content.
I have explored several approaches (edge n-gram and n-gram tokenizers, prefix query, completion suggester), but every one of them returns the entire source document, or the whole field if I restrict the response using _source.
Below is a sample document (the content field might contain thousands of sentences):
{
"content":"Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic).
Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack (after Elasticsearch, Logstash, and Kibana),
the Elastic Stack now includes a rich collection of lightweight shipping agents known as Beats for sending data to Elasticsearch."
}
Below is the expected output:
Search query: el
Output: ["elasticsearch","elastic","elk"]
Search query: analytics e
Output: ["analytics engine"]
Currently I cannot achieve the above output with the out-of-the-box functionality, so I use Elasticsearch's highlighting, apply a regex to the result, and build a unique list of suggestions in Java.
Below is my current implementation using the highlight functionality.
Index mapping:
PUT autosuggest
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "ngram_filter": {
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ],
          "min_gram": "1",
          "type": "edge_ngram",
          "max_gram": "12"
        }
      },
      "analyzer": {
        "text_english": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase",
            "stop_filter"
          ]
        },
        "whitespace_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        },
        "ngram_analyzer": {
          "filter": [
            "lowercase",
            "stop_filter",
            "ngram_filter"
          ],
          "type": "custom",
          "tokenizer": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "autocorrect": {
            "type": "text",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "whitespace_analyzer"
          }
        },
        "analyzer": "text_english"
      }
    }
  }
}
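To illustrate what ngram_analyzer indexes into content.autocorrect, here is a rough Java sketch that emulates only the letter tokenizer, the lowercase filter, and the edge_ngram filter with min_gram 1 and max_gram 12. It is an approximation for reasoning about the mapping, not Elasticsearch code: the class and method names are hypothetical, the stop filter is left out, and token positions are not modelled. As far as I understand, every prefix is indexed at the position of its source word, which would explain why a phrase query for "analytics e" (analyzed at search time into "analytics" and "e") matches "analytics engine".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class EdgeNgramSketch {

    // Returns the lowercase prefixes (minGram..maxGram characters) of every
    // letter-only token in the text. Hypothetical helper, not an Elasticsearch API.
    public static List<String> edgeNgrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // Rough stand-in for the "letter" tokenizer: split on any non-letter run.
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first element if the text starts with a non-letter
            }
            String lower = token.toLowerCase(Locale.ROOT);
            int limit = Math.min(maxGram, lower.length());
            for (int len = minGram; len <= limit; len++) {
                grams.add(lower.substring(0, len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the prefixes of "analytics" followed by the prefixes of "engine".
        System.out.println(edgeNgrams("analytics engine", 1, 12));
    }
}
```

Because only prefixes of at most 12 characters are indexed, a typed word longer than 12 characters would have no matching term in this sketch.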
Below is the Elasticsearch query, which is executed from Java:
POST autosuggest/_search
{
  "_source": "content.autocorrect",
  "query": {
    "match_phrase": {
      "content.autocorrect": "analytics e"
    }
  },
  "highlight": {
    "fields": {
      "content.autocorrect": {
        "fragment_size": 500,
        "number_of_fragments": 1
      }
    }
  }
}
We then apply a regex pattern to the highlight fragments in the above query result.
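A minimal sketch of this kind of post-processing, assuming the highlighter uses its default <em> and </em> tags and that the words of a matched phrase are each wrapped separately with only whitespace between them. The class name, method name, and regex are hypothetical stand-ins, and the input fragment is made up from the sample document rather than copied from a real response.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HighlightSuggestions {

    // One or more <em>word</em> groups separated only by whitespace,
    // e.g. "<em>analytics</em> <em>engine</em>". Assumes the default highlight tags.
    private static final Pattern HIGHLIGHT_RUN =
            Pattern.compile("<em>[^<]+</em>(?:\\s+<em>[^<]+</em>)*");

    private static final Pattern TAGS = Pattern.compile("</?em>");

    // Turns highlight fragments into a de-duplicated, lowercase suggestion list,
    // keeping the order in which suggestions are first seen.
    public static List<String> extract(List<String> fragments) {
        Set<String> unique = new LinkedHashSet<>();
        for (String fragment : fragments) {
            Matcher m = HIGHLIGHT_RUN.matcher(fragment);
            while (m.find()) {
                String phrase = TAGS.matcher(m.group()).replaceAll("")
                        .replaceAll("\\s+", " ")
                        .trim()
                        .toLowerCase(Locale.ROOT);
                unique.add(phrase);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Hypothetical fragment for the query "analytics e".
        List<String> fragments = Arrays.asList(
                "free and open search and <em>analytics</em> <em>engine</em> for all types of data");
        System.out.println(extract(fragments)); // prints [analytics engine]
    }
}
```

The drawback of this approach is that Elasticsearch still has to load and highlight the full content of every matching document, and the client has to parse the fragments.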
Please let me know if there is any way to achieve this without the above workaround.