Our goal
We would like to give our users the ability to get search suggestions as they start typing, but the Elasticsearch suggesters don't offer anything that seems to fit our use case of getting suggestions for snippets of text from articles. Ngramming and searching the titles of the documents is fine for indices with a lot of titles with great variation, but for a small number of articles, the titles just don't carry enough information and lots of search phrases return zero results. We also cannot have the users tag all the documents with relevant suggestion clues.

Our documents typically consist of a title and a description (body) plus various other properties like groups, categories and departments.

Our current solution: shingles in a separate index
Every time we index a document, we call the Elasticsearch _analyze endpoint to generate the shingles (2-5) for the description + title of the document. Each result (shingles produce a huge number of results) is then stored as a field called Suggestion in a copy of the original document in a new index. We copy the whole document because some users might want to narrow down suggestions to documents that belong to a certain category, or apply any other arbitrary filter that we give them the option to supply.

Original document (Main index):

{
    "Title": "A fabulous document",
    "Description": "A document with fabulous content",
    "Category": "A"
}

Suggestion documents (Suggestion index)

(Suggestion 1)
{
    "Title": "A fabulous document",
    "Description": "A document with fabulous content",
    "Category": "A",
    "Suggestion": "A"
}
(Suggestion 2)
{
    "Title": "A fabulous document",
    "Description": "A document with fabulous content",
    "Category": "A",
    "Suggestion": "A document"
}
...
(Suggestion N)
{
    "Title": "A fabulous document",
    "Description": "A document with fabulous content",
    "Category": "A",
    "Suggestion": "a document with"
}

But as you can see, for an article of 1000 words, we could easily get hundreds or thousands of shingles, each duplicating the entire main document.
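To put a rough number on that, here is a minimal Python sketch that counts word shingles locally. It is only an approximation of what _analyze returns: it assumes plain whitespace tokenization and lowercasing, shingle sizes 2-5 with no single-word output, and no de-duplication. The function names are made up for illustration.

```python
from typing import List


def shingle_count(num_tokens: int, min_size: int = 2, max_size: int = 5) -> int:
    """Number of shingles (before de-duplication) for a stream of num_tokens tokens."""
    # A stream of n tokens contains n - k + 1 shingles of size k.
    return sum(max(num_tokens - size + 1, 0) for size in range(min_size, max_size + 1))


def word_shingles(text: str, min_size: int = 2, max_size: int = 5) -> List[str]:
    """Naive word shingles: lowercase, split on whitespace, join adjacent tokens."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[start:start + size])
        for size in range(min_size, max_size + 1)
        for start in range(len(tokens) - size + 1)
    ]


if __name__ == "__main__":
    print(word_shingles("A document with fabulous content"))
    print(shingle_count(1000))  # 999 + 998 + 997 + 996
```

Under these assumptions a 1000-word description yields 3990 shingles before de-duplication, so one article can turn into several thousand suggestion documents.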

To search, we run a prefix query against the suggestion documents and a terms aggregation to get the word combinations that appear most frequently. Our users actually kind of like this solution, as long as they don't have anything better.
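As a sketch of what that request could look like, the Python helper below builds the search body as a plain dict. It assumes Suggestion and the filter fields (such as Category) are mapped as keyword fields; the helper name, the aggregation name "suggestions" and the default size are illustrative and not taken from our real setup.

```python
from typing import Any, Dict, List, Optional


def build_suggestion_search(
    prefix: str,
    filters: Optional[Dict[str, str]] = None,
    size: int = 10,
) -> Dict[str, Any]:
    """Build a prefix query plus terms aggregation over the Suggestion field."""
    # The prefix clause keeps only suggestion documents starting with the typed text.
    clauses: List[Dict[str, Any]] = [{"prefix": {"Suggestion": {"value": prefix}}}]
    # Optional exact-match filters, e.g. {"Category": "A"}.
    for field, value in (filters or {}).items():
        clauses.append({"term": {field: value}})
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"bool": {"filter": clauses}},
        "aggs": {"suggestions": {"terms": {"field": "Suggestion", "size": size}}},
    }


if __name__ == "__main__":
    import json
    print(json.dumps(build_suggestion_search("a doc", {"Category": "A"}), indent=2))
```

The buckets come back ordered by document count, which is what gives the most frequent word combinations first.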


Another simpler, but too slow solution
We have tried to just analyze a copy_to field (autocomplete) with a shingles analyzer, and then do a terms aggregation with a regex include-filter to remove the terms that don't start with the search phrase, but that is just way too slow and memory hungry, as the number of irrelevant terms (to a specific query) for each field is just too great.

Search: "fabulo"

{
  "size": 0,
  "aggs": {
    "autocomplete": {
      "terms": {
        "field": "autocomplete",
        "include": {
          "pattern": "fabulo(.*)"
        }
      }
    }
  },
  "query": {
    "prefix": {
      "autocomplete": {
        "value": "fabulo"
      }
    }
  }
}
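A toy Python model of why this gets expensive is sketched below. It is not how Elasticsearch works internally; it only illustrates the cost pattern described above. Each document is represented as a list of its shingle terms, and the function name and the "visited" counter are made up for illustration.

```python
import re
from collections import Counter
from typing import List, Tuple


def simulate_terms_agg(docs: List[List[str]], prefix: str) -> Tuple[Counter, int]:
    """Count terms starting with prefix across matching docs, tracking terms visited."""
    # Same shape as the include pattern "fabulo(.*)", anchored like a Lucene regex.
    pattern = re.compile(re.escape(prefix) + "(.*)")
    counts: Counter = Counter()
    visited = 0
    for terms in docs:
        unique_terms = set(terms)
        # The prefix query: skip documents without any term starting with the prefix.
        if not any(term.startswith(prefix) for term in unique_terms):
            continue
        # The aggregation: every term of a matching document is looked at,
        # and most of them are thrown away by the include filter.
        for term in unique_terms:
            visited += 1
            if pattern.fullmatch(term):
                counts[term] += 1  # doc_count: one per document containing the term
    return counts, visited


if __name__ == "__main__":
    docs = [
        ["fabulous", "fabulous content", "a document"],
        ["a document", "document with"],
        ["fabulous", "with fabulous"],
    ]
    print(simulate_terms_agg(docs, "fabulo"))
```

With thousands of shingles per article, the terms visited vastly outnumber the handful that survive the filter.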

Basing suggestions on previous searches
We are working on basing suggestions on previous search phrases, but a new user will need to have some autocomplete suggestions based on content as well, if they have very few user-generated searches.

Question:
Is there any way to do this faster, simpler, better? ElasticSearch suggesters all seem to require you to know the suggestions in advance or have descriptive titles. Seems very good for product suggestions, but not for large text-content suggestions. Plus, we have the filtering issue to take into account.

Silas Hansen
  • Have you looked at edge-nGrams? You can use a search query on the edge-nGrams fields instead of using the suggestions API. It wouldn't be as fast as suggestions can be, but you'll be able to achieve decent response time, IMO. – Archit Saxena Mar 29 '17 at 08:10
  • I am already using edgeNgrams for doing autocomplete suggestions on titles. But for many users, there just aren't enough titles to give any good suggestions and the body content (thousands of words) needs to be used. Do you see that edgeNgrams could help with that? And how would I extract suggestion phrases from that? – Silas Hansen Mar 29 '17 at 09:14
  • If you want to suggest on each keystroke, I'd use edgeN-gram on description field too. Snippets could be generated using a suitable highlighter, perhaps Postings (to return full sentences) – Archit Saxena Mar 29 '17 at 09:18
  • @ArchitSaxena, that is actually a fantastic idea. If this works, you deserve a prize, as it will allow me to get rid of 1 billion documents (not joking)! I will try it out immediately :-) – Silas Hansen Mar 29 '17 at 11:28
  • That's great. :) So what's the prize haha? – Archit Saxena Mar 30 '17 at 08:21
  • Anyway, let me know how it pans out or if you need any more suggestions. I'd be happy to help. – Archit Saxena Mar 30 '17 at 08:47
  • did you manage to get it working? @Silas – Archit Saxena Apr 05 '17 at 07:02
  • I played around with it a bit, but it seems I have to do quite some post-processing of the strings that I get back from the highlighter. I need to experiment more before I can say if it's a real solution or not. Still a very good bet though. – Silas Hansen Apr 07 '17 at 08:02
  • one more suggestion: you can highlight on a standard analyzed field, while the search could be on ngram analyzed field. – Archit Saxena Apr 07 '17 at 10:05
  • @SilasHansen did you find any solution for showing autosuggest? – Sagar Patel Mar 24 '22 at 11:14

1 Answer

We're using a combination of shingles and aggregations, feeding a dedicated index:

  1. Select all the fields that are supposed to source the autocomplete phrases and add a subfield with a shingle filter. Note that min_shingle_size must be at least 2; single words are still emitted because output_unigrams defaults to true:
          {
            "type": "shingle",
            "max_shingle_size": 3,
            "min_shingle_size": 2
          }
  2. Periodically query the index with a terms aggregation on all those fields, collect the keywords from all aggregations, and sum the doc count for each keyword (or phrase, in the case of 2- and 3-word shingles) across all aggregations.
  3. Put the resulting keywords in the separate index, with the extracted doc count as weight.
  4. Elasticsearch now supports a context field to narrow the suggestion index down, see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#context-suggester
  5. Direct the autocomplete search to the separate index.
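A rough Python sketch of the aggregation, indexing and query steps is below. Everything named here is an assumption for illustration: the aggregation names, the completion field called "suggest", and a context called "category" (which would need a completion mapping with a matching category context). It only builds and transforms plain dicts; sending them to the cluster is left out.

```python
from collections import Counter
from typing import Any, Dict, List


def sum_bucket_counts(aggregations: Dict[str, Any]) -> Counter:
    """Sum doc_count per shingle across all terms aggregations in one response."""
    totals: Counter = Counter()
    for agg in aggregations.values():
        for bucket in agg.get("buckets", []):
            totals[bucket["key"]] += bucket["doc_count"]
    return totals


def to_suggestion_docs(totals: Counter, category: str) -> List[Dict[str, Any]]:
    """Turn summed counts into documents for a completion field with a category context."""
    return [
        {"suggest": {"input": phrase, "weight": count, "contexts": {"category": [category]}}}
        for phrase, count in totals.most_common()
    ]


def build_completion_query(prefix: str, category: str, size: int = 10) -> Dict[str, Any]:
    """Completion suggester request narrowed to one category context."""
    return {
        "suggest": {
            "autocomplete": {
                "prefix": prefix,
                "completion": {
                    "field": "suggest",
                    "size": size,
                    "contexts": {"category": [category]},
                },
            }
        }
    }


if __name__ == "__main__":
    # Shape of the "aggregations" part of a search response (sample values).
    sample = {
        "title_shingles": {"buckets": [{"key": "fabulous document", "doc_count": 3}]},
        "description_shingles": {"buckets": [{"key": "fabulous document", "doc_count": 2}]},
    }
    print(to_suggestion_docs(sum_bucket_counts(sample), "A"))
    print(build_completion_query("fabulo", "A"))
```

To get per-category weights with this approach, the periodic aggregation would presumably have to be run once per category (or nested under a terms aggregation on the category field).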
dsmog
  • Thanks for the suggestion. Can you elaborate on "Elastic now support category field to narrow the suggestion index down"? I'm not sure I know which feature you are referring to exactly. – Silas Hansen Jun 24 '20 at 11:13
  • Ah it's called "context" rather than category. You can add a number of such contexts to each suggestion record to add additional filtering. Beware that the number of contexts is limited in elastic 7. See: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#context-suggester – dsmog Jun 25 '20 at 12:20