
I need to implement the following (on the backend): a user types a query and gets back hits as well as statistics for the hits. Below is a simplified example.

Suppose the query is Grif. The user then gets back (random words, just for example):

  • Griffith
  • Griffin
  • Grif
  • Grift
  • Griffins

And the frequency plus the number of documents each term occurs in, for example:

  • Griffith (freq 10, 3 docs)
  • Griffin (freq 17, 9 docs)
  • Grif (freq 6, 3 docs)
  • Grift (freq 9, 5 docs)
  • Griffins (freq 11, 4 docs)

I'm relatively new to Elasticsearch, so I'm not sure where to start with implementing something like this. What type of query is most suitable here? What can I use to get statistics like these? Any other advice would be appreciated too.

Sebastian Lore

1 Answer


There are multiple layers to this. You'd need:

  • n-gram / partial / search-as-you-type matching
  • a way to group the matched keywords by their original form
  • a mechanism to look up the document and term frequencies for the matched terms.

I'm not aware of any way to achieve this in one go, but here's my take on it.

  1. You could start off with a special, n-gram-powered analyzer, as explained in my other answer. There's the original content field, plus a multi-field mapping for that analyzer, plus a keyword field to aggregate on down the line:
PUT my-index
{
  "settings": {
    "index": {
      "max_ngram_diff": 20
    },
    "analysis": {
      "tokenizer": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "my_ngrams_analyzer": {
          "tokenizer": "my_ngrams",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "my_ngrams_analyzer"
          },
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
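To get a feel for what this analyzer will actually index, here's a rough Python emulation of the my_ngrams tokenizer plus the lowercase filter (my own illustration; the real tokenizer may emit tokens in a different order):

```python
def ngrams(text, min_gram=3, max_gram=20):
    # Emulates the "my_ngrams" tokenizer followed by the lowercase filter:
    # lowercase the input, then emit every substring whose length falls
    # between min_gram and max_gram characters.
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    ]
```

For instance, `ngrams("Grif")` produces `gri`, `rif`, and `grif`, which is why a query for `grif` can match Griffith, Griffin, Grift, and so on.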
  2. Next, bulk-insert some sample docs containing text inside the content field. Note that each doc has an _id too; you'll need those later on.
POST _bulk
{"index":{"_index":"my-index", "_id":1}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":2}}
{"content":"Griffin"}
{"index":{"_index":"my-index", "_id":3}}
{"content":"Grif"}
{"index":{"_index":"my-index", "_id":4}}
{"content":"Grift"}
{"index":{"_index":"my-index", "_id":5}}
{"content":"Griffins"}
{"index":{"_index":"my-index", "_id":6}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":7}}
{"content":"Griffins"}
  3. Search for n-grams in the .analyzed field and group the matched documents by their original terms through the terms aggregation. At the same time, retrieve the _id of one of the bucketed documents through the top_hits aggregation. Incidentally, it doesn't matter which _id is returned in a given bucket; all of its documents contain the same bucketed term.
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.*.hits.hits._id
{
  "size": 0, 
  "query": {
    "term": {
      "content.analyzed": "grif"
    }
  },
  "aggs": {
    "full_terms": {
      "terms": {
        "field": "content.keyword",
        "size": 10
      },
      "aggs": {
        "top_doc": {
          "top_hits": {
            "size": 1,
            "_source": false
          }
        }
      }
    }
  }
}
  4. Observe the response. The filter_path URL parameter from the previous request reduces the response to just the attributes we need: the untouched, original full_terms plus one of the underlying IDs:
{
  "aggregations" : {
    "full_terms" : {
      "buckets" : [
        {
          "key" : "Griffins",
          "doc_count" : 2,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "5"
                }
              ]
            }
          }
        },
        {
          "key" : "Griffith",
          "doc_count" : 2,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "1"
                }
              ]
            }
          }
        },
        {
          "key" : "Grif",
          "doc_count" : 1,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "3"
                }
              ]
            }
          }
        },
        {
          "key" : "Griffin",
          "doc_count" : 1,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "2"
                }
              ]
            }
          }
        },
        {
          "key" : "Grift",
          "doc_count" : 1,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "4"
                }
              ]
            }
          }
        }
      ]
    }
  }
}
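Back in your backend, extracting the pieces you need from this filtered response is straightforward; a minimal Python sketch (the function name is mine, and it assumes exactly the filter_path-reduced shape shown above):

```python
def extract_terms(agg_response):
    # Returns (full_term, doc_count, sample_doc_id) triples from the
    # filter_path-reduced aggregation response.
    return [
        (b["key"], b["doc_count"], b["top_doc"]["hits"]["hits"][0]["_id"])
        for b in agg_response["aggregations"]["full_terms"]["buckets"]
    ]
```

The sample doc IDs from these triples are what you feed into the term-vector lookup.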

Time for the fun part.

There's a specialized Elasticsearch API called Term Vectors which does exactly what you're after — it retrieves field & term stats from the whole index. In order for it to hand these stats over to you, it needs the document IDs — which you'll have obtained from the above aggregation!

  5. Finally, since you've got multiple term vectors to work with, you can use the Multi term vectors API like so, again condensing the response through filter_path. Note that the order of the docs in the request dictates the order of the stats in the response:
POST /my-index/_mtermvectors?filter_path=docs.term_vectors.*.*.*.doc_freq,docs.term_vectors.*.*.*.term_freq
{
  "docs": [
    {
      "_id": "5",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    },
    {
      "_id": "1",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    },
    {
      "_id": "3",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    },
    {
      "_id": "2",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    },
    {
      "_id": "4",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    }
  ]
}
  6. The result can be post-processed in your backend to form your autocomplete response. You've got A) the full terms, B) the number of matching documents (doc_freq), and C) the term frequency (term_freq):
{
  "docs" : [
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Griffins" : {
              "doc_freq" : 2,
              "term_freq" : 1
            }
          }
        }
      }
    },
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Griffith" : {
              "doc_freq" : 2,
              "term_freq" : 1
            }
          }
        }
      }
    },
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Grif" : {
              "doc_freq" : 1,
              "term_freq" : 1
            }
          }
        }
      }
    },
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Griffin" : {
              "doc_freq" : 1,
              "term_freq" : 1
            }
          }
        }
      }
    },
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Grift" : {
              "doc_freq" : 1,
              "term_freq" : 1
            }
          }
        }
      }
    }
  ]
}
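Turning that into the final autocomplete payload could then look like this small Python sketch (again my own naming; it assumes the filter_path-reduced _mtermvectors shape above):

```python
def format_suggestions(mtv_response, field="content.keyword"):
    # Flattens the filter_path-reduced _mtermvectors response into
    # human-readable suggestion lines, one per full term.
    lines = []
    for doc in mtv_response["docs"]:
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            lines.append(
                f"{term} (freq {stats['term_freq']}, {stats['doc_freq']} docs)"
            )
    return lines
```

Fed the response above, this yields lines like `Griffins (freq 1, 2 docs)`, mirroring the format from the question.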

Shameless plug: if you're new to Elasticsearch and, just like me, learn best from real-world examples, consider buying my Elasticsearch Handbook.

Joe - GMapsBook.com
  • What if my `content` field needs to be really big? Is it possible? What are the caveats? – Sebastian Lore Mar 22 '21 at 10:13
  • ES returns `Document contains at least one immense term in field="content.keyword" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '..', original message: bytes can be at most 32766 in length; got 103339` if I try to index `content` of size ~100k characters. – Sebastian Lore Mar 22 '21 at 10:15
  • `.keyword` fields are not designed for such large strings. You could try splitting the content into `content_N`, `content_N+1` etc... – Joe - GMapsBook.com Mar 22 '21 at 10:18
  • So, there's no other way, only to split into separate fields like `content_part_1`, `content_part_2`... `content_part_N`? – Sebastian Lore Mar 22 '21 at 10:43
  • To be frank, my answer was rather intended for short, keyword-like strings, not super long text fields. The above approach takes advantage of the fact that you can quickly **aggregate** on the keywords and thus get the most represented ones returned. You can do **without** this aggregation (and without the whole `keyword` mapping) to get rid of this error but will then need some other way to extract the user-defined query matches out of the long text fields -- maybe through [highlighting](https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html). – Joe - GMapsBook.com Mar 22 '21 at 10:55
  • I had been thinking about highlighting before I asked this question. Should I, say, fuzzy-search through the content field and extract the matched terms? How do I get the statistics that way? I mean, ES will return the `highlight` object like [this](https://pastebin.com/raw/1sbE0jVu) along with the hits. I don't get how to establish "relations" between hits and highlighted terms – Sebastian Lore Mar 22 '21 at 11:22
  • Use the Multi term vectors API as I did in the 2nd part of my answer. You'll need two requests -- one for the highlighting, one for the term vectors. If both are fast enough, the end user won't notice. – Joe - GMapsBook.com Mar 22 '21 at 15:18
  • OK. Just want to be sure I understood everything correctly: 1) I make a search request with highlight feature & get hits 2) Make a Multi Terms Vector request for document ids taken from hits 3) Filter response of (2) using highlighted terms from (1) – Sebastian Lore Mar 22 '21 at 15:51