Elasticsearch - Count of matches per document

Question

I'm using this query to search a field for occurrences of a phrase:

"query": {
  "match_phrase": {
    "content": "my test phrase"
  }
}

I need to count how many times each phrase matched in each document, if that is even possible.

I've considered aggregations, but I don't think they meet the requirement, since they would give me the number of matches across the whole index rather than per document.

Thanks.

I can't think of anything better than the answers you got at https://discuss.elastic.co/t/count-of-phrase-matches-per-document/96762 . If you have a better solution, please post it here: I am looking for the same thing. — Rich, May 01 '19 at 15:45
You could perhaps use highlighting, with "number_of_fragments" set to a high number and count the number of fragments returned? https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html — Rich, May 01 '19 at 15:46
The question is slightly ambiguous, as Elasticsearch will count a single word match as a hit, e.g. "phrase" will match "my test phrase", so the answer to "how many matches occurred for each phrase per document" is not completely clear in the case when the phrase has matched both in full and in part in the same document. — Rich, May 01 '19 at 15:49

Polynomial Proton · Answer 1 · 2019-05-02T16:47:13.477

This can be achieved with script fields and a Painless script.

You can count the number of occurrences per field and add them up for the document.

Example:

## Here's my test index with some sample values

POST t1/doc/1  <-- this document has one occurrence
{
  "content" : "my test phrase"
}

POST t1/doc/2    <-- this document has 5 occurrences
{
  "content": "my test phrase ",
  "content1": "this is my test phrase 1",
  "content2": "this is my test phrase 2",
  "content3": "this is my test phrase 3",
  "content4": "this is my test phrase 4"
}

POST t1/doc/3
{
  "content" : "my test new phrase"
}

Now, using the script, I can count the phrase matches in each field. I'm counting at most one match per field, but you can modify the script to count multiple matches per field.

The obvious drawback is that you need to list every field of the document in the script, unless there's a way to loop through the doc fields that I am not aware of.
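
One possible way around both limitations is to pass the field names to the script as a parameter and loop over them, counting every occurrence rather than stopping at the first. The Python sketch below builds such a request body. The helper names `build_phrase_count_request` and `count_occurrences` are made up for illustration, the Painless source is an untested sketch, and it assumes every field has a `.keyword` sub-field (as the default dynamic mapping creates, where values longer than 256 characters are not indexed into `.keyword`).

```python
import json

# Painless source (an untested sketch): loops over the field names passed in
# params.fields and counts every non-overlapping occurrence of params.phrase.
PAINLESS_SOURCE = """
if (params.phrase.length() == 0) return 0;
int count = 0;
for (def field : params.fields) {
  if (doc.containsKey(field) && doc[field].size() > 0) {
    String value = doc[field].value;
    int pos = value.indexOf(params.phrase);
    while (pos != -1) {
      count++;
      pos = value.indexOf(params.phrase, pos + params.phrase.length());
    }
  }
}
return count;
"""


def build_phrase_count_request(phrase, fields):
    """Build a _search body with a script field that counts the phrase.

    Assumption: each name in `fields` has a `.keyword` sub-field.
    """
    return {
        "script_fields": {
            "phrase_Count": {
                "script": {
                    "lang": "painless",
                    "source": PAINLESS_SOURCE,
                    "params": {
                        "phrase": phrase,
                        "fields": [name + ".keyword" for name in fields],
                    },
                }
            }
        }
    }


def count_occurrences(text, phrase):
    """Python mirror of the Painless loop, handy for checking the logic."""
    if not phrase:
        return 0
    count = 0
    pos = text.find(phrase)
    while pos != -1:
        count += 1
        pos = text.find(phrase, pos + len(phrase))
    return count


body = build_phrase_count_request("my test phrase", ["content", "content1"])
print(json.dumps(body["script_fields"]["phrase_Count"]["script"]["params"]))
print(count_occurrences("my test phrase, then my test phrase again", "my test phrase"))
```

The printed body can be sent to `POST t1/_search` in place of the hand-written script. Like the original, this is plain substring matching on the raw field value, not an analyzed phrase match.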

POST t1/_search
{
  "script_fields": {
    "phrase_Count": {
      "script": {
        "lang": "painless",
        "source": """
          int count = 0;

          if (doc['content.keyword'].size() > 0 && doc['content.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content1.keyword'].size() > 0 && doc['content1.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content2.keyword'].size() > 0 && doc['content2.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content3.keyword'].size() > 0 && doc['content3.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content4.keyword'].size() > 0 && doc['content4.keyword'].value.indexOf(params.phrase) != -1) count++;

          return count;
        """,
        "params": {
          "phrase": "my test phrase"
        }
      }
    }
  }
}

This returns the phrase count per document as a script field:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            5                 <--- count of occurrences of the phrase in the document
          ]
        }
      },
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            1
          ]
        }
      },
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            0
          ]
        }
      }
    ]
  }
}
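
On the client side, the counts can be pulled out of a response like this one with a few lines of code. This is a minimal Python sketch; `phrase_counts_by_id` is a hypothetical helper name, and the sample dictionary is a trimmed copy of the response above rather than live output.

```python
def phrase_counts_by_id(response, field_name="phrase_Count"):
    """Map each hit's _id to the first value of the given script field."""
    counts = {}
    for hit in response["hits"]["hits"]:
        # Script field values always come back wrapped in a list.
        values = hit.get("fields", {}).get(field_name, [])
        counts[hit["_id"]] = values[0] if values else 0
    return counts


# Trimmed copy of the search response shown above.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "2", "fields": {"phrase_Count": [5]}},
            {"_id": "1", "fields": {"phrase_Count": [1]}},
            {"_id": "3", "fields": {"phrase_Count": [0]}},
        ]
    }
}

print(phrase_counts_by_id(sample_response))
```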

How about query matches which are not simple substrings. Let's say my search is a proximity search like `"my phrase"~3` and I want to count the matches? — Rich, May 02 '19 at 11:03
@Rich That would require some code changes. Painless is built on top of Java and I'm sure you can [rewrite the script to achieve proximity search](https://stackoverflow.com/questions/45631391/java-string-search-in-proximity-manner) — Polynomial Proton, May 02 '19 at 16:40
Thanks for your reply, but this isn't the solution I am looking for. I do not want to attempt to reproduce all of the Elasticsearch / Lucene query syntax parsing code in the "painless" scripting language. I am looking for an answer which uses Elasticsearch's code for that, like the `explain` based beginning of an answer from https://discuss.elastic.co/t/count-of-phrase-matches-per-document/96762 — Rich, May 15 '19 at 10:04

score -1 · Answer 2 · answered May 01 '19 at 17:04

You can use term vectors to achieve this. Please have a look at the Term Vectors documentation.
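
A rough sketch of what this could look like, in Python: request term vectors with positions for one document and field, then count the places where the phrase terms sit at consecutive positions. The function name `count_phrase_from_term_vectors` and the sample response are made up for illustration. The response shape follows the `_termvectors` API (`term_vectors.<field>.terms.<term>.tokens[].position`), and the phrase terms are assumed to be analyzed already (lowercased by the standard analyzer, for instance). It handles exact phrases only, and it fetches the term vector for the whole field, which may be costly for large documents.

```python
def count_phrase_from_term_vectors(response, field, phrase_terms):
    """Count exact phrase occurrences using token positions."""
    if not phrase_terms:
        return 0
    terms = response["term_vectors"][field]["terms"]
    position_sets = []
    for term in phrase_terms:
        info = terms.get(term)
        if info is None:
            return 0  # a phrase term is absent, so the phrase cannot match
        position_sets.append({tok["position"] for tok in info["tokens"]})
    # The phrase starts at position p if term i sits at p + i for every i.
    return sum(
        all(start + i in position_sets[i] for i in range(len(phrase_terms)))
        for start in position_sets[0]
    )


# Hand-built sample in the shape of a _termvectors response for a field
# containing "my test phrase is my test phrase" (illustrative, not real output).
sample = {
    "term_vectors": {
        "content": {
            "terms": {
                "my": {"term_freq": 2, "tokens": [{"position": 0}, {"position": 4}]},
                "test": {"term_freq": 2, "tokens": [{"position": 1}, {"position": 5}]},
                "phrase": {"term_freq": 2, "tokens": [{"position": 2}, {"position": 6}]},
                "is": {"term_freq": 1, "tokens": [{"position": 3}]},
            }
        }
    }
}

print(count_phrase_from_term_vectors(sample, "content", ["my", "test", "phrase"]))
```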

Abdullah Ahsan

Please provide an example of how they would use them instead of just a link. – Jeremy W May 01 '19 at 17:21
I don't think that would work: I am doing a "`match`" query, not a "`term`" query, and I don't want to fetch all term frequencies from my document (which may be large), but just a count of the number of matches against my search phrase. I may have misunderstood you though, please could you give an example? – Rich May 01 '19 at 19:38
