9

I'm using this query to search a field for occurrences of phrases.

"query": {
    "match_phrase": {
       "content": "my test phrase"
  }
 }

I need to calculate how many matches occurred for each phrase per document (if this is even possible?)

I've considered aggregations, but I don't think they meet the requirement, as they would give me the number of matches over the whole index, not per document.

Thanks.

Polynomial Proton
  • I can't think of anything better than the answers you got at https://discuss.elastic.co/t/count-of-phrase-matches-per-document/96762 . If you have a better solution, please post it here: I am looking for the same thing. – Rich May 01 '19 at 15:45
  • You could perhaps use highlighting, with "number_of_fragments" set to a high number and count the number of fragments returned? https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html – Rich May 01 '19 at 15:46
  • The question is slightly ambiguous, as Elasticsearch will count a single word match as a hit, e.g. "phrase" will match "my test phrase", so the answer to "how many matches occurred for each phrase per document" is not completely clear in the case when the phrase has matched both in full and in part in the same document. – Rich May 01 '19 at 15:49

2 Answers

7

This can be achieved by using script fields with a Painless script.

You can count the number of occurrences per field and add it up for the document.

Example:

## Here's my test index with some sample values

POST t1/doc/1  <-- this document has one occurrence
{
  "content" : "my test phrase"
}

POST t1/doc/2    <-- this document has 5 occurrences
{
  "content" : "my test phrase ",
  "content1" : "this is my test phrase 1",
  "content2" : "this is my test phrase 2",
  "content3" : "this is my test phrase 3",
  "content4" : "this is my test phrase 4"
}

POST t1/doc/3
{
  "content" : "my test new phrase"
}

Now, using the script, I can count phrase matches in each field. I'm counting at most one match per field, but you can modify the script to count multiple matches per field.

The obvious drawback is that you need to list every field of the document in the script, unless there's a way to loop through the doc fields that I am not aware of.

POST t1/_search
{
  "script_fields": {
    "phrase_Count": {
      "script": {
        "lang": "painless",
        "source": """
          int count = 0;

          // add one for each field that exists and whose value contains the phrase
          if (doc['content.keyword'].size() > 0 && doc['content.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content1.keyword'].size() > 0 && doc['content1.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content2.keyword'].size() > 0 && doc['content2.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content3.keyword'].size() > 0 && doc['content3.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content4.keyword'].size() > 0 && doc['content4.keyword'].value.indexOf(params.phrase) != -1) count++;

          return count;
        """,
        "params": {
          "phrase": "my test phrase"
        }
      }
    }
  }
}

This gives me the phrase count per document as a scripted field:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            5                 <--- count of occurrences of the phrase in the document
          ]
        }
      },
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            1
          ]
        }
      },
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            0
          ]
        }
      }
    ]
  }
}
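
The script above adds at most one to the count per field. To count every occurrence inside a field, the single `indexOf` check could be replaced by a loop that searches again from the end of the previous hit. Here is a minimal sketch of that loop in Python; the helper names `count_occurrences` and `count_in_document` are made up for illustration, and I am assuming the same logic ports to Painless via `indexOf(phrase, fromIndex)`, which I have not tested here.

```python
def count_occurrences(text: str, phrase: str) -> int:
    """Count non-overlapping occurrences of phrase in text."""
    if not phrase:
        return 0  # an empty phrase would otherwise match at every position
    count = 0
    start = text.find(phrase)
    while start != -1:
        count += 1
        # resume the search just past the end of the current hit
        start = text.find(phrase, start + len(phrase))
    return count


def count_in_document(fields: dict, phrase: str) -> int:
    """Sum the per-field counts to get one total per document."""
    return sum(count_occurrences(value, phrase) for value in fields.values())
```

Like the script above, this is a plain substring comparison on the raw field value, so it is case-sensitive and does not apply the analyzer that `match_phrase` uses.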
Polynomial Proton
  • How about query matches which are not simple substrings. Let's say my search is a proximity search like `"my phrase"~3` and I want to count the matches? – Rich May 02 '19 at 11:03
  • @Rich That would require some code changes. Painless is built on top of Java and I'm sure you can [rewrite the script to achieve proximity search](https://stackoverflow.com/questions/45631391/java-string-search-in-proximity-manner) – Polynomial Proton May 02 '19 at 16:40
  • Thanks for your reply, but this isn't the solution I am looking for. I do not want to attempt to reproduce all of the Elasticsearch / Lucene query syntax parsing code in the "painless" scripting language. I am looking for an answer which uses Elasticsearch's code for that, like the `explain` based beginning of an answer from https://discuss.elastic.co/t/count-of-phrase-matches-per-document/96762 – Rich May 15 '19 at 10:04
-1

You can use term vectors to achieve this. Please have a look at the Term Vectors documentation.
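
As a rough sketch of how term vectors could yield a phrase count: the `_termvectors` API can return the positions of each term in a field, and an exact phrase occurs wherever its terms sit at consecutive positions. The Python below assumes the positions have already been pulled out of the response into a plain dict; the function name and input shape are illustrative, not part of any client library. It also assumes the phrase has been split into the same tokens the analyzer produced.

```python
def count_phrase_from_positions(positions: dict, phrase_terms: list) -> int:
    """Count exact phrase occurrences from per-term token positions.

    positions maps each term to the list of positions at which it occurs
    in one field of one document (an assumed input shape).
    """
    if not phrase_terms:
        return 0
    # positions of every term after the first, as sets for fast lookup
    later = [set(positions.get(term, [])) for term in phrase_terms[1:]]
    count = 0
    for start in positions.get(phrase_terms[0], []):
        # the n-th following term must sit exactly n positions after the first
        if all(start + offset in later_set
               for offset, later_set in enumerate(later, start=1)):
            count += 1
    return count


# "this is my test phrase and my test phrase", tokenized on whitespace
example = {
    "this": [0], "is": [1], "my": [2, 6], "test": [3, 7],
    "phrase": [4, 8], "and": [5],
}
print(count_phrase_from_positions(example, ["my", "test", "phrase"]))
```

This still requires fetching the term positions for the whole field, which may be expensive for large documents.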

  • Please provide an example of how they would use them instead of just a link. – Jeremy W May 01 '19 at 17:21
  • I don't think that would work: I am doing a "`match`" query, not a "`term`" query, and I don't want to fetch all term frequencies from my document (which may be large), but just a count of the number of matches against my search phrase. I may have misunderstood you though, please could you give an example? – Rich May 01 '19 at 19:38