9

Given a phrase match query like this:

{
    'match_phrase': {
        'text.english': {
            'query': "The fox jumped over the wall",
            'phrase_slop': 4,
        }
    }
}

Is there a way I can group results by the exact match?

So if I have 1 document with text.english containing "The quick fox jumps over the small wall" and 3 documents containing "The lazy fox jumped over the big wall", I would like to end up with those two groups of results.

I'm OK with running multiple queries and doing some processing outside of ES, but I need a solution that performs reasonably over a large set of documents. Ideally I'm hoping there's a way to do this using aggregations that I've missed.

The best solution I've come up with is to run the query above with highlights, parse out all of the highlights from all of the results, and group them based on highlight content. This is fine for very small result sets; however, over a result set of 1000+ documents it is prohibitively slow.

EDIT: Maybe I can make this a bit clearer. If I have sample documents with the following values:

  1. "The quick fox jumps over the small wall. Blah blah blah many pages of unrelated text."
  2. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."
  3. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."
  4. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."

I want to be able to group my results as follows with query text "The fox jumped over the wall":

  • "The quick fox jumps over the small wall" - Document 1
  • "The lazy fox jumped over the big wall" - Documents 2, 3, 4
Cole Maclean
  • What are you trying to achieve? From those two sample documents, can you explain what should be the desired outcome? – Andrei Stefan Oct 27 '15 at 11:01
  • Ok, so you want your query to match, but the results should be grouped by the text they matched? A simple aggregation on the `text.english.raw` should do it (where `.raw` is a `not_analyzed` subfield). – Andrei Stefan Oct 27 '15 at 11:49
  • Exactly, I want to group the results by the exact match text. I have both an analysed and a raw copy of each doc. How does the aggregation work though? I couldn't find one that would do that. – Cole Maclean Oct 27 '15 at 12:00
  • `"The lazy fox jumped over the big wall"` this is the text that was indexed initially. Do you want to group based on this text or on something else? What if your text has 5 lines, do you want to group on this entire text? – Andrei Stefan Oct 27 '15 at 12:02
  • I want to group based on the match, not the entire text. – Cole Maclean Oct 27 '15 at 12:03
  • And for `"The lazy fox jumped over the big wall"` what should be the text that matched? `The fox jumped over the wall`? (that's the text you searched) – Andrei Stefan Oct 27 '15 at 12:06
  • The match should be the initially indexed text. "The lazy fox jumped over the big wall". – Cole Maclean Oct 27 '15 at 14:48
  • I think the best option you have is highlighting and a following step of processing the results. Maybe we can improve that slow response, if possible. I'm wondering what query are you using when saying `over a 1000+ document result set it is prohibitively slow`. – Andrei Stefan Oct 27 '15 at 23:34
  • The query itself is not slow, but highlighting is very slow over a lot of results. The dataset is about 1300 documents but they average around 300,000 words, which I think is why the highlighting is taking so long. – Cole Maclean Oct 28 '15 at 11:00
  • Most likely, yes. But, I don't think you have any other option. Highlighting is the only option to bring forth the results that actually matched in a document. – Andrei Stefan Oct 28 '15 at 11:13
  • Yeah, OK. I was hoping there was something else, but thanks for confirming. – Cole Maclean Oct 28 '15 at 11:29
  • @AndreiStefan can you write a quick answer saying highlighting is the only option and I'll accept it? – Cole Maclean Oct 29 '15 at 15:14

4 Answers

2

If the statements inside your text.english are exactly the same, then their scores should be the same. You could aggregate results based on the Elasticsearch `_score`.

Please refer to this SO question: "ElasticSearch: aggregation on _score field?"

Since ES has disabled dynamic scripting by default, this might also help: "ElasticSearch: aggregation on _score field w/ Groovy disabled"

ChintanShah25
  • Thanks, I hadn't thought of that. It's very close, but the problem is that as the text is analyzed and stemmed, I'll have some matches that are different, but score equally (such as the two example phrases above). – Cole Maclean Oct 26 '15 at 15:14
  • Erm, maybe my comment above is misleading. I have stemmed and raw versions of the field indexed. I guess the complexity comes in because I want to match on stemmed, and group the matches by raw. – Cole Maclean Oct 27 '15 at 12:05
  • Just saw your edits. Since you have "Blah blah blah many pages of unrelated text.", the ES `_score` will be different and also you can't use terms aggregation as suggested by other users because of the same reason – ChintanShah25 Oct 27 '15 at 15:30
  • This might not be related, but since you are using highlighting: there is an ongoing issue with highlighted fragments, see [highlighting issue](https://github.com/elastic/elasticsearch/issues/9442). I have personally faced this issue. Sorry I could not help you much. – ChintanShah25 Oct 27 '15 at 15:37
2

In my opinion, highlighting is the only option because it's the only way Elasticsearch will show which "parts" of the text matched. And in your case, you want to group documents based on what "matched".

If the text had been shorter (just a few words), maybe a more involved solution would have been to split the text in a shingle-like way and somehow group on those phrases... maybe.

But for pages of text, I think the only option is to use highlighting and perform additional steps afterwards to group the highlighted parts.
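The post-processing step described above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the hits come from a search response with default `<em>` highlight tags, each fragment contains a single match, and the "matched text" is taken to be everything between the first and last highlighted term in a fragment. The helper names `matched_text` and `group_hits_by_match` are made up for this example.

```python
import re
from collections import defaultdict

# Greedy match from the first <em> to the last </em> in a fragment.
EM_SPAN = re.compile(r"<em>.*</em>", re.DOTALL)
EM_TAG = re.compile(r"</?em>")


def matched_text(fragment):
    """Return the text between the first and last highlighted term, tags removed."""
    m = EM_SPAN.search(fragment)
    if m is None:
        return None
    text = EM_TAG.sub("", m.group(0))
    return " ".join(text.split())  # collapse whitespace


def group_hits_by_match(hits, field="text.english"):
    """Map each matched phrase to the list of document ids that contain it."""
    groups = defaultdict(list)
    for hit in hits:
        for fragment in hit.get("highlight", {}).get(field, []):
            key = matched_text(fragment)
            if key is not None and hit["_id"] not in groups[key]:
                groups[key].append(hit["_id"])
    return dict(groups)
```

Here `hits` would be the `hits.hits` list of the search response. This only covers the grouping; it does nothing about the cost of highlighting very large documents, which is where the time goes according to the comments.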

Andrei Stefan
0

I have a similar problem/challenge in a product search application. I want to group products by brand, e.g.

Nikon
Nikos

To solve this problem I'm experimenting with the Suggester. The idea behind it is that the suggester will provide me with suggestions for my searches. The suggestions will be grouped and will not be repeated for all documents (even though there is possibly some other text around them). You can use a Term Suggester or a Phrase Suggester.

This approach, however, probably requires you to change how you handle the results. You have to display the suggestions as the groups and handle search results separately. The advantage of this approach is that you don't have to do the grouping yourself.

Another solution is to use a Terms Aggregation on shingles. This aggregation groups documents by word groups (shingles). To get your result, however, you have to take all the aggregation buckets and match them against your query input. See the example mapping, data and query:

PUT /so
{
   "settings": {
      "analysis": {
         "analyzer": {
            "suggestion_analyzer": {
               "tokenizer": "standard",
               "filter": [
                  "lowercase"
               ]
            },
            "analyzer_shingle": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "filter_shingle"
               ]
            }
         },
         "filter": {
            "filter_shingle": {
               "type": "shingle",
               "min_shingle_size": 4,
               "max_shingle_size": 16,
               "output_unigrams": false
            }
         }
      }
   },
   "mappings": {
      "d": {
         "properties": {
            "text": {
               "properties": {
                  "english": {
                     "type": "string",
                     "fields": {
                        "shingles": {
                           "type": "string",
                           "analyzer": "analyzer_shingle"
                        },
                        "suggest": {
                           "type": "completion",
                           "index_analyzer": "analyzer_shingle",
                           "search_analyzer": "analyzer_shingle",
                           "payloads": true
                        }
                     }
                  }
               }
            }
         }
      }
   }
}

Document 1:

POST /so/d/1
{
    "text": {
        "english": "The quick fox jumps over the big wall. JJKJKJKJ"
    }
}

Document 2:

POST /so/d/2
{
    "text": {
        "english": "The quick fox jumps over the small wall. JJKJKJKJ"
    }
}

Document 3:

POST /so/d/3
{
    "text": {
        "english": "The quick fox jumps over the gugus wall. LLLLLLL"
    }
}

Query:

POST /so/_search
{
    "size": 0,
    "query": {
        "match": {
           "text.english": "The quick fox jumps over the wall"
        }
    }, 
    "aggs" : {
        "states" : {
            "terms" : {
                "field" : "text.english.shingles",
                "size": 40
            }
        }
    }
}
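The "match them with your query input" step is left open above. One possible way to do it, sketched in Python under some assumptions: the buckets are the `aggregations.states.buckets` list from the response, a shingle counts as a match if it starts and ends with the first and last query words and contains all query words in order, and `max_extra` is a made-up parameter that loosely plays the role of slop (it is not the same thing as Elasticsearch's slop).

```python
def is_subsequence(needle, haystack):
    """True if all items of needle occur in haystack in the same order."""
    it = iter(haystack)
    return all(tok in it for tok in needle)


def matching_shingles(buckets, query, max_extra=4):
    """Return {shingle: doc_count} for shingles that look like a match for query.

    max_extra is the number of additional words tolerated inside the shingle.
    """
    q = query.lower().split()
    result = {}
    for bucket in buckets:
        s = bucket["key"].lower().split()
        if len(s) < len(q) or len(s) - len(q) > max_extra:
            continue
        # Keep only shingles that span exactly from the first to the last query word.
        if s[0] != q[0] or s[-1] != q[-1]:
            continue
        if is_subsequence(q, s):
            result[bucket["key"]] = bucket["doc_count"]
    return result
```

Two caveats: the shingle field in the mapping above is not stemmed, so "jumps" and "jumped" would not be treated as the same word here, and a terms aggregation only returns the top `size` buckets, so with long documents the relevant shingles may not be among them.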
paweloque
-1

I believe you could create a terms aggregation over a `not_analyzed` version of the field.

If text.raw is defined as `not_analyzed`, an aggregation should use the whole field value as the bucket key.

I have not tested it, but I found something quite similar: "ElasticSearch terms aggregation by entire field"
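As a rough sketch of what such a request could look like, here is a small Python helper that builds the search body. The field name `text.english.raw` is taken from the comments on the question and is assumed to be a `not_analyzed` subfield; the function name, the aggregation name `by_raw_text` and the use of `slop` (the parameter name documented for `match_phrase`) are choices made for this example.

```python
import json


def build_grouping_request(query_text, slop=4,
                           analyzed_field="text.english",
                           raw_field="text.english.raw",
                           size=10):
    """Build a search body: phrase match on the analyzed field,
    terms aggregation on the (assumed) not_analyzed subfield."""
    return {
        "size": 0,  # only the aggregation buckets are needed
        "query": {
            "match_phrase": {
                analyzed_field: {"query": query_text, "slop": slop}
            }
        },
        "aggs": {
            "by_raw_text": {
                "terms": {"field": raw_field, "size": size}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_grouping_request("The fox jumped over the wall"), indent=2))
```

Note that this groups by the entire field value, so it only produces the desired groups when the documents contain nothing besides the matching sentence, which is not the case for the documents in the question.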

Slomo