I have a fairly complex query that I run through the search API (the Elasticsearch Python client) for a very large number of phrases, along with some other constraints:

    query = {
        "query": {
            "bool": {
                "should": [
                    {
                        "bool": {
                            "must": [
                                {"terms": {"page_id": chunked_pages[entity]}},
                                {
                                    "bool": {
                                        "should": [
                                            {"match_phrase": {"content": {"query": name, "slop": 6}}}
                                            for name in chunked_names[entity]
                                        ]
                                    }
                                },
                            ]
                        }
                    }
                    for entity in chunked_names.keys()
                ],
                "minimum_should_match": 1,
            }
        },
        "highlight": {
            "fields": {
                "content": {}
            },
            "pre_tags": ["<em>"],
            "post_tags": ["</em>"],
        },
        "from": from_param,
        "size": results_per_request,
    }
    response = es.search(index=index_name, body=query)

For each retrieved document, I would like to know which phrase was found in it (there are thousands of potential phrases). I tried using highlighting, but the output suggests that the highlighter mixes terms across bool clauses: the returned document is correct, but the highlighted terms are unrelated to it (they break the page_id constraints).

Any idea how to deal with it?
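
For reference, here is a minimal sketch of one direction that might avoid highlighting altogether, assuming Elasticsearch's named queries (the `_name` parameter on a clause and the `matched_queries` list returned on each hit). The helper names `build_named_query` and `phrases_per_hit` and the `entity::` / `phrase::` naming scheme are hypothetical, and the sketch assumes entity names do not contain `::`. Because a phrase clause might be reported for a document even when its sibling page_id clause did not match, the helper keeps only phrases whose entity-level clause is also reported. The `highlight`, `from` and `size` parts are left out for brevity, and I have not checked how this behaves with >300k phrases.

```python
from typing import Dict, List


def build_named_query(chunked_pages: Dict[str, List[int]],
                      chunked_names: Dict[str, List[str]],
                      slop: int = 6) -> dict:
    """Build the same bool query, tagging entity and phrase clauses with `_name`."""
    entity_clauses = []
    for entity, names in chunked_names.items():
        phrase_clauses = [
            {"match_phrase": {"content": {
                "query": name,
                "slop": slop,
                # hypothetical naming scheme: "phrase::<entity>::<phrase>"
                "_name": f"phrase::{entity}::{name}",
            }}}
            for name in names
        ]
        entity_clauses.append({
            "bool": {
                "must": [
                    {"terms": {"page_id": chunked_pages[entity]}},
                    {"bool": {"should": phrase_clauses}},
                ],
                # hypothetical naming scheme: "entity::<entity>"
                "_name": f"entity::{entity}",
            }
        })
    return {"query": {"bool": {"should": entity_clauses,
                               "minimum_should_match": 1}}}


def phrases_per_hit(response: dict) -> Dict[str, List[str]]:
    """Map each hit's _id to the phrases whose entity clause also matched."""
    result = {}
    for hit in response["hits"]["hits"]:
        matched = hit.get("matched_queries", [])
        # Entities whose whole clause (page_id + phrase) matched this hit.
        entities = {m.split("::", 1)[1] for m in matched
                    if m.startswith("entity::")}
        phrases = []
        for m in matched:
            if m.startswith("phrase::"):
                _, entity, phrase = m.split("::", 2)
                if entity in entities:
                    phrases.append(phrase)
        result[hit["_id"]] = phrases
    return result
```

The returned dict could be passed as `body=` to `es.search` in the same way as the query above, and the response then passed to `phrases_per_hit`.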

Latent
  • I've tested some data with your query, and the result is correct. Can you give some examples, including your input and output? – Mathew Apr 11 '23 at 13:24
  • @Mathew it would be too hard to share data. The issue is that the highlights do not contain the entire phrase that was found: if the phrase is "A B C D", I may get only "A B", even though I know C D are there too (checked manually). I want to know which phrase was found in each retrieved document, since I have >300k phrases. – Latent Apr 16 '23 at 13:54

0 Answers