I have a large Elasticsearch index which I intend to populate from various sources. The sources sometimes contain the same documents, so I will end up with duplicate docs that differ only in their 'source' field.
To de-duplicate at search time, I see two options:
- Have Elasticsearch perform the priority filtering.
- Fetch everything and filter in Python.
I'd prefer not to filter at the Python level, to preserve pagination, so I want to ask whether there's a way to tell Elasticsearch to filter by priority based on some value in the document (in my case, 'source'). The priority is a simple ordering: if my order is A, B, C, I serve the A document if it exists, otherwise the B document, otherwise the C one.
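To make the first option concrete, here is the kind of query I imagine, as a Python dict. It assumes field collapsing on `id` plus a numeric `source_priority` field (e.g. A → 0, B → 1, C → 2) added at index time; the field name and the collapse approach are my assumptions, not something I have working:

```python
# Sketch of an Elasticsearch query body using field collapsing.
# Assumes each doc was indexed with a numeric "source_priority"
# field where a lower value means a more preferred source.
priority_query = {
    "query": {"match_all": {}},
    # Collapse hits so only one document per "id" is returned.
    "collapse": {"field": "id"},
    # Within each collapsed group, the returned hit is the one that
    # sorts first, i.e. the one from the highest-priority source.
    "sort": [{"source_priority": {"order": "asc"}}],
    "from": 0,
    "size": 10,  # pagination applies to the collapsed results
}
```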
An example set of duplicate docs would look like:
```
{
    "id": 1,
    "source": "A",
    "rest_of": "data",
    ...
},
{
    "id": 1,
    "source": "B",
    "rest_of": "data",
    ...
},
{
    "id": 1,
    "source": "C",
    "rest_of": "data",
    ...
}
```
But since I want to serve "A" first, then "B" if there's no "A", then "C" if there's no "B", a search for "id": 1 should return only:
```
{
    "id": 1,
    "source": "A",
    "rest_of": "data",
    ...
}
```
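For completeness, the Python-level filtering (the second option) that I want to avoid would look roughly like this; `hits` stands for the raw documents returned by a search page, and the function name is just for illustration:

```python
def dedupe_by_priority(hits, priority=("A", "B", "C")):
    """Keep one document per id, preferring earlier sources in `priority`."""
    rank = {source: i for i, source in enumerate(priority)}
    best = {}
    for doc in hits:
        current = best.get(doc["id"])
        if current is None or rank[doc["source"]] < rank[current["source"]]:
            best[doc["id"]] = doc
    return list(best.values())

# The problem: if a fetched page contains duplicates, dropping them
# here shrinks the page, which breaks pagination.
hits = [
    {"id": 1, "source": "B", "rest_of": "data"},
    {"id": 1, "source": "A", "rest_of": "data"},
    {"id": 2, "source": "C", "rest_of": "data"},
]
deduped = dedupe_by_priority(hits)
```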
Note: Alternatively, I could de-duplicate during the population phase, but I'm worried about the performance cost. I'm willing to explore this if there's no trivial way to implement the first option.
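If I went the population route, the naive approach I'd consider is resolving winners before bulk indexing (a sketch; the function name and input shape are mine):

```python
def merge_sources(docs_by_source, priority=("A", "B", "C")):
    """Walk sources in priority order and keep the first doc seen per id,
    so higher-priority sources win without per-doc existence checks."""
    merged = {}
    for source in priority:
        for doc in docs_by_source.get(source, []):
            merged.setdefault(doc["id"], doc)
    return list(merged.values())
```

The worry is that this only works if all sources can be gathered before indexing; when sources arrive independently, each insert needs an existence check (or a scripted upsert) against the index, which is the performance cost I mentioned.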