We're using Elasticsearch to store much of our data, but we keep running into cases where we have to subset a search by a very large number of candidate values, for instance to enforce permission controls.
When you build an Elasticsearch query, a `terms` clause is capped at `index.max_terms_count` values per field, which defaults to 65,536. That limit is causing problems for us.
Here's a more concrete example. It's a little contrived, but I think it gets the problem across. Imagine we have an index called `people` with the fields `name`, `person_id`, and `age`. And we don't have just a few people; we have 1 billion people in the index.
Then we have another index called `stores` with the fields `store_id`, `name`, `address`, and `person_ids`, where `person_ids` lists everyone who has ever bought something at a given store.
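
For concreteness, here's roughly what those mappings might look like, sketched with the Python `elasticsearch` client. Only the index and field names come from the description above; the client URL and the field types are assumptions.

```python
from elasticsearch import Elasticsearch

# Assumed local cluster; point this at your own deployment.
es = Elasticsearch("http://localhost:9200")

# `people`: one document per person, 1 billion of them.
es.indices.create(index="people", mappings={
    "properties": {
        "name":      {"type": "text"},
        "person_id": {"type": "keyword"},
        "age":       {"type": "integer"},
    },
})

# `stores`: one document per store; `person_ids` is an array of every
# customer who has ever bought something there.
es.indices.create(index="stores", mappings={
    "properties": {
        "store_id":   {"type": "keyword"},
        "name":       {"type": "text"},
        "address":    {"type": "text"},
        "person_ids": {"type": "keyword"},
    },
})
```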
Now, we want to do a search for, say, people called "Emily" who are over 35 and have shopped at a given store. Most of the time when we want to do a query like this, there's no problem because a typical store (in this hypothetical) has under 1000 customers who've ever been there.
But imagine some of the stores are huge and have 1 million people who've visited them. Searching against that subset is impossible the naïve way, because adding 1 million `person_id` values to a `terms` clause against `people` blows far past the ~65k limit.
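
In code, the naïve two-step approach looks something like the sketch below. The store ID `store-123` and the document shapes are made up; with ~1,000 customers this works fine, and with ~1,000,000 it's rejected.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Step 1: fetch the store document to get its customer list.
store = es.get(index="stores", id="store-123")
person_ids = store["_source"]["person_ids"]

# Step 2: filter `people` by that list. Fine at ~1,000 IDs, but with
# ~1,000,000 the `terms` clause is rejected for exceeding
# index.max_terms_count (65,536 by default).
resp = es.search(
    index="people",
    query={
        "bool": {
            "must": [
                {"match": {"name": "Emily"}},
                {"range": {"age": {"gt": 35}}},
            ],
            "filter": [
                {"terms": {"person_id": person_ids}},
            ],
        }
    },
)
```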
The naïve fix would be to denormalize the data: store a `store_ids` list on each `people` document, then search for people called "Emily" over 35 with one extra term for the given `store_id`. We don't want to do this because of the obvious problem of keeping the denormalized data in sync as people shop at new stores.
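
To make the trade-off concrete, here's a sketch of both halves, assuming the same hypothetical IDs as above: the query becomes trivial, but every purchase now has to touch the person document too.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Read side: with a `store_ids` array on each person document,
# the search is one small query.
resp = es.search(
    index="people",
    query={
        "bool": {
            "must": [
                {"match": {"name": "Emily"}},
                {"range": {"age": {"gt": 35}}},
            ],
            "filter": [
                {"term": {"store_ids": "store-123"}},
            ],
        }
    },
)

# Write side: every purchase must also append to the person document,
# e.g. with a scripted update. This is the sync burden we want to avoid.
es.update(
    index="people",
    id="person-42",
    script={
        "source": (
            "if (!ctx._source.store_ids.contains(params.sid)) {"
            "  ctx._source.store_ids.add(params.sid);"
            "}"
        ),
        "params": {"sid": "store-123"},
    },
)
```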
We could also move the data into a relational database, which would turn this into an ordinary join, but then we lose the fuzzy text matching and the other, more complicated matching procedures we rely on Elasticsearch for.
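
For instance, the kind of thing that's one line in Elasticsearch and painful in SQL (a sketch, with a deliberately misspelled query):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Typo-tolerant matching: "Emilly" still finds people named Emily.
resp = es.search(
    index="people",
    query={"match": {"name": {"query": "Emilly", "fuzziness": "AUTO"}}},
)
```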
Thus the question: is there a clever way to subset an Elasticsearch query against a big predicate like this?