Elasticsearch subset filter

Question

I have a dataset about books, each of which can be in one or more languages. Every user is registered as having one or more languages.

When a user searches for books, I'd like to return only those books for which they understand every language.

For example, the following two books are in the system:

Book A: English, French, German
Book B: English, Greek

If John is registered as knowing English, German, French, and Italian, then his query results should never include Book B, since he does not know Greek. Book A is fine.

My system is currently written using Apache Solr, where I ended up writing a plugin to perform a subset operation: a record matches if its languages are a subset of the user's languages, which are declared in the query.

I'd like to transition to an Elasticsearch backend, but this particular subsetting behavior doesn't seem to be part of the core filter package. Am I missing something, or should I look at writing a similar plugin / custom filter?

Did you find out if there is a way that ElasticSearch (or any other database / search system) covers this? — Tor, Feb 12 '15 at 00:04
No - I ended up with an alternative implementation, where I query Elasticsearch for books and then do a quick bit of Java on them to figure out whether the user can understand all the languages. If some languages are not understood, the result set is flagged, communicating this to the user. I'd love to hear that there's a way to implement the original design, though! — Ryan Kohl, Feb 12 '15 at 18:02
Here are my own notes so far: https://docs.google.com/document/d/1ngZU89rWc3fQMU8CSSW_yFCyeWEmJzJRngmEl88IrvE. If you or anyone else finds a ready-made solution for this (that potentially could handle large sets), then let me know. I'll certainly come back here myself if I find something :) — Tor, Feb 12 '15 at 22:51

dsathe · Accepted Answer · 2015-04-27T19:14:47.833

This can be done using a script filter. You can pass it a comma-separated list of strings as a param and use a for loop to check that each of the book's languages is contained in it. If even one is not, break and return false; if all are present, the loop exits and the script returns true.
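As a rough sketch of that loop, here is a small Python helper that builds a search body with a script filter. Everything specific in it is an assumption rather than something from the question: the field name `languages`, its mapping as a non-analyzed keyword field, the use of Painless with the newer `bool`/`filter` request layout (older 1.x clusters used a different layout and script language), and passing the user's languages as a list param instead of a comma-separated string. `is_subset` is only a plain-Python mirror of the script, for checking the logic locally.

```python
import json


def subset_script_query(known_languages, field="languages"):
    """Build a search body whose script filter keeps only documents
    where every value of `field` appears in `known_languages`.

    The field name and the Painless source are illustrative assumptions.
    """
    # Loop over the document's values and return false on the first
    # language the user does not know; return true if the loop finishes.
    source = (
        "for (def lang : doc[params.field]) {"
        " if (!params.known.contains(lang)) { return false; }"
        " }"
        " return true;"
    )
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "painless",
                            "source": source,
                            "params": {
                                "field": field,
                                "known": sorted(known_languages),
                            },
                        }
                    }
                }
            }
        }
    }


def is_subset(book_languages, known_languages):
    """Plain-Python mirror of the script's loop."""
    for lang in book_languages:
        if lang not in known_languages:
            return False
    return True


if __name__ == "__main__":
    john = {"English", "German", "French", "Italian"}
    print(json.dumps(subset_script_query(john), indent=2))
```

With the books from the question, `is_subset` accepts Book A and rejects Book B for John. Note that a book with no languages at all passes the check; whether that is desirable depends on the data.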

I'm not sure how efficient this is, but in theory it can be done in Elasticsearch. Ideally, apply an optimized filter first to narrow down the set of books, and then run the script only on that subset. Look at https://www.elastic.co/blog/all-about-elasticsearch-filter-bitsets and the docs on post_filter. The efficiency should be tested over a batch of queries, as this filter will perform better once its results begin to be cached.

score 0 · Answer 2 · answered Apr 27 '15 at 19:00

Another possible approach is to turn the problem on its head. This data has certain characteristics: assuming sufficient scale and real-world practicalities, the cardinality of the language field is extremely low relative to books, users, and authors. (You could reduce it further by indexing a language root as a field at index time, e.g. Latin for Italian and the other Romance languages; see http://en.wikipedia.org/wiki/List_of_proto-languages.) Users frequently know languages from the same family, so you can exploit this fact to your benefit.

The user's query then essentially becomes the difference between the set of all languages present in the index and the set the user knows: exclude every book that has any language in that difference. This can easily be modeled as a bunch of filters, using the execution: bool flag (extremely optimized bitsets internally) to cache and combine them. Be wise about the execution order of the filters; have a look at https://www.elastic.co/blog/all-about-elasticsearch-filter-bitsets
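To make the set difference concrete, here is a minimal Python sketch that computes the languages the user does not know and turns them into an exclusion clause. It assumes the application can supply the full list of languages present in the index (for instance from a terms aggregation or a static list), that the field is called `languages` and holds exact, non-analyzed values, and it uses a `bool` query with `must_not` and `terms` in place of the older `execution: bool` terms filter named above. The function name is hypothetical.

```python
def exclusion_query(all_languages, known_languages, field="languages"):
    """Build a search body that drops every book having at least one
    language outside `known_languages`.

    `all_languages` is assumed to be the complete set of values that
    occur in `field`; the field name is an illustrative assumption.
    """
    # Languages present in the index that the user does not know.
    unknown = sorted(set(all_languages) - set(known_languages))
    if not unknown:
        # The user knows every language in the index: nothing to exclude.
        return {"query": {"match_all": {}}}
    return {
        "query": {
            "bool": {
                "must_not": [
                    {"terms": {field: unknown}}
                ]
            }
        }
    }


if __name__ == "__main__":
    index_languages = ["English", "French", "German", "Greek", "Italian"]
    john = ["English", "German", "French", "Italian"]
    print(exclusion_query(index_languages, john))
```

For John, the only unknown language is Greek, so Book B (English, Greek) is excluded while Book A still matches. Because the language cardinality is low, the exclusion list stays short even when a user knows only one language.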
