I followed Rohit's suggestion and implemented an Elasticsearch script filter. For this to work, your Elasticsearch server must be configured to allow dynamic (inline) Groovy scripts.
Here's the code for the Groovy script filter:
def label_map = labels.collectEntries { entry -> [entry, 1] };
def count = 0;
for (def label : doc['label'].values) {
    if (!label_map.containsKey(label)) {
        return 0
    } else {
        count += 1
    }
};
return count
To use it in an Elasticsearch query, you need to either escape all the newline characters or put the whole script on one line, like this:
def label_map = labels.collectEntries { entry -> [entry, 1] }; def count = 0; for (def label : doc['label'].values) { if (!label_map.containsKey(label)) { return 0 } else { count += 1 } }; return count
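If you'd rather not flatten the script by hand, both options can be generated. This is only a rough sketch in standalone Groovy (run on a local Groovy install, not inside Elasticsearch); the variable names `script`, `oneLine` and `escaped` are my own. It uses `groovy.json.JsonOutput` from the standard Groovy distribution to do the escaping.

```groovy
import groovy.json.JsonOutput

// The multi-line script from above, kept in its readable form.
def script = '''
def label_map = labels.collectEntries { entry -> [entry, 1] };
def count = 0;
for (def label : doc['label'].values) {
    if (!label_map.containsKey(label)) {
        return 0
    } else {
        count += 1
    }
};
return count
'''.trim()

// Option 1: collapse to a single line. This is safe for this script because
// its statements are already separated by semicolons or braces.
def oneLine = script.readLines()*.trim().join(' ')

// Option 2: keep the newlines and let a JSON encoder escape them.
// The result is a quoted JSON string literal.
def escaped = JsonOutput.toJson(script)

println oneLine
println escaped
```

Either result can then be pasted in as the value of the "script" field in the query. Option 1 needs surrounding double quotes added, while option 2 already includes them.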
Here's an Elasticsearch query that's very similar to what I did, including the script filter:
POST /documents/_search
{
  "fields": [
    "id",
    "name",
    "label",
    "description"
  ],
  "query": {
    "function_score": {
      "query": {
        "filtered": {
          "query": {
            "bool": {
              "minimum_should_match": 1,
              "should": {
                "term": {
                  "description": "fine"
                }
              }
            }
          },
          "filter": {
            "script": {
              "script": "def label_map = labels.collectEntries { entry -> [entry, 1] }; def count = 0; for (def label : doc['label'].values) { if (!label_map.containsKey(label)) { return 0 } else { count += 1 } }; return count",
              "lang": "groovy",
              "params": {
                "labels": [
                  "foo",
                  "bar",
                  "qux",
                  "zip",
                  "baz"
                ]
              }
            }
          }
        }
      },
      "functions": [
        {
          "filter": {
            "query": {
              "match": {
                "label": "qux"
              }
            }
          },
          "boost_factor": 25
        }
      ],
      "score_mode": "multiply"
    }
  },
  "size": 10
}
My actual query had to combine the script filter with a function score query. Working out how to nest the two was not obvious, so I'm including the combination here as an example.
The script filter selects documents whose labels are a subset of the labels passed in the query's "params". For my use case (thousands of documents, not millions) it is fast: queries take tens of milliseconds.
The first time the script is used it is slow (about 1000 ms), probably because the script has to be compiled and cached. Later invocations are about 100 times faster.
A couple of notes:
- I used the Sense console Chrome plugin to debug the Elasticsearch query. Much better than using curl on the command line! (Note that Sense is now part of Marvel, so you can also get it there.)
- To implement the Groovy script, I first installed the Groovy language on my laptop, wrote some unit tests, and then implemented the script. Once I was sure the script was working, I reformatted it to fit on one line and put it into Sense.
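For anyone who wants to test the logic locally in the same way, here is a minimal sketch of what such a test could look like. It is a reconstruction, not my original test file: the closure name `subsetCount` and its two parameters are made up for illustration, with `labels` standing in for the script's params and `docLabels` standing in for `doc['label'].values`.

```groovy
// Same logic as the script filter, wrapped in a closure so it can be
// called with plain lists outside Elasticsearch.
def subsetCount = { List labels, List docLabels ->
    def labelMap = labels.collectEntries { entry -> [entry, 1] }
    def count = 0
    for (def label : docLabels) {
        if (!labelMap.containsKey(label)) {
            return 0    // a label outside the allowed set rejects the document
        }
        count += 1
    }
    return count
}

def allowed = ['foo', 'bar', 'qux', 'zip', 'baz']

// Every document label is allowed: returns the number of labels.
assert subsetCount(allowed, ['foo', 'qux']) == 2
// One label is not in the allowed set: returns 0.
assert subsetCount(allowed, ['foo', 'nope']) == 0
// A document with no labels also returns 0.
assert subsetCount(allowed, []) == 0
```

Running this with the `groovy` command exits silently if all the assertions pass.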