I have documents that have a list of labels:


    {
       "fields": {
          "label": [
               "foo",
               "bar",
               "baz"
          ],
          "name": [
             "Document One"
          ],
          "description" : "A fine first document",
          "id" : 1
       }
    },
    {
       "fields": {
          "label": [
               "foo",
               "dog"
          ],
          "name": [
             "Document Two"
          ],
          "description" : "A fine second document",
          "id" : 2
       }
    }

I have a list of terms:


    [ "foo", "bar", "qux", "zip", "baz"]

I want a query that returns only documents whose labels all appear in the list of terms, with no labels outside it.

So given the list above, the query would return Document One but not Document Two (because it has the label dog, which is not in the list of terms).
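
To make the requirement concrete, here is a minimal Python sketch of the check I'm after, run outside Elasticsearch on the two sample documents. The function name `labels_within_terms` and the flattened document layout are only illustrative assumptions, not part of any Elasticsearch API.

```python
# Sample documents from above, reduced to the fields that matter here.
documents = [
    {"id": 1, "name": "Document One", "label": ["foo", "bar", "baz"]},
    {"id": 2, "name": "Document Two", "label": ["foo", "dog"]},
]
terms = ["foo", "bar", "qux", "zip", "baz"]


def labels_within_terms(labels, allowed):
    """Return True when every label appears in the allowed terms."""
    return set(labels).issubset(allowed)


matching = [d["name"] for d in documents if labels_within_terms(d["label"], terms)]
print(matching)  # ['Document One']
```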

I've tried doing a query using a not terms filter, like this:


    POST /documents/_search?size=1000
    {
       "fields": [
          "id",
          "name",
          "label"
       ],
       "filter": {
           "not": {
               "filter" : {
                   "bool" : {
                       "must_not": {
                          "terms": {
                             "label": [
                                "foo",
                                "bar",
                                "qux",
                                "zip",
                                "baz"
                             ]
                          }
                       }
                   }
               }
           }
       }
    }

But that didn't work.
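
As far as I can tell, the problem is that the outer not and the inner must_not cancel each other out, which leaves a plain terms filter that matches any document with at least one label in the list. The Python sketch below only models that boolean logic on the sample data; the function names are made up for illustration, and it does not call Elasticsearch.

```python
documents = {
    "Document One": ["foo", "bar", "baz"],
    "Document Two": ["foo", "dog"],
}
terms = {"foo", "bar", "qux", "zip", "baz"}


def terms_filter(labels):
    # A terms filter matches when at least one label is in the list.
    return any(label in terms for label in labels)


def attempted_filter(labels):
    # not( bool.must_not( terms ) ): the two negations cancel out.
    must_not_terms = not terms_filter(labels)
    return not must_not_terms


attempt_matches = [name for name, labels in documents.items() if attempted_filter(labels)]
print(attempt_matches)  # ['Document One', 'Document Two']
```

Both documents contain foo, so both pass, including Document Two with its extra dog label.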

How can I create a query that, given a list of terms, will match documents that only contain terms in the list, and no other terms? In other words, the labels of every returned document should be a subset of the supplied terms.

Adam F

2 Answers

I followed Rohit's suggestion, and implemented an Elasticsearch script filter. You will need to configure your Elasticsearch server to allow dynamic (inline) Groovy scripts.

Here's the code for the Groovy script filter:

    def label_map = labels.collectEntries { entry -> [entry, 1] };
    def count = 0;

    for (def label : doc['label'].values) {
        if (!label_map.containsKey(label)) {
            return 0
        } else {
            count += 1
        }
    };

    return count
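
For readers who don't know Groovy, here is a rough Python model of the same logic. It is only a sketch: `script_filter` is a made-up name, and `doc_labels` stands in for `doc['label'].values`. The script returns a count rather than a boolean; since it works as a filter for me, a non-zero result appears to be treated as a match.

```python
def script_filter(doc_labels, labels):
    """Python model of the Groovy script: 0 means no match, non-zero means match."""
    label_map = {entry: 1 for entry in labels}
    count = 0
    for label in doc_labels:
        if label not in label_map:
            # Found a label outside the supplied list: reject the document.
            return 0
        count += 1
    return count


terms = ["foo", "bar", "qux", "zip", "baz"]
print(script_filter(["foo", "bar", "baz"], terms))  # 3
print(script_filter(["foo", "dog"], terms))         # 0
print(script_filter([], terms))                     # 0
```

In this model, an empty label list also returns 0. So, assuming Elasticsearch hands the script an empty list for a document without labels, such documents are excluded even though an empty set is technically a subset.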

To use it in an Elasticsearch query, you either need to escape all the newline characters, or place the script on one line like this:

    def label_map = labels.collectEntries { entry -> [entry, 1] }; def count = 0; for (def label : doc['label'].values) { if (!label_map.containsKey(label)) { return 0 } else { count += 1 } }; return count
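
If you build the request from code rather than by hand, both options can be automated. The sketch below uses Python's standard `json` module. The variable names are illustrative, and collapsing whitespace is only safe here because this script ends its statements with semicolons and contains no string literals with significant whitespace.

```python
import json

groovy_script = """
def label_map = labels.collectEntries { entry -> [entry, 1] };
def count = 0;

for (def label : doc['label'].values) {
    if (!label_map.containsKey(label)) {
        return 0
    } else {
        count += 1
    }
};

return count
"""

# Option 1: collapse every run of whitespace so the script fits on one line.
one_line_script = " ".join(groovy_script.split())

# Option 2: keep the newlines and let the JSON encoder escape them as \n.
script_filter = {
    "script": {
        "script": groovy_script,
        "lang": "groovy",
        "params": {"labels": ["foo", "bar", "qux", "zip", "baz"]},
    }
}
encoded = json.dumps(script_filter)

print(one_line_script)
print(encoded)
```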

Here's an Elasticsearch query that's very similar to what I did, including the script filter:

    POST /documents/_search
    {
       "fields": [
          "id",
          "name",
          "label",
          "description"
       ],
       "query": {
          "function_score": {
             "query": {
                "filtered": {
                   "query": {
                      "bool": {
                         "minimum_should_match": 1,
                         "should" : {
                            "term" : {
                               "description" : "fine"
                            }
                         }
                      }
                   },
                   "filter": {
                      "script": {
                         "script": "def label_map = labels.collectEntries { entry -> [entry, 1] }; def count = 0; for (def label : doc['label'].values) { if (!label_map.containsKey(label)) { return 0 } else { count += 1 } }; return count",
                         "lang": "groovy",
                         "params": {
                            "labels": [
                               "foo",
                               "bar",
                               "qux",
                               "zip",
                               "baz"
                            ]
                         }
                      }
                   }
                }
             },
             "functions": [
                {
                   "filter": {
                      "query": {
                         "match": {
                            "label": "qux"
                         }
                      }
                   },
                   "boost_factor": 25
                }
             ],
             "score_mode": "multiply"
          }
       },
       "size": 10
    }

My actual query required combining the script filter with a function score query, which took some effort to work out, so I'm including it here as an example.

What this does is use the script filter to select documents whose labels are a subset of the labels passed in the query. For my use case (thousands of documents, not millions) this works very quickly - tens of milliseconds.

The first time the script is used, it takes a long time (about 1000 ms), probably due to compilation and caching. But later invocations are 100 times faster.

A couple of notes:

  • I used the Sense console Chrome plugin to debug the Elasticsearch query. Much better than using curl on the command line! (Note that Sense is now part of Marvel, so you can also get it there.)
  • To implement the Groovy script, I first installed the Groovy language on my laptop, wrote some unit tests, and then implemented the script. Once I was sure the script was working, I formatted it to fit on one line and put it into Sense.
Adam F
  • Before passing it to the script, it may make sense to pre-filter documents with term/terms, so that fewer documents are processed by the script – Alexey Tigarev Apr 11 '16 at 14:06
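
One possible shape for the pre-filter suggested in the comment above, sketched as a Python dict that could be serialized into the "filter" part of the request. This assumes the `and` filter from the same Elasticsearch 1.x filter DSL the answer uses; the variable names are illustrative, and whether it actually helps would need measuring. The terms filter should not change the result, because any document the script accepts has at least one label in the list.

```python
import json

terms = ["foo", "bar", "qux", "zip", "baz"]

# The one-line script from the answer, unchanged.
script_source = (
    "def label_map = labels.collectEntries { entry -> [entry, 1] }; "
    "def count = 0; "
    "for (def label : doc['label'].values) { "
    "if (!label_map.containsKey(label)) { return 0 } else { count += 1 } }; "
    "return count"
)

prefiltered = {
    "and": [
        # Inexpensive index lookup: keep documents with at least one listed label.
        {"terms": {"label": terms}},
        # Script check on the remaining documents: reject any label outside the list.
        {
            "script": {
                "script": script_source,
                "lang": "groovy",
                "params": {"labels": terms},
            }
        },
    ]
}

print(json.dumps(prefiltered, indent=2))
```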

You can use a script filter to check whether the terms array contains every value in a document's label array. I suggest putting the script in a separate Groovy or plain JavaScript file, placing it in config/scripts/folderToYourScript, and using it in your query in filter: { script : {script_file: file } }

In the script file, you can use a loop to check the requirement.

binariedMe
  • Rohit, I tried this and it worked. It's also reasonably fast for my use case. I will post the code later. – Adam F Jul 08 '15 at 00:08
  • thanks for reverting back... Do post the code for the help of others. I could have posted the code myself but I don't have a specific problem case. – binariedMe Jul 08 '15 at 08:14