I am trying to figure out how to solve these two problems that I have with my ES 5.6 index.

"mappings": {
    "my_test": {
        "properties": {
            "Employee": {
                "type": "nested",
                "properties": {
                    "Name": {
                        "type": "keyword",
                        "normalizer": "lowercase_normalizer"
                    },
                    "Surname": {
                        "type": "keyword",
                        "normalizer": "lowercase_normalizer"
                    }
                }
            }
        }
    }
}

I need to create two separate scripted filters:

1 - Filter documents where size of employee array is == 3

2 - Filter documents where the first element of the array has "Name" == "John"

I have made a first attempt, but I am unable to iterate over the list: the script below always fails with a null pointer exception.

{
  "bool": {
    "must": {
      "nested": {
        "path": "Employee",
        "query": {
          "bool": {
            "filter": [
              {
                "script": {
                  "script" :     """

                   int array_length = 0; 
                   for(int i = 0; i < params._source['Employee'].length; i++) 
                   {                              
                    array_length +=1; 
                   } 
                   if(array_length == 3)
                   {
                     return true
                   } else 
                   {
                     return false
                   }

                     """
                }
              }
            ]
          }
        }
      }
    }
  }
}
betto86

2 Answers

As Val noticed, you can't access the _source of documents in script queries in recent versions of Elasticsearch. But Elasticsearch allows you to access _source in the "score context".

So a possible workaround (but you need to be careful about performance) is to use a scripted score combined with min_score in your query.

You can find an example of this behavior in the Stack Overflow post "Query documents by sum of nested field values in elasticsearch".

In your case, a query like this can do the job:

POST <your_index>/_search
{
      "min_score": 0.1,
      "query": {
        "function_score": {
          "query": {
            "match_all": {}
          },
          "functions": [
            {
              "script_score": {
                "script": {
                  "source": """
                    if (params["_source"]["Employee"].length == params.nbEmployee) {
                      def firstEmployee = params._source["Employee"].get(0);
                      if (firstEmployee.Name == params.name) {
                        return 1;
                      } else {
                        return 0;
                      }
                    } else {
                      return 0;
                    }
                  """,
                  "params": {
                    "nbEmployee": 3,
                    "name": "John"
                  }
                }
              }
            }
          ]
        }
      }
}

The number of employees and the first name should be passed in params, so that the script is not recompiled every time these values change.

But remember that this can be very heavy on your cluster, as Val already mentioned. You should narrow the set of documents to which the script is applied by adding filters to the function_score query (match_all in my example). In any case, this is not how Elasticsearch is meant to be used, and you can't expect good performance from such a hacked query.
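
As a rough illustration of that narrowing, here is a sketch in which match_all is replaced by a nested filter that keeps only documents with at least one employee named John, so the script runs on far fewer documents. It rests on several assumptions: that the match query applies lowercase_normalizer to the search term in 5.6, that Employee is always stored as an array in _source, and that "boost_mode": "replace" is wanted. The last point matters because a filter-only bool query scores 0, and the default multiply mode would then drop every hit below min_score.

```json
POST <your_index>/_search
{
  "min_score": 0.1,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": {
            "nested": {
              "path": "Employee",
              "query": {
                "match": { "Employee.Name": "John" }
              }
            }
          }
        }
      },
      "boost_mode": "replace",
      "functions": [
        {
          "script_score": {
            "script": {
              "source": """
                def employees = params._source.Employee;
                // A missing field or a wrong array size never matches.
                if (employees == null || employees.size() != params.nbEmployee) {
                  return 0;
                }
                // _source holds the original, non-normalized value.
                return employees.get(0).Name == params.name ? 1 : 0;
              """,
              "params": {
                "nbEmployee": 3,
                "name": "John"
              }
            }
          }
        }
      ]
    }
  }
}
```

The pre-filter only checks that some employee is called John; the script still decides whether it is the first one and whether the array has the expected size.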

Joe - GMapsBook.com
Pierre Mallet
  • I didn't care to mention this possibility as just because it's possible doesn't mean it's right to do so ;-) To me the right way to do it is to define an adequate mapping to support the use case(s). – Val Jul 10 '19 at 05:15
  • I agree with @Val but since it seems impossible to change mapping, here is the last hope :p – Pierre Mallet Jul 10 '19 at 08:01
  • Use cases change, needs evolve, hence it's not reasonable to set everything in stone, i.e. nothing's impossible ;-) – Val Jul 10 '19 at 08:40

1 - Filter documents where size of employee array is == 3

For the first problem, the best thing to do is to add another root-level field (e.g. NbEmployees) that contains the number of items in the Employee array so that you can use a range query and not a costly script query.

Then, whenever you modify the Employee array, you also update that NbEmployees field accordingly. Much more efficient!
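
A minimal sketch of what the query could look like once such a field exists. It assumes NbEmployees is mapped as an integer at the root of the document and kept in sync by the indexing code; neither the field nor its mapping is part of the original index. Since the condition here is an exact match, a term query in filter context is enough, while a range query would cover conditions such as "at least 3".

```json
POST <your_index>/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": { "NbEmployees": 3 }
      }
    }
  }
}
```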

2 - Filter documents where the first element of the array has "Name" == "John"

Regarding this one, you need to know that nested fields are separate (hidden) documents in Lucene, so there is no way to get access to all the nested docs at once in the same query.

If you know you need to check the first employee's name in your queries, just add another root-level field FirstEmployeeName and run your query on that one.
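
A sketch of the corresponding query, assuming FirstEmployeeName is a hypothetical root-level keyword field that uses the same lowercase_normalizer as Employee.Name and is filled in at indexing time. The match query is used on the assumption that it applies the normalizer to the search term in 5.6, so "John" is lowercased before comparison; this is worth verifying on your version.

```json
POST <your_index>/_search
{
  "query": {
    "bool": {
      "filter": {
        "match": { "FirstEmployeeName": "John" }
      }
    }
  }
}
```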

Val
  • Thank you Val for answering. Unfortunately I can't add new fields to the mapping, so I need to use a script. Regarding the second point, are you saying that there is no way to iterate over a list of nested objects in Painless? – betto86 Jul 08 '19 at 15:24
  • You can't add new fields because your mapping is defined with `dynamic: strict/false`? – Val Jul 08 '19 at 15:27
  • Nope, for internal policy in our systems – betto86 Jul 08 '19 at 15:29
  • As far as I know, there's no way to access the `_source` field in `script` queries. It was once possible, but that possibility has been removed as it was a big performance bottleneck, since all documents have to be evaluated through the script in order for the query to run. If you have a large document body, this can kill your performance and eventually your cluster as well. – Val Jul 08 '19 at 16:08