2

I'm working on a query based on name fields on Elasticsearch 2.4. The fields I'm interested in are:

  • state
  • city
  • colony

If I send this query:

    {"query": 
        {"bool" : 
            {"must" : [
                {"match" : {"state" : {"query" : "michoacán de ocampo", "type" : "boolean"} } }, 
                {"match" : {"colony" : {"query" : "zamora", "type" : "boolean"} } }, 
                {"match" : {"city" : {"query" : "zamora", "type" : "boolean"} } } 
            ], 
            "filter" : {"term" : {"state" : "michoacán"} } 
        } 
    } }

Results

{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [
            {
                "_id": "71807",
                "_index": "my_place",
                "_score": 8.708784,
                "_source": {
                    "@timestamp": "2019-11-13T15:34:33.373Z",
                    "@version": "1",
                    "city": "Zamora",
                    "city_id": 828,
                    "colony": "Balcones de Zamora",
                    "id": 71807,
                    "state": "Michoacán de Ocampo",
                    "state_id": 16,
                    "type": "place",
                    "zipcode": "59624",
                    "zone_id": null
                },
                "_type": "place"
            },
            {
                "_id": "71762",
                "_index": "my_place",
                "_score": 8.634264,
                "_source": {
                    "@timestamp": "2019-11-13T15:34:33.112Z",
                    "@version": "1",
                    "city": "Zamora",
                    "city_id": 828,
                    "colony": "Zamora de Hidalgo Centro",
                    "id": 71762,
                    "state": "Michoacán de Ocampo",
                    "state_id": 16,
                    "type": "place",
                    "zipcode": "59600",
                    "zone_id": null
                },
                "_type": "place"
            }
        ],
        "max_score": 8.708784,
        "total": 2
    },
    "timed_out": false,
    "took": 5
}

Which are OK

But if I sent the full name of the state in the filter, like this (note the full name "Michoacán de ocampo" in the filter)

    {"query": 
        {"bool" : 
            {"must" : [
                {"match" : {"state" : {"query" : "michoacán de ocampo", "type" : "boolean"} } }, 
                {"match" : {"colony" : {"query" : "zamora", "type" : "boolean"} } }, 
                {"match" : {"city" : {"query" : "zamora", "type" : "boolean"} } } 
            ], 
            "filter" : {"term" : {"state" : "Michoacán de Ocampo"} } 
        } 
    } }

I got these results:

{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": 0
    },
    "timed_out": false,
    "took": 6
}

I need to send the full name in the filter, how can I achieve this or reconfigure my index in order to have the same results?

sirandy
  • 1,834
  • 5
  • 27
  • 32
  • I've shared the postman collection link which you can easily import and test it easily, as it has sample index mapping, sample indexed doc and modified search query. – Amit Jan 27 '20 at 20:36

2 Answers2

2

Update : As OP mentioned in the comment that he is using 2.4, I am updating my solution to include the solution which works for it.

ES 2.4 solution

Index creation with required settings and mappings

{
    "settings": {
        "analysis": {
            "analyzer": {
                "lckeyword": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                }
            }
        }
    },
    "mappings": {
        "so": {
            "properties": {
                "state": {
                    "type": "string"
                },
                "city": {
                    "type": "string"
                },
                "colony": {
                    "type": "string"
                },
                "state_raw": {
                    "type": "string",
                    "analyzer": "lckeyword"
                }
            }
        }
    }
}

Search query

{
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "match": {
                                "state": {
                                    "query": "michoacán de ocampo"
                                }
                            }
                        },
                        {
                            "match": {
                                "colony": {
                                    "query": "zamora"
                                }
                            }
                        },
                        {
                            "match": {
                                "city": {
                                    "query": "zamora"
                                }
                            }
                        }
                    ]
                }
            },
            "filter": {
                "term": {
                    "state_raw": "michoacán de ocampo"
                }
            }
        }
    }
}

An important thing to note here is creating a custom analyzer(keyword with lowercase filter), so that field on which we are creating filter stored as it is but with small letter, as that is what you are passing in your query. Now above query returns you both your document, this is the postman collection that has index creation, sample docs creation and query which return both docs returned.

ES 7.X solution

The issue is that you are defining your state field as text field and then in your filter, you are using [term][1] query which is not analyzed as explained in official ES doc.

Returns documents that contain an exact term in a provided field.

Hence it would try to find token `Michoacán de Ocampo` in inverted index which isn't present as state field is defined as text and generates 3 tokens `michoacán`, `de` and `ocampo` and ES works on token(search term) to token(inverted index) match. You can check these tokens with [analyze API][2] and can use [explain API][3] to see the tokens generated by ES when the query has results

Fix
---
Define `state` field as a [multi-field][4] and store it as it is(kwyword form) so that you can filter on it.

    {
        "mappings": {
            "properties": {
                "state": {
                    "type": "text",
                    "fields": {
                        "raw": {
                            "type": "keyword"
                        }
                    }
                },
                "city": {
                    "type": "text"
                },
                "colony": {
                    "type": "text"
                }
            }
        }
    }

Now below query would give you both results.

    {
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            "state": {
                                "query": "michoacán de ocampo"
                            }
                        }
                    },
                    {
                        "match": {
                            "colony": {
                                "query": "zamora"
                            }
                        }
                    },
                    {
                        "match": {
                            "city": {
                                "query": "zamora"
                            }
                        }
                    }
                ],
                "filter": {
                    "term": {
                        "state.raw": "Michoacán de Ocampo" -->notice .raw to search on keyword field.
                    }
                }
            }
        }
    }

EDIT: - https://www.getpostman.com/collections/f4b9ed00d50e2f4bc7f4 is the postman collection link if you want to quickly test it.

Amit
  • 30,756
  • 6
  • 57
  • 88
  • I understand perfectly your answer, as you said the definition of the field state is this: `"state": { "type": "string" },` (as you can see I'm using a very old version of ES). Which could be the implication of changing this index definition? – sirandy Jan 28 '20 at 00:02
  • @sirandy which version of ES you are using? even in case of old version of ES like in 1.X instead of using multi-field, you can add one more field to store the keyword equivalent of state field which is `state.raw` in our case and then apply filter on it. Also adding new field isn't breaking change so you don't have to re-index the data again but old documents will not get the keyword tokens, so later on when you update these doc then they will have it and show up in your results. I have done this several times and it called as incremental changes, let me know if you need more info. – Amit Jan 28 '20 at 02:32
  • Interesting approach, the version of ES is 2.4. I did it in this way: `"state": {"type": "string", "fields": {"raw": {"type": "string"}}}` I put string in type because ES 2.4 doesn't support "keyword" as type, and then I executed the query in this way: `"filter": [ { "term": { "state.raw": "michoacán de ocampo" }} ]`. But no success :( – sirandy Jan 28 '20 at 16:33
  • @sirandy finally had to spend so much time to make it work for you :)... please have a look at my updated answer and postman collection – Amit Jan 29 '20 at 11:55
  • Ohh! an interesting approach! At first, I had problems because I forgot to update my documents with the new field `state_raw`. As long as I understand, I need to create that property to all documents and then reindex, right? – sirandy Jan 29 '20 at 16:05
  • 1
    @sirandy, yes you are right, you need to re-index after you create this property to relfect the changes in all your doc and gald finally it worked for u :-) – Amit Jan 29 '20 at 16:19
  • @sirandy going further please mention the version of ES as it created unwanted delay and I had to spend so much time to check it in older versions and even in queries there were some breaking changes :(.. normally people woudn't spend so much time, hence its advisable to give accurate information in first place, hope you understand my point :-) – Amit Jan 29 '20 at 16:23
  • For sure, I've edited the question and also the tags in order to provide more information. I really appreciate your effort and thanks for going beyond the call of duty. Maybe you already did this, but did you use Docker for this task? – sirandy Jan 29 '20 at 16:32
  • @sirandy thanks for editing the question, it would make it more searchable and helpful for community, no I didn't use docker for this task. – Amit Jan 29 '20 at 16:41
1

my guess is that the mapping of your state field is the default one, i.e., state is a text field, with a keyword sub-field (see dynamic field mapping).

If this is the case, then the filter of your first query "works" because it matches one of the tokens created by the default text analyzers. In fact, "Michoacán de Ocampo" is processed into these three lowercase tokens: ["michoacán", "de", "ocampo" ].

For the same reason, the second filter cannot match, because you are keeping the phrase "Michoacán de Ocampo" with the case. What should work is the following query:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "state": {
              "query": "michoacán de ocampo"
            }
          }
        },
        {
          "match": {
            "colony": {
              "query": "zamora"
            }
          }
        },
        {
          "match": {
            "city": {
              "query": "zamora"
            }
          }
        }
      ],
      "filter": {
        "term": {
          "state.keyword": "Michoacán de Ocampo"
        }
      }
    }
  }
}
glenacota
  • 2,314
  • 1
  • 11
  • 18