
Using Elasticsearch 7, I'm trying to use a simple query string query for searches over different fields, both text and keyword. Here's a minimal, reproducible example showing the initial setup and the problem:

mapping.json:

{
    "dynamic": false,
    "properties": {
        "publicId": {
            "type": "keyword"
        },
        "eventDate": {
            "type": "date",
            "format": "yyyy-MM-dd",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "name": {
            "type": "text"
        }
    }
}

test-data1.json:

{
    "publicId": "a1b2c3",
    "eventDate": "2022-06-10",
    "name": "Research & Development"
}

test-data2.json:

{
    "publicId": "d4e5f6",
    "eventDate": "2021-05-11",
    "name": "F.inance"
}

Create index on ES running on localhost:19200:

#!/bin/bash -e

host=${1:-localhost:19200}
dir=$(dirname "$(readlink -f "$0")")

mapping=$(<"${dir}/mapping.json")

param="{ \"mappings\": ${mapping} }"

curl -XPUT "http://${host}/test/" -H 'Content-Type: application/json' -d "$param"
curl -XPOST "http://${host}/test/_doc/a1b2c3" -H 'Content-Type: application/json' -d @"${dir}/test-data1.json"
curl -XPOST "http://${host}/test/_doc/d4e5f6" -H 'Content-Type: application/json' -d @"${dir}/test-data2.json"

Now the task is to support searches like "Research & Development", "Research & Development 2022-06-10", "Finance" (note the removed dot) or simply "a1b2c3", for example using a query like this:

{
    "from": 0,
    "size": 20,
    "query": {
        "bool": {
            "must": [
                {
                    "simple_query_string": {
                        "query": "Research & Development 2022-06-10",
                        "fields": [
                            "publicId^1.0",
                            "eventDate.keyword^1.0",
                            "name^1.0"
                        ],
                        "flags": -1,
                        "default_operator": "and",
                        "analyze_wildcard": false,
                        "auto_generate_synonyms_phrase_query": true,
                        "fuzzy_prefix_length": 0,
                        "fuzzy_max_expansions": 50,
                        "fuzzy_transpositions": true,
                        "boost": 1.0
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
        }
    },
    "version": true
}

The problem with this setup is that the standard analyzer used for the text field removes most punctuation, which of course includes the ampersand. The simple query string query splits the query into the tokens [research, &, development] and searches over all fields using the and operator. "Research" and "Development" both match the name text field, but "&" matches nothing in any field. Thus the result is empty.
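This tokenization can be confirmed with the `_analyze` API (a sketch against the test index created above; note the ampersand is missing from the returned tokens):

```sh
curl -XPOST "http://localhost:19200/test/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{ "analyzer": "standard", "text": "Research & Development" }'
```

The response lists only the tokens `research` and `development`; swapping in `"analyzer": "whitespace"` shows the ampersand preserved as its own token.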

Now I came up with a solution: add a second sub-field for name with a different analyzer, the whitespace analyzer, which doesn't remove punctuation:

{
    "dynamic": false,
    "properties": {
        "publicId": {
            "type": "keyword"
        },
        "eventDate": {
            "type": "date",
            "format": "yyyy-MM-dd",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "name": {
            "type": "text",
            "fields": {
                "whitespace": {
                    "type": "text",
                    "analyzer": "whitespace"
                }
            }
        }
    }
}

This way all searches work, including "Finance", which matches "F.inance" via the name field. "Research & Development" matches both name and name.whitespace, but most crucially "&" matches name.whitespace and therefore returns a result.
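For reference, the only change to the query above is that the new sub-field is added to the fields list:

```json
"fields": [
    "publicId^1.0",
    "eventDate.keyword^1.0",
    "name^1.0",
    "name.whitespace^1.0"
]
```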

My question now is: given that the real setup includes many more fields and a lot of data, adding an additional field and thereby indexing most terms twice in the same way seems quite heavy. Is there a way to index only those terms into name.whitespace that differ from the standard analyzer's terms for name, i.e. that are not in the "parent" field? E.g. "Research & Development" results in the terms [research, development] for name and [research, development, &] for name.whitespace - ideally only [&] would be indexed into name.whitespace.

Or is there a more elegant/performant solution for this particular problem altogether?

msp

1 Answer


I guess you can define a dynamic template for all string fields that applies the whitespace analyzer, since your use case requires searching on non-standard tokens. In addition, you can explicitly map those fields where you don't need the whitespace tokenizer.

This would ensure that already mapped fields are analyzed with the standard tokenizer while others (dynamic or unmapped fields) are analyzed with whitespace, reducing complexity, field duplication, etc.
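A mapping along these lines might look like this (a sketch; the template name `strings_as_whitespace` is made up, and note that `dynamic` must not be `false` as in the question's mapping, otherwise unmapped fields are never added):

```json
{
    "dynamic": true,
    "dynamic_templates": [
        {
            "strings_as_whitespace": {
                "match_mapping_type": "string",
                "mapping": {
                    "type": "text",
                    "analyzer": "whitespace"
                }
            }
        }
    ],
    "properties": {
        "publicId": { "type": "keyword" },
        "name": { "type": "text" }
    }
}
```

Here publicId and name keep their explicit mappings, while any new string field picked up dynamically is indexed with the whitespace analyzer.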

Ayush
  • Thanks for the answer, not sure if I follow you correctly: how would the mapping look like in the example I posted in my question? For both `name` and `eventDate` fields? – msp Feb 09 '23 at 10:33
  • I guess there is another way to do this. For indexing, you can still use standard analyzer (default, no need to explicitly set) and for searching, you can set `search_analyzer` for your field as `whitespace`. See: https://www.elastic.co/guide/en/elasticsearch/reference/7.0/search-analyzer.html – Ayush Feb 09 '23 at 12:21
  • Thanks, I just tried it but that's unfortunately not a solution for my problem. `whitespace` will keep the ampersand in the search query and will bring up an empty search result. Also it *is* necessary to explicitly set an analyzer if you provide a `search_analyzer`, otherwise you'll get the `analyzer on field [name] must be set when search_analyzer is set` error – msp Feb 09 '23 at 13:06