Using Elasticsearch 7, I'm trying to use a simple query string query for searches over different fields, both text and keyword. Here's a minimal, reproducible example showing the initial setup and the problem:
mapping.json:
{
  "dynamic": false,
  "properties": {
    "publicId": {
      "type": "keyword"
    },
    "eventDate": {
      "type": "date",
      "format": "yyyy-MM-dd",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "name": {
      "type": "text"
    }
  }
}
test-data1.json:
{
  "publicId": "a1b2c3",
  "eventDate": "2022-06-10",
  "name": "Research & Development"
}
test-data2.json:
{
  "publicId": "d4e5f6",
  "eventDate": "2021-05-11",
  "name": "F.inance"
}
Create the index on ES running on localhost:19200 and load the test data:
#!/bin/bash -e
host=${1:-localhost:19200}
dir=$(dirname "$(readlink -f "$0")")
mapping=$(<"${dir}/mapping.json")
param="{ \"mappings\": ${mapping} }"
curl -XPUT "http://${host}/test/" -H 'Content-Type: application/json' -d "$param"
curl -XPOST "http://${host}/test/_doc/a1b2c3" -H 'Content-Type: application/json' -d @"${dir}/test-data1.json"
curl -XPOST "http://${host}/test/_doc/d4e5f6" -H 'Content-Type: application/json' -d @"${dir}/test-data2.json"
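After loading, a quick sanity check can be run against the cluster (this assumes the setup above on localhost:19200; the _refresh call just makes the freshly indexed documents immediately searchable):

```sh
# Make the new documents visible to search without waiting for the refresh interval
curl -XPOST "http://localhost:19200/test/_refresh"
# Fetch one document by its keyword field to confirm the index is populated
curl -XGET "http://localhost:19200/test/_search?q=publicId:a1b2c3"
```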
Now the task is to support searches like "Research & Development", "Research & Development 2022-06-10", "Finance" (note the removed dot), or simply "a1b2c3". For example, using a query like this:
{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "query": "Research & Development 2022-06-10",
            "fields": [
              "publicId^1.0",
              "eventDate.keyword^1.0",
              "name^1.0"
            ],
            "flags": -1,
            "default_operator": "and",
            "analyze_wildcard": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "fuzzy_transpositions": true,
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "version": true
}
The problem with this setup is that the standard analyzer for the text field, which removes most punctuation, of course also removes the ampersand. The simple query string query splits the query into three tokens [research, &, development] and searches over all fields using the and operator. There are two matches ("Research" and "Development") for the name text field, but no match for the ampersand in any field. Thus the result is empty.
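This can be verified with the _analyze API (again assuming a cluster on localhost:19200; the token lists in the comments are what the standard and whitespace analyzers produce for this input):

```sh
# Standard analyzer: lowercases and strips punctuation -> [research, development]
curl -XPOST "http://localhost:19200/_analyze" -H 'Content-Type: application/json' -d '{
  "analyzer": "standard",
  "text": "Research & Development"
}'
# Whitespace analyzer: splits on whitespace only -> [Research, &, Development]
curl -XPOST "http://localhost:19200/_analyze" -H 'Content-Type: application/json' -d '{
  "analyzer": "whitespace",
  "text": "Research & Development"
}'
```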
Now I came up with a solution: add a second field for name with a different analyzer, the whitespace analyzer, which doesn't remove punctuation:
{
  "dynamic": false,
  "properties": {
    "publicId": {
      "type": "keyword"
    },
    "eventDate": {
      "type": "date",
      "format": "yyyy-MM-dd",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "name": {
      "type": "text",
      "fields": {
        "whitespace": {
          "type": "text",
          "analyzer": "whitespace"
        }
      }
    }
  }
}
This way all searches work, including "Finance", which matches "F.inance" on the name field. "Research & Development" matches on the name field and on name.whitespace, but most crucially "&" matches on name.whitespace and therefore returns a result.
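For reference, a trimmed-down version of the query above with the new subfield added to fields (the remaining parameters stay as in the original query):

```json
{
  "query": {
    "simple_query_string": {
      "query": "Research & Development",
      "fields": [
        "publicId",
        "eventDate.keyword",
        "name",
        "name.whitespace"
      ],
      "default_operator": "and"
    }
  }
}
```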
My question now is: given that the real setup includes many more fields and a lot of data, adding an additional field and thereby indexing most terms twice seems quite heavy. Is there a way to index into name.whitespace only those analyzed terms that differ from the standard analyzer's terms for name, i.e. terms that are not in the "parent" field? E.g. "Research & Development" results in the terms [research, development] for name and [research, development, &] for name.whitespace - ideally only [&] would be indexed for name.whitespace.
Or is there a more elegant/performant solution to this particular problem altogether?