I'm trying to configure Elasticsearch (version 6.4) for full-text search over documents that may contain chemical names, using a set of chemical synonyms. The synonym terms can:
- be multi-word (i.e. contain spaces)
- contain hyphens
- contain parentheses
- contain commas
Can anyone help me come up with a configuration that meets these requirements?
The index config I have at the moment looks like this:
PUT /documents
{
  "settings": {
    "analysis": {
      "analyzer": {
        "chemical_synonyms": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "chem_synonyms"]
        },
        "lower": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      },
      "filter": {
        "chem_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "N\\,N-Bis(2-hydroxyethyl)amine, Niax DEOA-LF, 111-42-2"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "english": {
              "type": "text",
              "analyzer": "english"
            },
            "raw": {
              "type": "text",
              "analyzer": "lower"
            }
          }
        }
      }
    }
  }
}
This config contains a single line of Solr-style synonyms. In reality there are more, and they come from a file, but the gist is the same.
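For reference, a file-based version of the filter would look roughly like the sketch below. The file name `analysis/chem_synonyms.txt` is a placeholder rather than my real path; `synonyms_path` is resolved relative to the Elasticsearch config directory, and the file holds one Solr-style synonym line per row.

```
# Sketch only: the path below is a hypothetical placeholder
"filter": {
  "chem_synonyms": {
    "type": "synonym_graph",
    "synonyms_path": "analysis/chem_synonyms.txt"
  }
}
```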
Assume I have three documents:
PUT /documents/doc/1
{"text": "N,N-Bis(2-hydroxyethyl)amine"}
PUT /documents/doc/2
{"text": "Niax DEOA-LF"}
PUT /documents/doc/3
{"text": "111-42-2"}
If I run a search using this config:
POST /documents/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "default_operator": "AND",
            "type": "cross_fields",
            "query": "\"N,N-Bis(2-hydroxyethyl)amine\""
          }
        },
        {
          "query_string": {
            "default_operator": "AND",
            "default_field": "*.raw",
            "analyzer": "chemical_synonyms",
            "query": "\"N,N-Bis(2-hydroxyethyl)amine\""
          }
        }
      ]
    }
  }
}
I would expect this to match all three documents, but it currently does not match document 2. Changing the query to "111-42-2" also fails to match document 2. Searching for "Niax DEOA-LF" correctly matches all three.
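For anyone trying to reproduce this, the `_analyze` API shows the tokens each side produces, which may help narrow down where the mismatch is. This is a minimal sketch, assuming the index above has been created as shown: the first request runs the search-time analyzer over one synonym term, and the second runs the analyzer of the `text.raw` field over the text of document 2.

```
# Tokens the search-time synonym analyzer emits for a query term
POST /documents/_analyze
{
  "analyzer": "chemical_synonyms",
  "text": "N,N-Bis(2-hydroxyethyl)amine"
}

# Tokens stored in the text.raw sub-field for document 2
POST /documents/_analyze
{
  "field": "text.raw",
  "text": "Niax DEOA-LF"
}
```

Comparing the two token lists (token text and positions) should show whether the expanded synonym tokens can line up with what is actually indexed in `text.raw`.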
How can I change my index config, my search query, or both, so that a search for any one of these synonym terms matches every document containing any of the other synonym terms? Normal full-text search must also continue to work, so the changes can't break standard searching for non-synonym terms.