Anyone know how to filter misspellings from the suggest result set?
This query finds good suggestions but also returns partially corrected misspellings. For example, "comercial morgage" returns "commercial mortgage", which is good, but also "comercial mortgage", which is bad because the term "comercial" is still misspelled.
{
  "suggest" : {
    "text" : "comercial morgage",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "standard",
        "field" : "title.raw",
        "max_errors" : 0.8,
        "size" : 3,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        },
        "collate": {
          "query": {
            "match": {
              "title.raw" : "{{suggestion}}"
            }
          },
          "prune": true
        }
      }
    }
  }
}
This returns:
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": { ... },
  "suggest": {
    "simple_phrase": [
      {
        "text": "comercial morgage",
        "offset": 0,
        "length": 17,
        "options": [
          {
            "text": "commercial mortgage",
            "highlighted": "<em>commercial mortgage</em>",
            "score": 0.0025874644,
            "collate_match": true
          },
          {
            "text": "commercial mortgages",
            "highlighted": "<em>commercial mortgages</em>",
            "score": 0.0022214006,
            "collate_match": true
          },
          {
            "text": "comercial mortgage",
            "highlighted": "comercial <em>mortgage</em>",
            "score": 0.0019709675,
            "collate_match": true
          }
        ]
      }
    ]
  }
}
The collate_match for "comercial <em>mortgage</em>" is true even though this exact phrase does not appear in any document title.
The scores are all quite low and very close together, so I can't filter by score.
Currently it looks OK on the final page because I use a little JavaScript to show only results that are entirely surrounded by the <em> tag, but this is a hack and not very nice.
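For reference, the client-side hack is roughly the following. This is a simplified sketch rather than the exact code on the page: the function name fullyHighlighted is made up for this post, and it assumes the same pre_tag and post_tag as in the query above and that the highlight wraps the whole phrase when every term was changed, as in the response shown.

```javascript
// Hypothetical helper (name invented for this post): keep only the suggest
// options whose "highlighted" string is one highlight span covering the
// whole phrase, and drop options where only part of the phrase is tagged.
function fullyHighlighted(options, preTag = "<em>", postTag = "</em>") {
  return options.filter((option) => {
    const h = option.highlighted;
    return (
      h.startsWith(preTag) &&
      h.endsWith(postTag) &&
      // No second opening tag, so the single span covers everything.
      h.indexOf(preTag, preTag.length) === -1
    );
  });
}

// The three options from the response above, trimmed to the relevant fields.
const options = [
  { text: "commercial mortgage", highlighted: "<em>commercial mortgage</em>" },
  { text: "commercial mortgages", highlighted: "<em>commercial mortgages</em>" },
  { text: "comercial mortgage", highlighted: "comercial <em>mortgage</em>" },
];

// Only the first two options are kept; "comercial mortgage" is dropped.
const kept = fullyHighlighted(options).map((o) => o.text);
console.log(kept);
```

It only inspects the highlight markup, not the index, which is why I would prefer to do the pruning in the suggest request itself.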
The version of Elasticsearch is 1.5.3, but we will probably upgrade soon, so I can't use filters in a suggestion.
Does anyone know how to filter/prune any suggestions that do not exist in the title.raw field?
Thanks.