I am currently starting out with Elasticsearch. I've indexed a few EDIFACT messages (a prehistoric data format ;-)). The content looks something like this:
UNB+UNOA:2+SENDER+RECEIVER+170509:0050+152538'
UNH+66304+CODECO:D:95B:UN:ITG12'
BGM+34+INGATE OF UCN ABCD+9'
When I search for the phrase UNH+66304+CODECO:D:95B, it should return only one hit, but it seems to return every document that contains any of these words (and UNH is in every single document). My query is this:
curl -XGET --netrc-file ~/curl_user 'localhost:9200/edi/message/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query":{
"match":{"MESSAGE":"UNH+66304+CODECO:D:95B"}
}
}'
I've tried to add the "and" operator like this:
"match":{
"MESSAGE":{
"query":"UNH+66304+CODECO",
"operator": "and"
}
}
But then no results are returned. I've read the suggestion in "Searching for exact phrase" that I need to use double quotes. I've tried both "query":"'UNH+66304+CODECO'" and "query":"\"UNH+66304+CODECO\"" but it doesn't make a difference.
I have also tried match_phrase
"match_phrase":{
"MESSAGE":{
"query":"UNH+66304+CODECO"
}
}
does not return a result while
"match_phrase":{
"MESSAGE":{
"query":"UNH+66304"
}
}
does. With normal text it seems to work, but somehow Elasticsearch doesn't cope with characters such as + and : in the search string (which are unfortunately part of EDIFACT).
The question "How to make query_string search exact phrase in ElasticSearch" talks about using a different analyser if you want exact matches. Is that what I need here?
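One way to check what an analyser does with the + and : characters is to ask Elasticsearch for the tokens through the _analyze API. The sketch below only builds the request body and prints a matching curl command. The index name edi and the field MESSAGE come from the query above; the helper name analyze_request is made up for illustration, and the command assumes a node on localhost:9200 (add --netrc-file as in the search above if authentication is needed). It has not been run against a cluster here.

```python
import json


def analyze_request(text, field=None, analyzer=None):
    """Build a request body for the _analyze API (hypothetical helper)."""
    body = {"text": text}
    if field is not None:
        # Use the analyser configured for this field in the index mapping.
        body["field"] = field
    if analyzer is not None:
        # Or name an analyser explicitly, e.g. "standard" or "pattern".
        body["analyzer"] = analyzer
    return body


body = analyze_request("UNH+66304+CODECO:D:95B", field="MESSAGE")
print(
    "curl -XGET 'localhost:9200/edi/_analyze?pretty' "
    "-H 'Content-Type: application/json' -d '" + json.dumps(body) + "'"
)
```

The response lists each token with its position, which should show whether UNH+66304+CODECO ends up as the tokens the phrase query expects.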
Update: abhishek mishra confirmed that the analyser is probably the way to go. I am using Elasticsearch 5.4 and there are a lot of analysers to choose from: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
The Keyword Analyser would probably map to what abhishek suggested as the 'not analysed' as it is a noop Analyser. However I am a bit worried about using this as the messages can be quite long. What are the performance impacts for the search? If I use the Keyword Analyser will I still be able to search for parts of the whole message?
I am wondering whether the Pattern Analyser would be a good fit? EDIFACT messages consist of segments starting with 3 Upper Case Characters and are terminated by ' (but you can escape ' by prefixing it with ?)
FTX+AAA++It?'s a strange data format'
FTX+AAA++Yes it is'
So the example above would be two segments. If I used a pattern that splits the message on these segment terminators, would that be a good match?
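To make the idea concrete, here is a minimal sketch of such a pattern, tried out in plain Python rather than inside Elasticsearch. It splits on ' unless the character before it is the escape character ?. The names SEGMENT_TERMINATOR and split_segments are made up for this example. One known gap: a doubled ?? directly in front of a terminator is not handled by this simple lookbehind.

```python
import re

# Split on ' unless it is preceded by the escape character ?.
# Assumption: "??" directly before a terminator is not handled here.
SEGMENT_TERMINATOR = re.compile(r"(?<!\?)'")


def split_segments(message):
    """Return the non-empty segments of an EDIFACT message."""
    parts = SEGMENT_TERMINATOR.split(message)
    return [part.strip() for part in parts if part.strip()]


sample = "FTX+AAA++It?'s a strange data format'\nFTX+AAA++Yes it is'"
for segment in split_segments(sample):
    print(segment)
```

Java regular expressions also support lookbehind, so the same expression could presumably be used as the pattern of a pattern analyser, but I have not tested that. Each segment would then be one token, so only whole segments could be matched.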
The only problem is that the MESSAGE field can currently contain both EDIFACT messages and XML messages. I guess the same pattern analyser would not work for both, so I would have to create two different types depending on the content of the MESSAGE field (everything else is the same).
2nd Update: I have followed the advice to look into analysers. I thought the keyword analyser was probably not a good idea as the text can be quite long. I've found that the pattern analyser (without any custom pattern) works quite nicely. By default it splits on every non-word character, including : and +. Searches like
{
"query":{
"match_phrase":{"MESSAGE":"RFF+ABT:ATB150538080520172452"}
}
}
work now. The problem before was that e.g. RFF+ABT:ATB150538080520172452 was split up into [rff, abt:atb150538080520172452].
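For illustration, the behaviour of the default pattern analyser (split on non-word characters, then lowercase) can be approximated in a few lines of Python. This is only a rough model and not Elasticsearch code: the function names are made up, the sample document is stitched together from the lines quoted above, and a real match_phrase works on token positions in the index rather than on Python lists.

```python
import re


def pattern_analyze(text):
    """Approximate the default pattern analyser: split on non-word chars, lowercase."""
    return [token.lower() for token in re.split(r"\W+", text, flags=re.ASCII) if token]


def phrase_matches(document, phrase):
    """True if the phrase tokens appear as one contiguous run in the document tokens."""
    doc = pattern_analyze(document)
    query = pattern_analyze(phrase)
    return any(
        doc[i:i + len(query)] == query
        for i in range(len(doc) - len(query) + 1)
    )


# Made-up sample document built from lines shown earlier in this question.
document = "UNH+66304+CODECO:D:95B:UN:ITG12'\nRFF+ABT:ATB150538080520172452'"

print(pattern_analyze("RFF+ABT:ATB150538080520172452"))
print(phrase_matches(document, "UNH+66304+CODECO:D:95B"))
print(phrase_matches(document, "UNH+CODECO"))
```

With this tokenisation the reference number becomes its own token, which is why the phrase search can now line up the query tokens with the indexed ones.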