Search for exact phrase with Elasticsearch

Question

I am currently starting out with Elasticsearch. I've indexed a few EDIFACT messages (a pre-historic data format ;-)). The content looks something like this:

UNB+UNOA:2+SENDER+RECEIVER+170509:0050+152538'
UNH+66304+CODECO:D:95B:UN:ITG12'
BGM+34+INGATE OF UCN ABCD+9'

When I search for the phrase UNH+66304+CODECO:D:95B it should return only one hit, but it seems to return all documents that contain any of these words (and UNH is in every single one of them). My query is this:

curl -XGET --netrc-file ~/curl_user  'localhost:9200/edi/message/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query":{
        "match":{"MESSAGE":"UNH+66304+CODECO:D:95B"}
    }
}'

I've tried to add the "and" operator like this:

"match":{
    "MESSAGE":{
        "query":"UNH+66304+CODECO",
        "operator":"and"
    }
}

But then no results are returned. I've read the suggestion in "Searching for exact phrase" that I need to use double quotes. I've tried both "query":"'UNH+66304+CODECO'" and "query":"\"UNH+66304+CODECO\"" but it doesn't make a difference.

I have also tried match_phrase:

"match_phrase":{
    "MESSAGE":{
        "query":"UNH+66304+CODECO"
    }
}

does not return a result, while

"match_phrase":{
    "MESSAGE":{
        "query":"UNH+66304"
    }
}

does. With normal text it seems to work, but somehow Elasticsearch doesn't like the +, : etc. in the search string (which are unfortunately part of EDIFACT).

The question "How to make query_string search exact phrase in ElasticSearch" talks about using a different analyser if you want exact matches. Is that the way to go?

Update: abhishek mishra confirmed that the Analyser is probably the way to go. I am using Elasticsearch 5.4 and there are a lot of Analysers to choose from: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

The Keyword Analyser would probably map to what abhishek suggested as the 'not analysed' as it is a noop Analyser. However I am a bit worried about using this as the messages can be quite long. What are the performance impacts for the search? If I use the Keyword Analyser will I still be able to search for parts of the whole message?

I am wondering whether the Pattern Analyser would be a good fit? EDIFACT messages consist of segments starting with 3 Upper Case Characters and are terminated by ' (but you can escape ' by prefixing it with ?)

FTX+AAA++It?'s a strange data format'
FTX+AAA++Yes it is'

So the example above would be two segments. If I used a pattern that splits the message into these segments, would that be a good match?
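To make the segment rule concrete, here is a minimal Python sketch of splitting on an apostrophe that is not preceded by the ? escape character. The helper name split_segments and the regular expression are assumptions for illustration only and have not been tested as an Elasticsearch analyser pattern. The sketch also ignores the case of an escaped question mark (??) directly before a terminator.

```python
import re
from typing import List

# Assumption: a segment ends at an apostrophe unless that apostrophe is
# escaped by a preceding "?". An escaped question mark ("??") right before
# a terminator is NOT handled by this simplified pattern.
SEGMENT_TERMINATOR = re.compile(r"(?<!\?)'")


def split_segments(message: str) -> List[str]:
    """Split an EDIFACT message into its segments (hypothetical helper)."""
    parts = SEGMENT_TERMINATOR.split(message)
    # Drop surrounding whitespace/newlines and the empty tail after the
    # final terminator.
    return [part.strip() for part in parts if part.strip()]


sample = "FTX+AAA++It?'s a strange data format'\nFTX+AAA++Yes it is'"
for segment in split_segments(sample):
    print(segment)
```

Note that if a pattern like this were used in a pattern analyser, each whole segment would presumably become a single token, so a search for only part of a segment would probably still not match.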

The only problem is that currently the MESSAGE field can contain both EDIFACT and XML messages. Using the same Pattern Analyser for both would not work, I guess, so I would have to create two different types depending on the content of the MESSAGE field (all the rest is the same).

2nd Update: I have followed the advice to look into analysers. I thought the keyword analyser is probably not a good idea as the text can be quite long. I've found that the pattern analyser (without any custom pattern) works quite nicely. It splits up everything on : and +. Searches like

{
    "query":{
        "match_phrase":{"MESSAGE":"RFF+ABT:ATB150538080520172452"}
    }
}

work now. The problem before was that e.g. RFF+ABT:ATB150538080520172452 was split up into [rff, abt:atb150538080520172452].

score 1 · Answer 1 · answered Jun 09 '17 at 04:12


You were on the right track about the analyzer. If you look into your type mapping, the property MESSAGE is probably marked as analyzed. This is why when indexing it's getting rid of the special characters. You need to mark it as not_analyzed.

If you let us know what your type mapping looks like I can help you with the correct setting.

One of the examples -

If your ES version is < 5.0 and your type mapping looks similar to this -

{
  "MESSAGE": {
    "type": "string",
    "index": "analyzed"
  }
}

change it to

{
  "MESSAGE": {
    "type": "string",
    "index": "not_analyzed"
  }
}
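For Elasticsearch 5.x, the string type was split into text (analysed) and keyword (not analysed), so the mapping above would look different there. The following Python sketch only builds and prints two candidate index-creation bodies; it does not talk to a cluster. The index name edi and the type name message are assumptions taken from the search URL in the question, and which option fits depends on how the field needs to be searched.

```python
import json

# Assumption: type name "message" comes from the URL in the question
# (localhost:9200/edi/message/_search); adjust it to your own setup.

# Option A: the 5.x counterpart of a pre-5.0 not_analyzed string.
# The whole MESSAGE value is indexed as one exact term.
keyword_mapping = {
    "mappings": {
        "message": {
            "properties": {
                "MESSAGE": {"type": "keyword"}
            }
        }
    }
}

# Option B: an analysed full-text field that uses the built-in
# "pattern" analyzer instead of the default standard analyzer.
pattern_mapping = {
    "mappings": {
        "message": {
            "properties": {
                "MESSAGE": {"type": "text", "analyzer": "pattern"}
            }
        }
    }
}

# Either body could be sent when creating the index, e.g. PUT localhost:9200/edi
print(json.dumps(keyword_mapping, indent=2))
print(json.dumps(pattern_mapping, indent=2))
```

With option A a query only matches the complete field value, so it is probably not suitable when you want to find part of a long message.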

answered Jun 09 '17 at 04:12

ab m


Thanks for the suggestion. I am using Elasticsearch 5.4 and it seems the API has changed and there are a lot more Analysers to choose from now. I am going to update my question regarding the analyser. – Ben Jun 09 '17 at 07:34
Great. So, does it work now? If yes, then I believe you can provide answer to your own question. It'll be useful for others. – ab m Jun 09 '17 at 20:11

score 1 · Accepted Answer · answered Jun 12 '17 at 08:04

The solution was to use the pattern analyser. Without having to configure it further (no custom pattern specified) it breaks up the EDIFACT message along non-word/number characters.

The problem with the standard analyser was that it behaved oddly with ':'. For example, if you had RFF+ATB:AB12345, it broke it up into [rff, atb:ab12345], so a search for ab12345 did not return anything.
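As a rough illustration of why the pattern analyser helps, here is a small Python sketch that approximates its default behaviour (split on runs of non-word characters, then lowercase) and checks whether a phrase's tokens occur consecutively in a document. This is only an approximation written for this post, not the actual Lucene implementation, and the function names are made up.

```python
import re
from typing import List

# Approximation of the pattern analyzer defaults: split on \W+ and lowercase.
# re.ASCII keeps \W limited to non [a-zA-Z0-9_] characters.
NON_WORD = re.compile(r"\W+", flags=re.ASCII)


def pattern_tokens(text: str) -> List[str]:
    """Tokenize roughly like the default pattern analyzer (hypothetical helper)."""
    return [tok for tok in NON_WORD.split(text.lower()) if tok]


def contains_phrase(document: str, phrase: str) -> bool:
    """True if the phrase tokens appear as a consecutive run in the document."""
    doc, query = pattern_tokens(document), pattern_tokens(phrase)
    if not query:
        return False
    return any(
        doc[i:i + len(query)] == query
        for i in range(len(doc) - len(query) + 1)
    )


print(pattern_tokens("RFF+ATB:AB12345"))  # ['rff', 'atb', 'ab12345']

segment = "UNH+66304+CODECO:D:95B:UN:ITG12'"
print(contains_phrase(segment, "UNH+66304+CODECO:D:95B"))  # True
print(contains_phrase(segment, "UNH+66305+CODECO"))        # False
```

Because ':' and '+' are both treated as separators here, ab12345 ends up as its own token and the phrase from the question lines up token by token.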

You can test how an analyser or tokenizer works by using

curl -XPOST --netrc-file ~/curl_user 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text":      "UNB+UNOA:2+SENDER+RECEIVER+170513:0452+129910165"
}'

You can replace "analyzer" with "tokenizer" if you just want to test the tokenizer on its own.

score 0 · Answer 3 · answered Jun 08 '17 at 18:46


I think you have "query" and "match_phrase" inverted:

Can you try it like this:

{
    "query": {
        "match_phrase": {
            "MESSAGE": "UNH+66304"
        }
    }
}

answered Jun 08 '17 at 18:46

ugosan


If you look at my first example you will see that. The other code examples omitted the first "query" to shorten the post (but it was still used in testing). You can have a 'second' "query" if you want to specify more (e.g. the operator). – Ben Jun 09 '17 at 07:15
