Is there a way to specify an Azure Cognitive Search analyzer that doesn't break on hyphens but will break on other punctuation

Question

The phrase "non-emergency" is semantically different than the two words in isolation "non" and "emergency".

In particular, searching for "emergency" should not match "non-emergency". However, it still makes sense to split up all other words by all other punctuation. E.g.

"In a situation that is a non-emergency, do not call 911."

The whitespace analyzer is not what I want, since I still want to break on other punctuation marks that don't carry as much (any?) semantic weight.

This seems like a perfectly common and reasonable use case that many people would want, but it doesn't seem to be available in Azure Cognitive Search (ACS).

This post seems to suggest that Lucene has it: Lucene Index problems with "-" character

I'm still struggling with installation of ACS, but in a few emails with MS people, I didn't get a satisfactory (simple) answer on how to do this. I know just enough about Lucene to know that this is what I want...

Thanks in advance.

score 0 · Answer 1 · answered Mar 08 '21 at 18:55

The post you shared seems to imply that the ClassicAnalyzer in Lucene is the solution you are looking for. While the classic analyzer isn't supported by default in Azure Cognitive Search, you should be able to create a custom analyzer that utilizes the ClassicTokenizer, which is supported and is probably the closest to what you are looking for.

Another option you may want to consider is the PatternAnalyzer, which is supported by Azure Cognitive Search, so you can define the regex pattern that works best for you.
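To make the regex option a bit more concrete, here is a minimal sketch rather than a tested configuration. It assumes a split pattern of `[^\w-]+` (break on anything that is neither a word character nor a hyphen), and the analyzer name `hyphen_preserving` is made up for illustration. Python's `re` is only used to preview the tokens locally; the service evaluates the pattern with Java regex, so results may differ for some inputs and should be confirmed against the service itself.

```python
import json
import re

# Assumption: split on any run of characters that is neither a word
# character nor a hyphen. This pattern is an illustration, not something
# prescribed by the Azure documentation.
SPLIT_PATTERN = r"[^\w-]+"


def preview_tokens(text):
    """Approximate a lowercasing pattern analyzer locally with Python's re."""
    return [token for token in re.split(SPLIT_PATTERN, text.lower()) if token]


# Hypothetical entry for the "analyzers" array of an index definition.
# The name "hyphen_preserving" is invented for this example.
analyzer_definition = {
    "name": "hyphen_preserving",
    "@odata.type": "#Microsoft.Azure.Search.PatternAnalyzer",
    "lowercase": True,
    "pattern": SPLIT_PATTERN,
}

if __name__ == "__main__":
    sentence = "In a situation that is a non-emergency, do not call 911."
    # "non-emergency" stays as one token; the comma and period are dropped.
    print(preview_tokens(sentence))
    print(json.dumps(analyzer_definition, indent=2))
```

With this pattern, "non-emergency" is kept as a single token, so a search for "emergency" would not match it on that field. One caveat: a hyphen surrounded by spaces would come out as a token of its own, so the pattern may need tightening for real data.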

This documentation should help with implementing both of these options, as well as trying many others: https://learn.microsoft.com/azure/search/index-add-custom-analyzers
