String comparison that ignores special unicode charaters

Question

I have a table with city names in various languages. These names often contain characters such as é, ü etc. I am building a search function so that a user can search for a name, but I would like this to ignore all special characters, and treat them as letters from the English 26-letter alphabet instead.

For instance, a query for the name munchen should be matched with the value München. The query converts both sides of the comparison to case-insensitive 26-letter alphabet, and then performs the comparison.

Is there a straightforward way to achieve this?

There are no special characters in Unicode. Especially not the plain-old Latin1 characters. What are you *really* trying to do? Converting everything to English is an attempt to solve that, but it won't work except for some very limited cases. — Panagiotis Kanavos, Mar 13 '19 at 08:50
@PanagiotisKanavos correct, I didn't even think about that.... — Zohar Peled, Mar 13 '19 at 08:52
If you use an accent-insensitive collation for example, then Munchen could match München. SQL Server's Full Text Search feature can already handle such matches, no matter the collation. Any search engine for that matter can handle this — Panagiotis Kanavos, Mar 13 '19 at 08:53
SO itself is an ASP.NET site storing questions and answers in `nvarchar` fiels in SQL Server. It doesn't need any special handling for any kind of text, like Αυτό Εδώ. Unicode just works in both .NET and SQL Server. — Panagiotis Kanavos, Mar 13 '19 at 08:56
In fact, there's a German User whose user name is [πάντα ῥεῖ](https://stackoverflow.com/users/1413395/%CF%80%CE%AC%CE%BD%CF%84%CE%B1-%E1%BF%A5%CE%B5%E1%BF%96). Well, I don't have the time to install a Greek polytonic keyboard, or remember the correct accents, so I just searched for `παντα ρει`, no accents whatsoever and I still got back a list of questions, answers and comments where he's involved — Panagiotis Kanavos, Mar 13 '19 at 08:57
@PanagiotisKanavos I do not really understand your question. What I am trying to do is a string comparison that does not take diacritics into account. — Lee White, Mar 13 '19 at 08:58
@LeeWhite There are no special characters. What you ask works OOTB. Use an accent-insensitive collation for your field and you won't have to change anything. That will cover equality or `LIKE` searches, although `LIKE '%...%'` will always be slow. If you want to search arbitrary text, use SQL Server's Full text search. Or add another search engine like ElasticSearch — Panagiotis Kanavos, Mar 13 '19 at 08:59
@LeeWhite check [Collation and Unicode Support](https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017). Apart from the UTF8 collation, everything else holds for all SQL Server versions. — Panagiotis Kanavos, Mar 13 '19 at 09:03
I am currently trying with `SQL_Latin1_General_CP1_CI_AS`, and am not getting the expected result. A string compare between `Munchen` and `München` returns false. — Lee White, Mar 13 '19 at 09:10
`AS` means accent *sensitive*. You should use AI, Accent *in*sensitive. The documentation shows how collation is used and how you can override it with the `COLLATE` keyword. Using `COLLATE` will prevent the use of indexes that don't match this collation though - collation affects ordering, not just matching, and indexes depend that. — Panagiotis Kanavos, Mar 13 '19 at 09:57

score 2 · Answer 1 · edited Mar 17 '19 at 11:02

Those characters aren't special in any way. Matching and ordering is affected by a column's collation - this specifies the sorting order and which characters match.

A collation can be case-sensitive or not, which specifies whether Mary is considered equal to mary. It can also be accent sensitive or not, which specifies whether Munchen is equal to München or not.

To match Munchen with München, all that's needed is to use an Accent Insensitive collation for that column, eg :

create table Cities (
    id int identity primary key,
    Name nvarchar(200) COLLATE Latin1_General_CI_AI ,
    INDEX IX_Cities_NAME (Name)
)

insert into Cities (Name)
values ('London'), ('München'), ('Munchen')

select * 
from Cities 
where Name = N'mUnchen'

The result is :

Name
--------
München
Munchen

I didn't make Name a primary key because I wouldn't be able to insert both München and Munchen, I'd get a primary key violation

String comparison that ignores special unicode charaters

1 Answers1