Index or scan?
In your example, I assume word
was indexed? And the table is rather large? In that case, an index is very powerful for finding the row(s) that match for the specified collation. Any other collation (your case with COLLATE ...
) requires ignoring the index and scanning the entire table (slow).
Using REGEXP
will also require a full table scan.
Doing REPLACE(...)
on the fly will require a full table scan.
So, given that there are "too many" rows to allow for a full table scan, there is no simple way to search for sır
or sir
for arbitrary collations. Let's look for another way...
First, let's look at some collation cases.
In most utf8 collations, including utf8_unicode_ci, I=i=Ì=Í=Í=Î=Ï=ì=í=í=î=ï=Ī=ī=Į=į=İ
In utf8_turkish_ci, I=ı, but they come before i=Ì=Í=Í=Î=Ï=ì=í=í=î=ï=Ī=ī=Į=į=İ (and hence, explains part of your problem).
Most other utf8 collations treat ı as coming between i and j.
utf8_general_ci is slightly different: I=i=Ì=Í=Í=Î=Ï=ì=í=í=î=ï=Ī=ī=Į=į=İ=ı
utf8_icelandic_ci treats some accented i's as a different letter: I=i=Ì=Î=Ï=ì=î=ï=Ī=ī=Į=į=İ, but Í=Í=í=í fall between I and J. That one also sorts these after Z: Ä=Æ=ä=æ Ö=Ø=ö=ø Å=å
See my Reference
Another problem: Accented letters can be presented as two utf8 characters - the letter, then a "non-spacing mark"
So, what to do??
Plan A: Build your own collation. This takes some work, but is not impossible. This would be optimal, giving you the best performance. As we will see below, utf8_unicode_520_ci comes very close, and could be a starting point.
Plan B: Add another column in your table; let's call it wordx
. It would have word
, but will tall accents stripped. Then index wordx
instead of word
. So the row with word='sır' would have wordx='sir'.
Plan C: Using the equivalences discussed above, pick one type if "I" from each group for the collation defined in the table, then use utf8_unicode_520_ci
with UNION
:
( SELECT ... WHERE word = 'sir' )
UNION ALL
( SELECT ... WHERE word = 'sır' )
This should catch all flavors of i
.
Oops, what about words with multiple different accentable letters? At first glance, maybe it is not a problem -- all a
collate equal in utf8_unicode_520_ci. Glancing through the rest of the letters, I don't see anything other than ı
giving a problem.
Here are the interesting equivalences for utf8_unicode_520_ci
:
A=a=ª=À=Á=Á=Â=Ã=Ä=Å=à=á=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae=Æ=æ az B=b C=c=Ç=ç=Č=č ch cz
D=d=Ð=ð=Ď=ď dz E=e=È=É=É=Ê=Ë=è=é=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě F=f fz ƒ
G=g=Ğ=ğ=Ģ=ģ gz H=h hz I=i=Ì=Í=Í=Î=Ï=ì=í=í=î=ï=Ī=ī=Į=į=İ ij=ij iz ı J=j
K=k=Ķ=ķ L=l=Ĺ=ĺ=Ļ=ļ=Ł=ł lj=LJ=Lj=lj ll lz M=m N=n=Ñ=ñ=Ń=ń=Ņ=ņ=Ň=ň nz
O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=Ø=ò=ó=ó=ô=õ=ö=ø oe=Œ=œ oz P=p Q=q R=r=Ř=ř S=s=Ş=ş=Š=Š=š=š
sh ss=ß sz T=t=Ť=ť TM=tm=™ tz U=u=Ù=Ú=Ú=Û=Ü=ù=ú=ú=û=ü=Ū=ū=Ů=ů=Ų=ų ue uz V=v
W=w X=x Y=y=Ý=Ý=ý=ý=ÿ=Ÿ yz Z=z=Ž=Ž=ž=ž zh zz Þ=þ µ
If you consider ƒ
to be an accented letter, then you might want to deal with it, too.
Your question is interesting.