I am writing a helper that performs a number of transformations on an input string, in order to create a search-friendly representation of that string.
Think of the following scenario:
- Full text search on German or French texts
- The entries in your datastore contain
  1. Müller
  2. Großmann
  3. Çingletòn
  4. Bjørk
  5. Æreogramme
- The search should be fuzzy, in that
  1. ull, Üll etc. match Müller
  2. Gros, groß etc. match Großmann
  3. cin etc. match Çingletòn
  4. bjö, bjo etc. match Bjørk
  5. aereo etc. match Æreogramme
So far, I've been successful in cases (1), (3) and (4).
What I cannot figure out is how to handle (2) and (5).
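One property that sets cases (2) and (5) apart is that ß and Æ have no decomposition in the Unicode character data, whereas ü, Ç and ò decompose into a base letter plus a combining mark. The following is a small Python sketch (not Core Foundation code) that makes the difference visible; the helper name `canonical_decomposition` is made up for this illustration.

```python
import unicodedata

def canonical_decomposition(ch: str) -> str:
    """Return the NFD form of a character (unchanged if it has no decomposition)."""
    return unicodedata.normalize("NFD", ch)

# Characters taken from the example entries in the question.
for ch in "üÇòßÆ":
    nfd = canonical_decomposition(ch)
    status = "decomposes" if nfd != ch else "no decomposition"
    code_points = " ".join(f"U+{ord(c):04X}" for c in nfd)
    print(f"{ch}: {status} ({code_points})")
```

Under that assumption, no combination of decomposing and stripping marks can turn ß into ss or Æ into AE; those two need some other kind of mapping.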
I've tried the following methods, to no avail:
CFStringNormalize() // with all documented normalization forms
CFStringTransform() // using kCFStringTransformToLatin, kCFStringTransformStripCombiningMarks, kCFStringTransformStripDiacritics
CFStringFold() // using kCFCompareNonliteral, kCFCompareWidthInsensitive, kCFCompareLocalized in a number of combinations
Aside: how on earth do I normalize by simply _composing_ already decomposed strings? As soon as I add that step, my formerly passing tests fail as well.
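On the aside about composing: recombining a base letter with its combining marks is what Unicode normalization form C (NFC) does, and Core Foundation exposes it as kCFStringNormalizationFormC for CFStringNormalize(). Here is a minimal Python sketch of the same operation; the function name `compose` is hypothetical.

```python
import unicodedata

def compose(text: str) -> str:
    """Recombine base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)

decomposed = "Mu\u0308ller"   # "u" followed by COMBINING DIAERESIS (U+0308)
composed = compose(decomposed)

# The composed string is one code point shorter than the decomposed one.
print(len(decomposed), len(composed))
```

Whichever form is chosen, it presumably has to be applied to both the stored entries and the query, since a composed string and its decomposed twin do not compare equal code point by code point.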
I've skimmed over the ICU User Guide for Transforms but didn't invest too heavily in it…for what I think are obvious reasons.
I know that I could catch case (2) by transforming to uppercase and then back to lowercase, which would work within the realms of this particular application. I am, however, interested in solving this problem on a more fundamental level, hopefully allowing for case-sensitive applications as well.
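For illustration, one possible way to cover all five cases without the uppercase/lowercase round trip is to decompose, drop the combining marks, and then expand the remaining special letters from an explicit table. The Python sketch below assumes exactly that approach. It is not Core Foundation code, and the `EXPANSIONS` table, `search_key` and `matches` are hypothetical names with entries chosen only to cover the examples above.

```python
import unicodedata

# Hypothetical expansion table for letters that have no Unicode decomposition.
# The entries are assumptions picked to cover the examples in this question.
EXPANSIONS = {"ß": "ss", "Æ": "AE", "æ": "ae", "Ø": "O", "ø": "o"}

def search_key(text: str, case_sensitive: bool = False) -> str:
    """Build a search-friendly key for text."""
    # 1. Decompose so that base letters and combining marks are separate.
    decomposed = unicodedata.normalize("NFD", text)
    # 2. Drop combining marks (general category Mn), keeping the base letters.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # 3. Expand letters that passed through steps 1 and 2 unchanged.
    expanded = "".join(EXPANSIONS.get(c, c) for c in stripped)
    # 4. Fold case only when a case-insensitive key is wanted.
    return expanded if case_sensitive else expanded.casefold()

def matches(query: str, entry: str) -> bool:
    """Substring match on the case-insensitive keys of query and entry."""
    return search_key(query) in search_key(entry)

for query, entry in [("Üll", "Müller"), ("groß", "Großmann"), ("cin", "Çingletòn"),
                     ("bjö", "Bjørk"), ("aereo", "Æreogramme")]:
    print(query, entry, matches(query, entry))
```

Because the expansion happens before the optional case fold, the same table also serves a case-sensitive key: `search_key("Großmann", case_sensitive=True)` keeps the capital G while still replacing ß.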
Any hints would be greatly appreciated!