Folding/Normalizing Ligatures (e.g. Æ to ae) Using (Core)Foundation

Question

I am writing a helper that performs a number of transformations on an input string, in order to create a search-friendly representation of that string.

Think of the following scenario:

Full text search on German or French texts
The entries in your datastore contain
1. Müller
2. Großmann
3. Çingletòn
4. Bjørk
5. Æreogramme
The search should be fuzzy, in that
1. ull, Üll etc. match Müller
2. Gros, groß etc. match Großmann
3. cin etc. match Çingletòn
4. bjö, bjo etc. match Bjørk
5. aereo etc. match Æreogramme

So far, I've been successful in cases (1), (3) and (4).

What I cannot figure out is how to handle (2) and (5).

So far, I've tried the following functions to no avail:

CFStringNormalize() // with all documented normalization forms
CFStringTransform() // using the kCFStringTransformToLatin, kCFStringTransformStripCombiningMarks, kCFStringTransformStripDiacritics
CFStringFold() // using kCFCompareNonliteral, kCFCompareWidthInsensitive, kCFCompareLocalized in a number of combinations. Aside: how on earth do I normalize by simply _composing_ already decomposed strings? As soon as I add that step, my formerly passing tests fail as well...

I've skimmed over the ICU User Guide for Transforms but didn't invest too heavily in it…for what I think are obvious reasons.

I know that I could catch case (2) by transforming to uppercase and then back to lowercase, which would work within the realms of this particular application. I am, however, interested in solving this problem on a more fundamental level, hopefully allowing for case-sensitive applications as well.

Any hints would be greatly appreciated!

score 7 · Answer 1 · answered Mar 18 '13 at 20:10

Congratulations, you've found one of the more painful bits of text processing!

First off, NamesList.txt and CaseFolding.txt are indispensable resources for things like this, if you haven't already seen them.

Part of the problem is you're trying to do something almost correct that works in all the languages/locales you care about, whereas Unicode is more concerned about doing the correct thing when displaying strings in a single language-locale.

For (2), ß has canonically case-folded to ss since the earliest CaseFolding.txt I can find (3.0-Update1/CaseFolding-2.txt). CFStringFold() and -[NSString stringByFoldingWithOptions:] ought to do the right thing, but if not, a "locale-independent" s.upper().lower() appears to give a sensible answer for all inputs (and also handles the infamous "Turkish I").
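The `s.upper().lower()` notation above is Python, so here is a minimal Python sketch of that round trip next to Python's built-in `str.casefold()`, which applies Unicode full case folding. It only illustrates the ß behaviour on the Python side; it makes no claim about what `CFStringFold()` does on any particular OS version.

```python
# Python's str methods do not consult the user's locale, so these
# results are the same on every machine.
name = "Großmann"

# Round trip through uppercase: ß uppercases to "SS", which lowercases to "ss".
round_trip = name.upper().lower()

# Full case folding maps ß to "ss" directly.
folded = name.casefold()

print(round_trip)  # grossmann
print(folded)      # grossmann

# A plain lower() leaves ß alone, so the query "gros" would not match it.
print(name.lower())            # großmann
print("gros" in name.lower())  # False
print("gros" in folded)        # True
```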

For (5), you're a little out of luck: Unicode 6.2 doesn't appear to contain a normative mapping from Æ to AE, and the character's name has changed from "letter" to "ligature" and back again (U+00C6 is LATIN CAPITAL LETTER A E in 1.0, LATIN CAPITAL LIGATURE AE in 1.1, and LATIN CAPITAL LETTER AE in 2.0). You could search NamesList.txt for "ligature" and add a bunch of special cases.
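One way to add such special cases is a small hand-maintained table applied before the other folding steps. The sketch below is illustrative only: `LIGATURE_EXPANSIONS` and `expand_ligatures` are hypothetical names, and the four entries are an assumption about which characters matter here, not a list derived from NamesList.txt.

```python
# Hypothetical, hand-maintained table. Extend it with whatever a search of
# NamesList.txt turns up for the languages you care about.
LIGATURE_EXPANSIONS = {
    "Æ": "AE", "æ": "ae",
    "Œ": "OE", "œ": "oe",
}

# str.translate accepts a table that maps one character to a longer string.
_TABLE = str.maketrans(LIGATURE_EXPANSIONS)

def expand_ligatures(text: str) -> str:
    """Replace each character in the table with its multi-letter expansion."""
    return text.translate(_TABLE)

key = expand_ligatures("Æreogramme").lower()
print(key)                      # aereogramme
print(key.startswith("aereo"))  # True
```

Because the table keeps upper and lower case separate, the expansion can also run without the final `lower()` in a case-sensitive application.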

Notes:

CFStringNormalize() doesn't do what you want. You do want to normalize strings before adding them to the index; I suggest NFKC at the start and end of other processing.
CFStringTransform() doesn't quite do what you want either: all of your inputs are already in the Latin script, so kCFStringTransformToLatin has nothing to convert.
CFStringFold() is order-dependent: The combining ypogegrammeni and prosgegrammeni are stripped by kCFCompareDiacriticInsensitive but converted to a lowercase iota by kCFCompareCaseInsensitive. The "correct" thing appears to be to do the case-fold first followed by the others, although stripping it may make more sense linguistically.
You almost certainly do not want to use kCFCompareLocalized unless you want to rebuild the search index every time the locale changes.
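The normalization and ordering advice in these notes can be sketched in Python with the standard `unicodedata` module. This is an analogy, not a CoreFoundation example: `strip_marks` and `search_key` are hypothetical helpers, and dropping every character of category Mn is an assumed stand-in for diacritic-insensitive folding, not a description of what `kCFCompareDiacriticInsensitive` does internally.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def search_key(text: str) -> str:
    """NFKC first, case-fold, strip marks, NFKC again at the end."""
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    text = strip_marks(text)
    return unicodedata.normalize("NFKC", text)

print(search_key("Müller"))     # muller
print(search_key("Çingletòn"))  # cingleton

# Order matters: U+1FB3 is a small alpha with ypogegrammeni (iota subscript).
alpha_ypo = "\u1fb3"
print(strip_marks(alpha_ypo.casefold()))  # αι (case-folding first keeps an iota)
print(strip_marks(alpha_ypo).casefold())  # α  (stripping first drops it)
```

Letters without a decomposition, such as ø or æ, pass through this sketch unchanged and still need a special-case table.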

Readers from other languages note: Check that the function you use is not dependent on the user's current locale! Java users should use something like s.toUpperCase(Locale.ENGLISH), .NET users should use s.ToUpperInvariant(). If you actually want the user's current locale, specify it explicitly.

+1 **Awesome!** I’ve already come to the conclusion, that I’ll never get an answer to this question. I’m no longer working on this problem, so it’ll take some time for me to fully appreciate this one — I guess I have a bit of reading to do over the weekend! — danyowdee, Mar 19 '13 at 13:03

score 0 · Answer 2 · answered Sep 18 '16 at 02:09

I've used the following computed property in an extension on String (with Foundation imported), which seems to work nicely.

/// normalized version of string for comparisons and database lookups.  If normalization fails or results in an empty string, original string is returned.
var normalized: String? {
    // expand ligatures and other joined characters and flatten to simple ascii (æ => ae, etc.) by converting to ascii data and back
    guard let data = self.data(using: String.Encoding.ascii, allowLossyConversion: true) else {
        print("WARNING: Unable to convert string to ASCII Data: \(self)")
        return self
    }
    guard let processed = String(data: data, encoding: String.Encoding.ascii) else {
        print("WARNING: Unable to decode ASCII Data normalizing string: \(self)")
        return self
    }
    var normalized = processed

    // remove "?" placeholders: curly quotes and the like are destroyed by the lossy conversion above
    normalized = normalized.replacingOccurrences(of: "?", with: "")
    // strip apostrophes
    normalized = normalized.replacingOccurrences(of: "'", with: "")
    // lowercase string
    normalized = normalized.lowercased()

    // replace each run of non-alphanumeric characters (spaces, tabs and line breaks included)
    // with a single space; dropping the empty pieces also trims both ends
    normalized = normalized
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
        .joined(separator: " ")

    // may return an empty string if no alphanumeric characters!  In this case, use the raw string as the "normalized" form
    if normalized == "" {
        return self
    } else {
        return normalized
    }
}
