I have two strings in Javascript: "_strange_chars_µö¬é@zendesk.com.eml" (f1) and "_strange_chars_µö¬é@zendesk.com.eml" (f2). At first glance, they look identical (and, indeed, on StackOverflow, they may be; I'm not sure what happens when they are pasted into a form like this.) In my application, however,

f1[16] // ö
f2[16] // o
f1[17] // ¬
f2[17] // ̈

That is, where f1 uses the ö character, f2 uses an o and a diacritic ¨ as a separate character. What comparison can I do that will show these two strings to be "equal"?
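
One way to make the difference visible is to dump the code points around index 16. This is only an illustrative sketch: the two filenames are rebuilt with explicit escapes so they survive copy and paste, only the ö is decomposed in f2 (the one difference shown above; the real f2 may differ elsewhere too), and `codePoints` is a hypothetical helper name. It assumes an ES2017-capable environment for `Array.from`, `codePointAt` and `padStart`.

```javascript
// Assumed reconstruction of the two filenames, written with escapes.
const f1 = "_strange_chars_\u00B5\u00F6\u00AC\u00E9@zendesk.com.eml";  // ö as one code point
const f2 = "_strange_chars_\u00B5o\u0308\u00AC\u00E9@zendesk.com.eml"; // o + combining diaeresis

// Hypothetical helper: format every code point of a string as U+XXXX.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(f1 === f2);                               // false
console.log(codePoints(f1.slice(15, 18)).join(" "));  // U+00B5 U+00F6 U+00AC
console.log(codePoints(f2.slice(15, 18)).join(" "));  // U+00B5 U+006F U+0308
```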

Deduplicator
James A. Rosen
  • One solution -- perhaps the only one -- would be to "canonicalize" (in the Unicode sense) the two strings, but I haven't been able to find a library or function for that yet. – James A. Rosen Aug 17 '11 at 18:53
  • Are you sure that you have declared UTF-8 in your meta tags? – cwallenpoole Aug 17 '11 at 18:56
  • Great question, @cwallenpoole. I'm not, but I'll double-check now. The two strings I've described definitely _can_ both be valid Unicode, but I'm not certain they _are_. – James A. Rosen Aug 17 '11 at 19:02
  • @cwallenpoole the page declares `` and the form (a file input is the source of the first string) declares `accept-charset="UTF-8"`. And, of course, the HTTP request and response are also UTF-8. I think this is just a case of different systems (browser vs. server) using different Unicode canonicalization. (Or using versus not using canonicalization.) – James A. Rosen Aug 17 '11 at 19:13

1 Answer

> f1 uses the ö character, f2 uses an o and a diacritic ¨ as a separate character.

f1 is in Normal Form C (composed) and f2 in Normal Form D (decomposed). In general Normal Form C is the most common on Windows and the web, with the Unicode FAQ describing it as “the best form for general text”. Unfortunately the Apple world plumped for Normal Form D in order to be gratuitously different.

The strings are canonically equivalent by the rules of Unicode equivalence.

> What comparison can I do that will show these two strings to be "equal"?

In general, you convert both strings to one Normal Form of your choosing and then compare them. For example in Python:

>>> import unicodedata
>>> a= u'\u00F6'  # ö composed
>>> b= u'o\u0308' # o then combining umlaut
>>> unicodedata.normalize('NFC', a)==unicodedata.normalize('NFC', b)
True

Similarly Java has the Normalizer class, .NET has String.Normalize, and many languages have bindings available to the ICU library, which also offers this feature.

Unfortunately, JavaScript had no native Unicode normalisation ability when this answer was written (ES2015 has since added String.prototype.normalize). Without it, the options are either:

  • doing it yourself, carting around large Unicode data tables to cover it all in JavaScript (see eg here for an example implementation); or

  • sending it back to the server-side (eg via XMLHttpRequest), where you've got a better-equipped language to do it.

bobince
  • Your statement about Apple is disingenuously untrue. Apple’s HFS+ filesystem uses (whilom-)NFD for perfectly sensible reasons. Precombined characters are considered compatibility characters by Unicode for roundtripping with legacy encodings, and are *not* the preferred form for internal use as you have here misrepresented. The standard recommendation is to NFD all incoming data as the very first step before you have your way with it, and to NFC all outgoing data as the very last step before you two part ways. Singletons are in consequence mutated, but that’s bound to happen eventually anyway. – tchrist Aug 17 '11 at 22:33
  • @tchrist: citation on composed characters being “compatibility”? They're certainly not Compatibility in the literal sense as there is after all Normal Form KC. The official [FAQ](http://www.unicode.org/faq/normalization.html) prefers NFC/NFKC, mentioning decomposition only as useful for internal handling. But filenames on HFS+ and UFS are *not* only internal, that data comes back to applications, and this has made many of them fall over. The OS X filesystem does not normalise filenames back to NFC on the way back out as you suggest should be done. – bobince Aug 18 '11 at 09:33
  • (Personally I think both case-insensitivity and composition-insensitivity are undesirable features in a filesystem, but at least in Windows's case you get the case back that you originally put in.) – bobince Aug 18 '11 at 09:34
  • JS has this now - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize. Unfortunately IE11 doesn't support this, and while you can use an external library like https://github.com/walling/unorm, it is massive for use on the frontend – Yi Jiang Feb 19 '20 at 00:19
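
Following the comment above, the answer's "normalise both, then compare" approach might look like this in JavaScript. This is a minimal sketch, assuming an environment that supports `String.prototype.normalize` (ES2015 or later, so not IE11 without a polyfill). `canonicallyEqual` is a hypothetical helper name, and NFC is an arbitrary choice: any single form works as long as both sides use it.

```javascript
// Hypothetical helper: treat two strings as equal when their NFC forms match.
function canonicallyEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const composed = "\u00F6";    // ö as a single code point
const decomposed = "o\u0308"; // o followed by U+0308 COMBINING DIAERESIS

console.log(composed === decomposed);                // false
console.log(canonicallyEqual(composed, decomposed)); // true
```

Note that this only covers canonical equivalence. It does not fold case, and compatibility variants (ligatures, full-width forms and the like) would need NFKC or NFKD instead.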