
So I have one application (let's call it the client) which uses strings containing diacritics/accents. This application needs to make requests to another application (let's call it the web service) using these strings, and the web service is designed to receive such strings.

But when the client makes a request to the web service, things do not work as expected. On investigation, I realized the problem is with the diacritics.

Basically, it seems some diacritics that appear identical to the naked eye have different Unicode representations.

For example, take what is normally called the acute accent: there is one with an octal representation of 01401 (https://unicodelookup.com/#01401/1) and another with an octal representation of 01501 (https://unicodelookup.com/#01501/1).

The one with an octal representation of 01401 is referred to as the combining acute accent, while the one with 01501 is referred to as the combining acute tone mark. So apart from having different representations, they also appear to be semantically different.
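To make this concrete, here is a quick JavaScript check (run in the browser console) showing that the two marks are distinct code points, so strings containing them compare unequal even though they can render identically:

```javascript
// U+0301 COMBINING ACUTE ACCENT vs U+0341 COMBINING ACUTE TONE MARK
var acuteAccent = "\u0301";
var acuteToneMark = "\u0341";

// The octal values match the unicodelookup.com references above
acuteAccent.codePointAt(0).toString(8);   // "1401"
acuteToneMark.codePointAt(0).toString(8); // "1501"

// Visually identical when combined with a base letter, but not equal
"a" + acuteAccent === "a" + acuteToneMark; // false
```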

And this is the root of the problem I am having: the client creates its strings with the diacritic called the combining acute tone mark, while the web service expects strings with the combining acute accent.

So the question is: what exactly is the difference between these two? (Googling does not seem to turn up anything helpful here.) And how may I "normalize/convert" between these two representations (as I believe this is what needs to be done) in order for the client to be able to make successful calls to the web service?

**Update:** I should mention that the strings being sent by the client come from the browser, which is why I can copy a string and look it up using the tool at unicodelookup.com.

I just did the same lookup again, but from a different computer (previously I was doing the lookup on my Mac). This time, when I did the lookup (by copying the character from the browser URL and pasting it into unicodelookup.com), what was returning as the combining acute tone mark now returns as the combining acute accent.

Thought I should mention this observation.
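For anyone trying to reproduce this: a way to see exactly which code points a string contains, without a copy/paste step that might itself alter the text, is to list them in the browser console. The `codePoints` helper below is just something I wrote for illustration:

```javascript
// Illustrative helper: lists each code point of a string in U+XXXX form,
// so U+0301 and U+0341 can be told apart without relying on copy/paste
function codePoints(s) {
  return [...s].map(function (c) {
    return "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  });
}

codePoints("a\u0341"); // ["U+0061", "U+0341"]
```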

dade
  • Looking at the relevant [Unicode chart](http://www.unicode.org/charts/PDF/U0300.pdf), I see that the combining tone mark U+0341 is from Vietnamese, so which diacritic you mean depends on the language that the word in question is from. The Unicode standard also says that 0341 is discouraged and 0301 (the accent) should be used instead. – Kerrek SB Sep 24 '17 at 17:50
  • @KerrekSB thanks for chiming in. May I ask how the actual language will help in deciphering the problem? Let's assume the language is actually Vietnamese; what then will be needed to resolve the difference between "combining acute accent" and "combining acute tone mark"? Are there some settings I need to update on both the client and the service to make sure they are in sync? tnx – dade Sep 24 '17 at 19:13
  • That's a font issue. E.g. OTF has the GSUB and GPOS tables, which are parametrised by language. So your application needs to pass the language to the typesetter, which then runs the Unicode algorithm followed by GSUB and GPOS lookups to produce the desired glyph positions. In other words, "text = character data + language". – Kerrek SB Sep 24 '17 at 19:25
  • To make sure I follow you: you are saying it is possible to have the same binary representation of a character (or Unicode code point?) but, after interpretation by two different fonts, end up with two different code points being displayed? I am aware of encoding problems pre-Unicode, where a character is encoded with one charset but decoded with a different one, but I thought the idea behind Unicode was to prevent this? Thus if you have a sequence of Unicode code points, you are assured they will be decoded uniformly? Seems this is not always the case? – dade Oct 04 '17 at 11:10

1 Answer


You can do Unicode normalization in the browser (as of ES6) by using JavaScript's String.prototype.normalize method.

When accepting Unicode user input, normalization is something that should be done when any sort of comparison is involved, for more than just the reason you discovered. Which normalization form you should use depends on the use case, but NFC is a good one to default to (which is what .normalize() with no argument does).
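For example (as an aside, since your case only involves canonical equivalence), the compatibility forms NFKC/NFKD fold visual variants that NFC/NFD preserve, which is why the choice of form depends on the use case:

```javascript
// The "fi" ligature (U+FB01) is compatibility-equivalent, not canonically
// equivalent, to "fi" — canonical forms keep it, compatibility forms fold it.
var ligature = "\uFB01";                // "ﬁ"
ligature.normalize("NFC") === "\uFB01"; // true, NFC leaves the ligature alone
ligature.normalize("NFKC") === "fi";    // true, NFKC folds it to plain "fi"
```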

The combining marks you mentioned are "canonically equivalent", so normalizing both with the same form (it doesn't matter which one in this specific case) will always make them equal:

```javascript
var accented = "a\u0301"    // "á" (a + U+0301 COMBINING ACUTE ACCENT)
var toneMarked = "a\u0341"  // "á" (a + U+0341 COMBINING ACUTE TONE MARK)

accented === toneMarked  // false

accented === accented.normalize("NFD")  // true, this is already the canonical decomposed form
toneMarked === toneMarked.normalize("NFD")  // false
accented === toneMarked.normalize("NFD")  // true

accented.normalize("NFC") === toneMarked.normalize("NFC")  // true
accented.normalize() === toneMarked.normalize()  // equivalent to the above

// Also have to consider precomposed characters! (single code point)

var composed = "\u00E1"  // "á" (U+00E1 LATIN SMALL LETTER A WITH ACUTE)

composed === accented  // false

composed.normalize("NFD") === accented  // true

composed === composed.normalize()  // true, this is already the canonical composed form
composed === accented.normalize() && composed === toneMarked.normalize()  // true
```

Note that normalization should usually be performed on the backend since you can't assume anything about data coming from a user client.
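As a rough sketch of what that could look like on a Node.js backend (the Express setup and the `findUser` lookup here are assumptions for illustration, not part of your actual service):

```javascript
// Sketch only: Express-style handler with a hypothetical findUser() lookup.
// The point is to normalize incoming text before any comparison or storage.
const express = require("express");
const app = express();
app.use(express.json());

app.post("/users/search", function (req, res) {
  // Normalize to NFC so "a\u0301", "a\u0341" and "\u00E1" all compare equal
  const name = String(req.body.name || "").normalize("NFC");
  const user = findUser(name); // hypothetical lookup against NFC-normalized data
  res.json(user || {});
});

app.listen(3000);
```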

Inkling