
I notice that when normalizing a Unicode string to NFKC form, superscript characters like ¹ (U+00B9), ² (U+00B2), ³ (U+00B3), etc. are converted to the corresponding ASCII digits (e.g. 1, 2, 3).
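For reference, a minimal Python sketch (using the standard `unicodedata` module) reproduces the behavior I'm describing:

```python
import unicodedata

# NFKC applies compatibility mappings, so superscript digits fold to ASCII digits.
print(unicodedata.normalize("NFKC", "x\u00b2"))   # -> 'x2'
print(unicodedata.normalize("NFKC", "10\u00b3"))  # -> '103'

# NFC applies only canonical mappings and leaves the superscripts untouched.
print(unicodedata.normalize("NFC", "x\u00b2"))    # -> 'x²'
```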

Does anyone know the rationale for this behavior? It seems to lose information in the process, since a superscript number usually carries some contextual meaning.

The "K" apparently stands for "compatibility" (well... I guess the "C" was already used for "canonical"). [Wikipedia](https://en.wikipedia.org/wiki/Unicode_equivalence) says: "Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others." – lenz Apr 27 '18 at 09:16
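To illustrate the distinction the quote draws, here is a small Python sketch (again assuming the standard `unicodedata` module): canonically equivalent sequences are merged by both NFC and NFKC, while compatibility-equivalent ones are merged only by NFKC.

```python
import unicodedata

# Canonically equivalent: precomposed é (U+00E9) vs. e + combining acute (U+0301).
# Both NFC and NFKC map them to the same string.
assert unicodedata.normalize("NFC", "\u00e9") == unicodedata.normalize("NFC", "e\u0301")

# Only compatibility-equivalent: ² (U+00B2) vs. 2.
# NFC keeps them distinct; NFKC folds them together.
assert unicodedata.normalize("NFC", "\u00b2") != "2"
assert unicodedata.normalize("NFKC", "\u00b2") == "2"
```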

0 Answers