I am looking for a way to count ligatures as single units, the way they are displayed to the user, e.g. https://www.compart.com/en/unicode/U+FEFB.

When this character is typed (type G on an Arabic keyboard), it is inserted in its decomposed form, i.e. U+0644 U+0627.

I'm able to decompose U+FEFB by

escape(String.fromCodePoint(0xFEFB).normalize("NFKD")) // '%u0644%u0627'

Is there a way to compose U+0644 U+0627 into U+FEFB?

Why doesn't this work?

escape(String.fromCodePoint(0x0644, 0x0627).normalize("NFKC"))

The only idea I had was to iterate over the Unicode ranges I'm interested in, decompose each character and build a map, but I'm hoping there's a better way.
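A minimal sketch of that map idea, assuming the range of interest is the Arabic Presentation Forms-B block (U+FE70 to U+FEFF). `buildLigatureMap` is a hypothetical helper name, not a built-in. Each key maps to an array because nothing guarantees that only one code point decomposes to a given sequence:

```javascript
// Hypothetical helper: maps each NFKD decomposition to the code points
// in [start, end] that decompose to it.
function buildLigatureMap(start, end) {
  const map = new Map();
  for (let cp = start; cp <= end; cp++) {
    const ch = String.fromCodePoint(cp);
    const decomposed = ch.normalize("NFKD");
    // Skip code points that NFKD leaves unchanged.
    if (decomposed === ch) continue;
    if (!map.has(decomposed)) map.set(decomposed, []);
    map.get(decomposed).push(cp);
  }
  return map;
}

// Assumed range: Arabic Presentation Forms-B.
const ligatures = buildLigatureMap(0xFE70, 0xFEFF);
const candidates = ligatures.get("\u0644\u0627") || [];
console.log(candidates.map(cp => "U+" + cp.toString(16).toUpperCase()));
```

If more than one candidate comes back for a sequence, the caller still has to decide which one applies, so this only narrows the lookup rather than giving a true inverse.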

psmn
    chars -> ligature is not a function (in the math sense), because there can be multiple ligatures decomposing to the same chars. In your case, lam+alef can be isolated (FEFB) or final (FEFC) and there's no way for the composer to know what you need. – georg Nov 14 '19 at 22:37
  • @georg Thanks a lot, now it makes complete sense to me. – psmn Nov 15 '19 at 06:51

1 Answer

Given that the ES2019 spec requires the implementation to:

Let ns be the String value that is the result of normalizing S into the normalization form named by f as specified in https://unicode.org/reports/tr15/.

and given that https://www.unicode.org/Public/12.1.0/ucd/NormalizationTest.txt describes that character as

FEFB;FEFB;FEFB;0644 0627;0644 0627; # (ﻻ; ﻻ; ﻻ; لا; لا; ) ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

this is the compliant behaviour. See the invariants stated in that file, where c1 to c5 are the five columns of the line above:

# 1. The following invariants must be true for all conformant implementations
#
#    NFC
#      c2 ==  toNFC(c1) ==  toNFC(c2) ==  toNFC(c3)
#      c4 ==  toNFC(c4) ==  toNFC(c5)
#
#    NFD
#      c3 ==  toNFD(c1) ==  toNFD(c2) ==  toNFD(c3)
#      c5 ==  toNFD(c4) ==  toNFD(c5)
#
#    NFKC
#      c4 == toNFKC(c1) == toNFKC(c2) == toNFKC(c3) == toNFKC(c4) == toNFKC(c5)
#
#    NFKD
#      c5 == toNFKD(c1) == toNFKD(c2) == toNFKD(c3) == toNFKD(c4) == toNFKD(c5)

No normalisation form converts the c4 or c5 form back to c1, c2, or c3.

So, in my Unicode-amateur opinion, there is no standard-compliant way to normalise U+0644 U+0627 back to U+FEFB.
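The invariants can be checked directly in a JavaScript console. This is only an illustrative sketch: `hex` is a small helper introduced here to print code points, and `ligature` and `pair` are names chosen for this example.

```javascript
const ligature = String.fromCodePoint(0xFEFB);      // c1, c2 and c3 for this test line
const pair = String.fromCodePoint(0x0644, 0x0627);  // c4 and c5 for this test line

// Helper for this example only: space-separated hex code points of a string.
const hex = s =>
  Array.from(s, ch => ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")).join(" ");

for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  console.log(form, hex(ligature.normalize(form)), "|", hex(pair.normalize(form)));
}
// NFC  FEFB      | 0644 0627
// NFD  FEFB      | 0644 0627
// NFKC 0644 0627 | 0644 0627
// NFKD 0644 0627 | 0644 0627
```

The right-hand column never changes: whichever form is requested, the two-character sequence stays as it is.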

zerkms