Unicode subscripts and superscripts in identifiers, why does Python consider Xu == Xᵘ == Xᵤ?

Question

Python allows Unicode identifiers. I defined Xᵘ = 42, expecting Xu and Xᵤ to result in a NameError. But in reality, when I define Xᵘ, Python (silently?) turns Xᵘ into Xu, which strikes me as a somewhat unpythonic thing to do. Why is this happening?

>>> Xᵘ = 42
>>> print((Xu, Xᵘ, Xᵤ))
(42, 42, 42)

PyCharm (2013.2.3) flags the `Xu, Xᵤ` as `unresolved references` but the code runs nonetheless — Ma0, Jan 23 '18 at 15:12
@Ev.Kounis: that'd be a bug in PyCharm, they are forgetting to normalise to the NFKC form. — Martijn Pieters, Jan 23 '18 at 15:16
How do I type these subscripts in Pycharm? This is fantastic. — Steve3p0, Jul 15 '20 at 06:32

Martijn Pieters · Accepted Answer · 2018-01-23T15:42:13.017

Python converts all identifiers to their NFKC normal form; from the Identifiers section of the reference documentation:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

The NFKC form of both the super and subscript characters is the lowercase u:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
'Xu Xu'

So in the end, all you have is a single identifier, Xu:

>>> import dis
>>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
  1           0 LOAD_CONST               0 (42)
              2 STORE_NAME               0 (Xu)

  2           4 LOAD_NAME                1 (print)
              6 LOAD_NAME                0 (Xu)
              8 LOAD_NAME                0 (Xu)
             10 LOAD_NAME                0 (Xu)
             12 BUILD_TUPLE              3
             14 CALL_FUNCTION            1
             16 POP_TOP
             18 LOAD_CONST               1 (None)
             20 RETURN_VALUE

The above disassembly of the compiled bytecode shows that the identifiers were normalised before the bytecode was produced. This happens during parsing: identifiers are normalised when creating the AST (Abstract Syntax Tree), which the compiler then uses to produce bytecode.
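You can also look at the AST directly instead of the bytecode. This is only a small sketch using the standard `ast` module; the variable names `tree` and `names` are mine, chosen for illustration:

```python
import ast

# Parse source that spells the same name three different ways.
tree = ast.parse("Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))")

# Collect the identifier stored on every Name node in the tree.
names = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]

# Only the normalised spelling (plus the print builtin) remains.
print(sorted(set(names)))  # ['Xu', 'print']
```

The superscript and subscript spellings never make it into the tree, so the compiler has nothing left to normalise.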

Identifiers are normalized to avoid many potential 'look-alike' bugs, where you could otherwise end up using both ﬁnd() (using the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII characters nd) and find(), and wonder why your code has a bug.
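To make the ligature case concrete, here is a rough sketch. The ligature is written as the escape `\ufb01` so it stays visible, and the names `ligature_name`, `normalized` and `namespace` are just illustrative:

```python
import unicodedata

# 'ﬁnd' spelled with U+FB01 LATIN SMALL LIGATURE FI followed by ASCII 'nd'.
ligature_name = "\ufb01nd"

# As plain strings the two spellings differ...
print(ligature_name == "find")  # False

# ...but NFKC normalisation expands the ligature to ASCII 'fi'.
normalized = unicodedata.normalize("NFKC", ligature_name)
print(normalized == "find")  # True

# Define a function using the ligature spelling in source code.
namespace = {}
exec("def \ufb01nd():\n    return 'found'\n", namespace)

# The parser stored it under the normalised ASCII name.
print(namespace["find"]())  # found
```

So a function defined with the ligature and one defined with plain ASCII letters end up being the same name, rather than two near-identical names that shadow each other unpredictably.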

This might be a stupid question but should there not be a step in the `dis` where the name is converted to its NFKC form? In other words, should it not take a *tad* longer to define a value like that? — Ma0, Jan 23 '18 at 15:17
@Ev.Kounis: no, because the identifier has been normalised *before* bytecode is produced, when parsing (a stage that converts tokens into an AST, which the compiler then uses to produce bytecode). — Martijn Pieters, Jan 23 '18 at 15:21
I see. Yet normalisation does not prevent `a٨ = 42; a۸ = 43; a٨ == a۸` resulting in `False`… — gerrit, Jan 23 '18 at 17:07
@gerrit: 'many potential' is not *all* potential bugs. See the [codepoints.net page for *U+0668 ARABIC-INDIC DIGIT EIGHT*](https://codepoints.net/U+0668) for more options to confuse that codepoint with. — Martijn Pieters, Jan 23 '18 at 17:16
@gerrit: there is a good reason that the Python style guide recommends all variable names to be ASCII-only English terms. — Martijn Pieters, Jan 23 '18 at 17:19

score 3 · Answer 2 · answered Jan 23 '18 at 15:14

Python, as of version 3.0, supports non-ASCII identifiers. When parsing, identifiers are converted using NFKC normalization, and any identifiers whose normalized values are equal are considered the same identifier.
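One way to see which spelling is actually kept is to run the assignment in a fresh namespace and look at the dictionary keys. A minimal sketch (the name `namespace` is mine); note that the string keys you look up yourself are ordinary strings and are not normalized for you:

```python
import unicodedata

# Run the assignment from the question in an empty namespace.
namespace = {}
exec("Xᵘ = 42", namespace)

# The name is stored under its NFKC form.
print("Xu" in namespace)  # True

# The original superscript spelling is not a key; string lookups are not normalized.
print("Xᵘ" in namespace)  # False

# Normalizing the string by hand finds the key again.
print(namespace[unicodedata.normalize("NFKC", "Xᵘ")])  # 42
```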

See PEP 3131 for more details. https://www.python.org/dev/peps/pep-3131/
