86

I was playing around with Unicode identifiers and stumbled upon this:

>>> 𝑓, x = 1, 2
>>> 𝑓, x
(1, 2)
>>> 𝑓, f = 1, 2
>>> 𝑓, f
(2, 2)

What's going on here? Why does Python replace the object referenced by 𝑓, but only sometimes? Where is that behavior described?

Peter Mortensen
Erik Cederstrand

2 Answers

87

PEP 3131 -- Supporting Non-ASCII Identifiers says:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

You can use unicodedata to test the conversions:

import unicodedata

unicodedata.normalize('NFKC', '𝑓')
# f

which indicates that '𝑓' is converted to 'f' during parsing, leading to the expected result:

𝑓 = "Some String"
print(f)
# Some String
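
The effect on name binding can also be checked without typing the character directly. The following is a minimal sketch, assuming MATHEMATICAL ITALIC SMALL F as a stand-in for the f-lookalike; the variable names `fancy_f`, `namespace` and `bound_names` are illustrative only. It shows that the name stored in the namespace is the normalized one, and that plain string lookups are not normalized.

```python
import unicodedata

# Assumption: MATHEMATICAL ITALIC SMALL F stands in for the f-lookalike above.
fancy_f = "\N{MATHEMATICAL ITALIC SMALL F}"
normalized = unicodedata.normalize("NFKC", fancy_f)

# Run an assignment to the fancy name and inspect which key is actually bound.
namespace = {}
exec(f"{fancy_f} = 'Some String'", namespace)
bound_names = [name for name in namespace if not name.startswith("__")]

print(normalized)            # f
print(bound_names)           # ['f']
print(fancy_f in namespace)  # False: dict lookups by string are not normalized
```

One consequence is that code using `globals()`, `getattr` or similar string-based lookups has to normalize the name itself if it may contain such characters.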
Erik Cederstrand
Mark
  • 26
    This is a great answer, but a terrible decision by the Python core devs. I note that in the discussion of this PEP, one of the objections was that Unicode is poorly-understood and has weak tooling. Now, over a decade later, I wonder if it's time to re-think the romanization of Unicode identifiers. – Adam Smith Jun 08 '20 at 06:48
  • 36
    @AdamSmith but Unicode normalisation isn't romanisation. You can have `π` as a Python identifier that is distinct from `p` just fine. If I understand correctly, the NFK* folding is about characters that the Unicode folks thought should have been the same character to begin with, but they can't be merged because of backwards-compatibility with some legacy encodings. – lenz Jun 08 '20 at 08:33
  • 21
    There are two kinds of character equivalence: canonical and compatibility. Canonical equivalence should render the exact same glyph, which is not the case between 𝑓 and f. NFKC normalizes both canonical and compatibility equivalences, which I agree is a bad choice for a programming language like Python, which differentiates between letter cases: it is expected that identifiers that render differently should be different. Python should have used NFC, which ensures 𝑓 and f are different things. – lvella Jun 08 '20 at 14:52
  • 30
    Some form of normalization is needed because of, for example, Latin characters with diacritics: if I see a character like 'ü', it might be either a composite character (u + combining diaeresis) or a precomposed single character. The user would have no reasonable way or desire to distinguish them, and their preferred input method would likely allow inputting only one of these options. So it's desirable that if I see 'ü' and type 'ü', the language considers the characters equivalent even if they're encoded differently, though NFC normalization would probably be sufficient for that. – Peteris Jun 08 '20 at 16:15
  • 10
    Python supports Unicode for identifiers in order to facilitate its use in defining identifiers in non-English languages, not to provide equal access to all Unicode code points. For example, it is currently quite difficult to hack the parser to support Unicode operators, because any non-ASCII character is first assumed to be part of an identifier, even if the Unicode character in question isn't a valid part of an identifier. The idea is not to support mining Unicode for "interesting" characters, but to support characters produced by standard non-English keyboard layouts. – chepner Jun 08 '20 at 19:24
  • 4
    Characters like `𝑓` are "accidentally" allowed, because they are categorized as letters, even if no natural language uses them as part of its writing system. – chepner Jun 08 '20 at 19:26
  • 7
    Using NFKC for identifiers is the recommendation on the Unicode website https://unicode.org/faq/normalization.html – user7868 Jun 09 '20 at 05:19
  • 1
    and is a unicode glyph that, sadly, is an illegal name in Python. Even ⺒ is illegal. – Wayne Werner Nov 10 '20 at 20:27
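
The distinction drawn in the comments above between canonical and compatibility equivalence can be checked with `unicodedata`. This is a small sketch, not part of the original answers; MATHEMATICAL ITALIC SMALL F is an assumed example of a compatibility-only lookalike, and the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00fc"   # ü as a single code point
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS
math_f = "\N{MATHEMATICAL ITALIC SMALL F}"  # assumed example character

# Canonical equivalence: NFC already unifies the two spellings of ü.
nfc_unifies_umlaut = unicodedata.normalize("NFC", decomposed) == precomposed

# Compatibility equivalence: only the NFK* forms fold the math letter into f.
nfc_folds_f = unicodedata.normalize("NFC", math_f) == "f"
nfkc_folds_f = unicodedata.normalize("NFKC", math_f) == "f"

print(nfc_unifies_umlaut)  # True
print(nfc_folds_f)         # False
print(nfkc_folds_f)        # True
```

So NFC would be enough for the 'ü' case, while the folding of the math letter into `f` comes specifically from the compatibility (K) part of NFKC.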
31

Here's a small example, just to show how horrible this "feature" is:

Thᵢs_featuᵣₑ_sₕould_dₑfᵢniteℓy_be_a_bᵘg = 42
print(Thℹs_featuᵣe_ₛhºuld_defᵢnⁱtᵉly_bℯ_a_bug)
# => 42

Try it online! (But please don't use it)

And as mentioned by @MarkMeyer, two identifiers might be distinct even though they look just the same ("CYRILLIC CAPITAL LETTER A" and "LATIN CAPITAL LETTER A"):

А = 42
print(A)
# => NameError: name 'A' is not defined
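
When two names look identical but do not match, listing the Unicode names of their characters usually reveals the difference. A minimal sketch follows; the helper `describe_non_ascii` is hypothetical and not part of any library.

```python
import unicodedata

def describe_non_ascii(identifier):
    """Return (character, Unicode name) pairs for every non-ASCII character."""
    return [
        (char, unicodedata.name(char, "<unnamed>"))
        for char in identifier
        if ord(char) > 127
    ]

cyrillic_a = "\u0410"  # looks like the Latin capital A
latin_a = "A"

print(describe_non_ascii(cyrillic_a))  # [('А', 'CYRILLIC CAPITAL LETTER A')]
print(describe_non_ascii(latin_a))     # []

# NFKC does not fold one script into another, so the names stay distinct.
still_distinct = unicodedata.normalize("NFKC", cyrillic_a) != latin_a
print(still_distinct)                  # True
```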
Eric Duminil
  • 3
    Makes me want to write an equivalent of jsfuck.com... python-unicode-hell.com ? – Mathieu VIALES Jun 08 '20 at 17:28
  • 2
    @MathieuVIALES ʳₑ ᵗ ᵈ º. I aᵉ ᵉ arₒ. ʷnℯ o ⅽᵉ ʷₜ ᵗ, t ℎₑ ⅈ ᵘ t oible ⁿv ᵘₑⅾ ⅈt. l . – Eric Duminil Jun 08 '20 at 17:32
  • 8
    And then of course: `А = 42; print(A)` --> "NameError: name 'A' is not defined" – Mark Jun 08 '20 at 17:54
  • 10
    The point was never to open the door to arbitrarily complex identifier names, but to facilitate typing identifiers in a programmer's native language (using a keyboard layout native to that language). Better to go by Unicode's classification of a code point as a letter than to act as the arbiter for which writing systems can and cannot be used for identifiers. (And limiting an identifier to characters from a single writing system is far beyond the parser's pay grade.) – chepner Jun 08 '20 at 19:32
  • @chepner: But then, wouldn't it be better to allow both `` and `` for example, but make sure they are distinct identifiers? Also, `ⓧ` somehow isn't a valid identifier, even though it normalizes to `x` as well. – Eric Duminil Jun 08 '20 at 19:49
  • 13
    None of those code points are part of any natural language's writing system, so whether any of them are acceptable as part of an identifier is almost "accidental", based on Unicode classification rather than any explicit endorsement by Python itself. – chepner Jun 08 '20 at 20:00
  • 2
    @chepner: Thanks. It (almost) makes sense to me, even if the resulting behavior can be really surprising. Python is in general very well designed, so these kinds of quirks stick out like a sore thumb. Nobody would notice or care if it were just another JavaScript WTF. – Eric Duminil Jun 08 '20 at 20:06
  • @chepner: which country has a keyboard layout that does not have latin letters on it? I'm interested to find one. E.g. Chinese type Chinese by using Pinyin, which again, is latin. – Thomas Weller Sep 18 '20 at 10:24
  • 1
    @ThomasWeller There are *lots* of other writing systems aside from the Latin alphabet, and most of them are more conducive to typing than the Chinese writing system. Tamil, Hindi, Cyrillic, Georgian, Armenian, .... I'll plead ignorance as to how people using those systems actually code, but I assume it's not uncommon to switch between an "English" layout and their native layout while coding to facilitate typing regular ASCII and native identifiers. – chepner Sep 18 '20 at 12:10
  • (Given that the CPython parser was *only* modified to allow arbitrary Unicode in identifiers, without the ability to handle it in other places like operators or keywords, I assume this change was made by specific request from Python users.) (I haven't looked at the new PEG parser to see if it handles Unicode differently.) – chepner Sep 18 '20 at 12:15
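
The point made in the comments above, that acceptance depends on Unicode classification rather than on what a character normalizes to, can be illustrated with `str.isidentifier` and `unicodedata.category`. This is a rough sketch with an assumed pair of example characters; `candidates` and `report` are illustrative names.

```python
import unicodedata

# Assumed examples: one letter-like character and one symbol-like character.
candidates = {
    "math italic f": "\N{MATHEMATICAL ITALIC SMALL F}",
    "circled x": "\N{CIRCLED LATIN SMALL LETTER X}",
}

report = {}
for label, char in candidates.items():
    report[label] = (
        unicodedata.category(char),           # general category
        unicodedata.normalize("NFKC", char),  # what it would fold to
        char.isidentifier(),                  # whether Python accepts it as a name
    )

print(report["math italic f"])  # ('Ll', 'f', True)
print(report["circled x"])      # ('So', 'x', False)
```

Both characters fold to an ASCII letter under NFKC, but only the one classified as a letter is accepted as an identifier; the circled letter is classified as a symbol.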