Python3 surprising behavior of identifier being a non-ASCII Unicode character

Question

Following code runs without an assertion error:

K = 'K'
 = ''
 = ''
 = ''
 = ''
 = ''
ᴷ = 'ᴷ'
assert K ==  ==  ==  ==  == ᴷ
print(f'{K=}, {=}, {=}, {=}, {=}, {=}')

and prints

K='ᴷ', ='ᴷ', ='', ='ᴷ', ='ᴷ', ='ᴷ'

I am aware of https://peps.python.org/pep-3131/ and have read the Python documentation about identifiers https://docs.python.org/3/reference/lexical_analysis.html#identifiers but haven't found any hints explaining the experienced behavior.

So my question is: What is wrong with my expectation that the value of all of the other optical apparently different identifier doesn't change if a new value is assigned to one of them?

UPDATE: taking currently available comments and answers into account raises the need to explain more about what I expect as satisfying answer to my question:

The hint about NFKC conversion behind the comparison of names of identifiers helps to understand how it comes that the experienced behavior is there, but ... it leaves me still with the question opened what is the deep reason behind the choice to have different approaches for comparison of Unicode strings depending on context in which they occur?

The way strings as string literals are compared to each other apparently differs from the way same strings are compared if they specify names of identifiers.

What am I still missing to know about to be able to see the deep reason behind the why it was decided that Unicode strings representing names of identifiers in Python are not compared the same way to each other as Unicode strings representing string literals?

If I understand it right Unicode comes with the possibility to have ambiguous specifications for the same expected outcome using either one code point representing a complex character or multiple code points with an appropriate base character plus its modifiers. Normalization of the Unicode string is then an attempt on the way to resolve the mess caused by introducing the possibility of this ambiguity in first place. But this is the Unicode specific stuff having in my eyes the heaviest impact on Unicode visualization tools like viewer and editors. What a programming language using representation of a string as a list of integer values (Unicode code points) larger than 255 actually implements is another thing, isn't it?

Below some further attempts to find a better wording for the question I seek to get answered:

What is the advantage of creating the possibility that two different Unicode strings are eventually considered not to be different if they are used as names of Python identifiers?

What is the actual feature behind what I am considering to be a not making sense behavior because of broken WYSIWYG ability?

Below some more code illustrating what is going on and demonstrating the difference in comparison between string literals and identifier names originated in same strings as the strings literals:

from unicodedata import normalize as normal
itisasitisRepr = [                char       for char in ['K', '', '', '', '', '', 'ᴷ']]
hexintasisRepr = [         f'{ord(char):5X}' for char in itisasitisRepr]
normalizedRepr = [ normal('NFKC', char)      for char in itisasitisRepr]
hexintnormRepr = [         f'{ord(char):5X}' for char in normalizedRepr]
print(itisasitisRepr)
print(hexintasisRepr)
print(normalizedRepr)
print(hexintnormRepr)
print(f"{              'K' ==              ''  = }")
print(f"{normal('NFKC','K')==normal('NFKC','') = }")
print(ᴷ == , 'ᴷ' == '') # gives: True, False

gives:

['K', '', '', '', '', '', 'ᴷ']
['   4B', '1D542', '1D6B1', '1D50E', '1D576', '1D4DA', ' 1D37']
['K', 'K', 'Κ', 'K', 'K', 'K', 'K']
['   4B', '   4B', '  39A', '   4B', '   4B', '   4B', '   4B']
              'K' ==              ''  = False
normal('NFKC','K')==normal('NFKC','') = True

Does [this](https://stackoverflow.com/questions/34097193/identifier-normalization-why-is-the-micro-sign-converted-into-the-greek-letter) help? Notably the "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC." part? — mcsoini, Nov 29 '22 at 13:00
@mcsoini : yes, the hint about NFKC conversion helps to understand how it comes that the experienced behavior is there, but leaves me still with the question opened what is the deep reason behind the choice to have different approaches for comparison of Unicode strings depending on context in which they occur. The way strings as strings are compared to each other differs from the way same strings are compared if they specify names of identifiers. — Claudio, Nov 29 '22 at 15:40

paxdiablo · Answer 1 · 2022-12-04T01:00:50.813

Python identifiers with non-ASCII characters are subject to NFKC normalisation⁽¹⁾, you can see the effect in the following code:

import unicodedata
for char in ['K', '', '', '', '', '', 'ᴷ']:
    normalised_char = unicodedata.normalize('NFKC', char)
    print(char, normalised_char, ord(normalised_char))

The output of that is:

K K 75
 K 75
 Κ 922
 K 75
 K 75
 K 75
ᴷ K 75

This shows that all but one of those is the same identifier, which is why your assert passes (it's missing the one different identifier) and why most seem to be the same value. It's no different really to the following code, in which it is hopefully immediately clear what will happen:

a = '1'
a = '2'
b = '3'
a = '4'
a = '5'
a = '6'
a = '7'
assert a == a == a == a == a == a             # passes
print(f'{a=}, {a=}, {b=}, {a=}, {a=}, {a=}')  # a=7 a=7 b=3 a=7 a=7 a=7

In response to your update, specifically the text:

What is the advantage of creating the possibility that two different Unicode strings are eventually considered not to be different if they are used as names of Python identifiers?

My own particular viewpoint as a developer is that I want to be able to look at code and understand it. That's not going to be easy when different code-points map to similar or even identical graphemes⁽²⁾, such as with:

Ω = 1
Ω = 2
Ω = Ω + Ω
print(Ω * Ω)

What would you expect from that code? You set omega to one, then two. You then double it to four, and print the square which is sixteen. Easy, right?

And, in actual fact, that's what you do get in Python, despite the fact that there are both omega and ohm characters in that code, and that's because they normalise to the same identifier. Were they not normalised, you would instead have the equivalent of:

omega = 1
ohm = 2
ohm = omega + ohm
print(ohm * ohm)

And this would output nine rather than sixteen. Best of luck debugging that when you can't see a difference between the omega and ohm identifiers :-)

There are also diacritics that can have different representations, such as ḋ:

U+1e0b (Latin Small Letter D with Dot Above).
U+0064, U+0307 (Latin Small Letter D, Combining dot above).

And this may get even more complex where a base letter can have multiple diacritics such as ậ, ç̇, or ė́. The order of combining marks may be arbitrary, meaning that there could be many ways of representing the ậç̇ė́ variable (two by two by two gives eight, but there are potentially more since distinct code points also exist for "half-accented" characters like ç) .

No, I think I very much appreciate the normalisation that happens to Python identifiers :-)

⁽¹⁾ From the Python docs about identifiers:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

⁽²⁾ You can think of graphemes as the basic unit of writing (like a letter), similar to phonemes being the basic unit of speech (like a sound). So the English grapheme c has at least two phonemes, the hard-c in cook and the soft-c in ice.

And, making matters even more complex, cook shows that there is one phoneme (hard-c) giving two separate graphemes, c and k.

Now think how much more complex it gets when you introduce every other language on the planet, I'm surprised the members of the Unicode consortium don't go absolutely insane :-)

Thank you for your attention and the helpful response. Would you please be so kind to explain how NFKC normalization works on single Unicode code points (I seem to understand the sense of normalization for comparison purposes in case the same character can be be specified in different ways) changing their values to another ones and what should it be good for? — Claudio, Nov 29 '22 at 17:01
@Claudio A string with one code point is normalized just like a string with 5000. Length doesn't matter. — Shawn, Nov 29 '22 at 18:34
@Shawn to my knowledge it isn't true that normalizing each code point in a 5000 code points string will result in a normalized string with 5000 code points, because normalizing usually works on more than one code point to come up with another order or another amount of code points depending on which kind of normalization is used. — Claudio, Nov 29 '22 at 19:45
@Claudio, if you look at https://www.compart.com/en/unicode/U+004b and click on some of those letters that are based on K (generally the ones without extra marks or accents), you'll find quite a few show their decomposition as the ASCII K. The length of the string doesn't really come into play here. — paxdiablo, Nov 30 '22 at 12:31

Python3 surprising behavior of identifier being a non-ASCII Unicode character

1 Answers1