
Not only ASCII, but also other Unicode characters can be used as names in Python. For example:

my_variable = 'var 1'  # Normal ASCII characters as name
我的变量 = 'var 2'  # Chinese characters as name
print(my_variable)
print(我的变量)

The code above generates output normally:

var 1
var 2

But among CJK (Chinese, Japanese, Korean) characters there is a set of special full-width characters that look like ASCII characters but have entirely different code points. For example:

  • Character A is UTF-8 \x41 and Unicode \u0041,
  • Character Ａ is UTF-8 \xEF\xBC\xA1 and Unicode \uFF21.

To a human, A and Ａ look similar, but to a computer they are entirely different characters.
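This is easy to verify from the byte representations, using nothing but the standard library:

```python
# Compare the ASCII 'A' (U+0041) with the full-width 'Ａ' (U+FF21):
ascii_a = 'A'
fullwidth_a = 'Ａ'

print(ascii_a.encode('utf-8'))      # b'A' (single byte, 0x41)
print(fullwidth_a.encode('utf-8'))  # b'\xef\xbc\xa1' (three bytes)
print(hex(ord(fullwidth_a)))        # 0xff21
```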

Based on this understanding, I thought the following code:

my_var = 'var 1'  # Name with normal ASCII characters
ｍｙ＿ｖａｒ = 'var 2'  # Name with CJK full-width characters

print(my_var)
print(ｍｙ＿ｖａｒ)

would print 'var 1' and 'var 2', but the actual result is:

var 2
var 2

Printing locals() shows that Python automatically converted the CJK full-width characters to the corresponding ASCII characters.
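The conversion observed here is NFKC normalization: the Python Language Reference ("Identifiers and keywords") states that identifiers are converted to the normal form NFKC while parsing. A small sketch reproducing the effect on plain strings with the standard unicodedata module:

```python
import unicodedata

# Python normalizes source-code identifiers to NFKC while parsing.
# We can reproduce that mapping manually on the raw strings:
ascii_name = 'my_var'
fullwidth_name = 'ｍｙ＿ｖａｒ'  # full-width letters and full-width underscore

print(ascii_name == fullwidth_name)                                 # False: distinct strings
print(unicodedata.normalize('NFKC', fullwidth_name) == ascii_name)  # True: NFKC folds them together
```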

My questions are:

  • Why does Python convert them automatically? Is there a PEP or issue discussion about that? I searched but couldn't find an answer.
  • Does Python automatically convert such characters in other areas? I've tested that in a dict, 'my_var' and 'ｍｙ＿ｖａｒ' are different keys, but what else?
  • Is this normal behavior in programming language design, for example in C, Java, JavaScript, PHP, etc.?
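Regarding the dict observation: only identifiers in source code are normalized; string literals are left untouched by the parser, so the two spellings remain distinct dict keys. A minimal sketch:

```python
# String literals are NOT normalized, so as dict keys the ASCII and
# full-width spellings stay distinct:
d = {'my_var': 1, 'ｍｙ＿ｖａｒ': 2}
print(len(d))   # 2
print(list(d))  # ['my_var', 'ｍｙ＿ｖａｒ']
```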

Although I've never used CJK full-width characters as variable names in my daily programming, I want to know how Python deals with them: in what circumstances 'my_var' and 'ｍｙ＿ｖａｒ' are considered the same, and in what circumstances they are not.
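One way to see the boundary is attribute access: an attribute name written in source code is an identifier (normalized at parse time), while a name passed to setattr() as a string is not. A sketch illustrating the difference (the class name Box is just for illustration):

```python
class Box:
    pass

b = Box()
setattr(b, 'ｍｙ＿ｖａｒ', 'dynamic')  # raw string: stored un-normalized
b.my_var = 'source'                     # source-code identifier

# Attribute names in source are normalized while parsing, so the
# full-width spelling below reaches the ASCII 'my_var' attribute:
print(b.ｍｙ＿ｖａｒ)  # source
print(vars(b))         # two separate entries in __dict__
```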

Vespene Gas
  • I don't know the answer for Python specifically, but it's likely that it applies a well-known Unicode normalization algorithm. And if the double-width characters are treated as equivalent to their "normal" variants, then it's probably [NFKC or NFKD](https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c). The [report that defines those](https://unicode.org/reports/tr15/) is surprisingly readable, IMO, especially the introduction that gives a high-level understanding of what's happening. – Joachim Sauer Apr 19 '23 at 13:40
  • Java doesn't do any such normalization, as explained [in the JLS](https://docs.oracle.com/javase/specs/jls/se20/html/jls-3.html#jls-3.8): "Two identifiers are the same only if, after ignoring characters that are ignorable, the identifiers have the same Unicode character for each letter or digit. " (ignorable characters are some control characters and FORMAT type characters). – Joachim Sauer Apr 19 '23 at 13:47
  • This might be worth reading: https://docs.python.org/3/reference/lexical_analysis.html#identifiers – matszwecja Apr 19 '23 at 13:47
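As a quick check of the normalization forms the comments mention (using the standard unicodedata module): NFC preserves compatibility characters such as the full-width letters, while NFKC folds them to their ASCII equivalents:

```python
import unicodedata

fullwidth_a = 'Ａ'  # FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)

print(unicodedata.normalize('NFC', fullwidth_a) == fullwidth_a)  # True: NFC leaves it alone
print(unicodedata.normalize('NFKC', fullwidth_a))                # A
```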

0 Answers