
Not only ASCII, but also other Unicode characters can be used as names in Python. For example:

my_variable = 'var 1'  # Normal ASCII characters as name
我的变量 = 'var 2'  # Chinese characters as name
print(my_variable)
print(我的变量)

The code above generates output normally:

var 1
var 2

But among CJK (Chinese, Japanese, Korean) characters there is a set of special full-width characters that look like ASCII characters but have entirely different code points. For example:

  • Character A is UTF-8 \x41 and Unicode \u0041,
  • Character Ａ is UTF-8 \xEF\xBC\xA1 and Unicode \uFF21.

To a human, A and Ａ look similar, but to a computer they are entirely different characters.
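This is easy to verify from the byte representations, using nothing but the standard library:

```python
# Compare the ASCII 'A' (U+0041) with the full-width 'Ａ' (U+FF21):
ascii_a = 'A'
fullwidth_a = 'Ａ'

print(ascii_a.encode('utf-8'))      # b'A' (single byte, 0x41)
print(fullwidth_a.encode('utf-8'))  # b'\xef\xbc\xa1' (three bytes)
print(hex(ord(fullwidth_a)))        # 0xff21
```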

Based on this understanding, I thought the following code:

my_var = 'var 1'  # Name with normal ASCII characters
ｍｙ＿ｖａｒ = 'var 2'  # Name with CJK full-width characters

print(my_var)
print(ｍｙ＿ｖａｒ)

would print 'var 1' and 'var 2', but the actual result is:

var 2
var 2

Printing locals() shows that Python automatically converted the CJK full-width characters to the corresponding ASCII characters.
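The conversion observed here is NFKC normalization: the Python Language Reference ("Identifiers and keywords") states that identifiers are converted to the normal form NFKC while parsing. A small sketch reproducing the effect on plain strings with the standard unicodedata module:

```python
import unicodedata

# Python normalizes source-code identifiers to NFKC while parsing.
# We can reproduce that mapping manually on the raw strings:
ascii_name = 'my_var'
fullwidth_name = 'ｍｙ＿ｖａｒ'  # full-width letters and full-width underscore

print(ascii_name == fullwidth_name)                                 # False: distinct strings
print(unicodedata.normalize('NFKC', fullwidth_name) == ascii_name)  # True: NFKC folds them together
```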

My questions are:

  • Why does Python convert them automatically? Is there a PEP or issue discussion about that? I searched but couldn't find an answer.
  • Does Python automatically convert such characters in other areas? I've tested that in a dict, 'my_var' and 'ｍｙ＿ｖａｒ' are different keys, but what else?
  • Is this normal behavior in programming language design, for example in C, Java, JavaScript, PHP, etc.?
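Regarding the dict observation: only identifiers in source code are normalized; string literals are left untouched by the parser, so the two spellings remain distinct dict keys. A minimal sketch:

```python
# String literals are NOT normalized, so as dict keys the ASCII and
# full-width spellings stay distinct:
d = {'my_var': 1, 'ｍｙ＿ｖａｒ': 2}
print(len(d))   # 2
print(list(d))  # ['my_var', 'ｍｙ＿ｖａｒ']
```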

Although I've never used CJK full-width characters as variable names in my daily programming, I want to know how Python deals with them: in what circumstances 'my_var' and 'ｍｙ＿ｖａｒ' are considered the same, and in what circumstances they are not.
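One way to see the boundary is attribute access: an attribute name written in source code is an identifier (normalized at parse time), while a name passed to setattr() as a string is not. A sketch illustrating the difference (the class name Box is just for illustration):

```python
class Box:
    pass

b = Box()
setattr(b, 'ｍｙ＿ｖａｒ', 'dynamic')  # raw string: stored un-normalized
b.my_var = 'source'                     # source-code identifier

# Attribute names in source are normalized while parsing, so the
# full-width spelling below reaches the ASCII 'my_var' attribute:
print(b.ｍｙ＿ｖａｒ)  # source
print(vars(b))         # two separate entries in __dict__
```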

Vespene Gas
  • I don't know the answer for Python specifically, but it's likely that it applies a well-known Unicode normalization algorithm. And if the double-width characters are treated as equivalent to their "normal" variants, then it's probably [NFKC or NFKD](https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c). The [report that defines those](https://unicode.org/reports/tr15/) is surprisingly readable, IMO, especially the introduction that gives a high-level understanding of what's happening. – Joachim Sauer Apr 19 '23 at 13:40
  • Java doesn't do any such normalization, as explained [in the JLS](https://docs.oracle.com/javase/specs/jls/se20/html/jls-3.html#jls-3.8): "Two identifiers are the same only if, after ignoring characters that are ignorable, the identifiers have the same Unicode character for each letter or digit. " (ignorable characters are some control characters and FORMAT type characters). – Joachim Sauer Apr 19 '23 at 13:47
  • This might be worth reading: https://docs.python.org/3/reference/lexical_analysis.html#identifiers – matszwecja Apr 19 '23 at 13:47
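As a quick check of the normalization forms the comments mention (using the standard unicodedata module): NFC preserves compatibility characters such as the full-width letters, while NFKC folds them to their ASCII equivalents:

```python
import unicodedata

fullwidth_a = 'Ａ'  # FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)

print(unicodedata.normalize('NFC', fullwidth_a) == fullwidth_a)  # True: NFC leaves it alone
print(unicodedata.normalize('NFKC', fullwidth_a))                # A
```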

0 Answers