Not only ASCII, but also other Unicode characters can be used as names in Python. For example:
```python
my_variable = 'var 1'  # Normal ASCII characters as name
我的变量 = 'var 2'      # Chinese characters as name

print(my_variable)
print(我的变量)
```
The code above produces the expected output:

```
var 1
var 2
```
But CJK (Chinese, Japanese, Korean) text includes a set of full-width characters that look like ASCII characters but have completely different code points. For example:
- The character A is UTF-8 \x41 and Unicode \u0041,
- The character Ａ is UTF-8 \xEF\xBC\xA1 and Unicode \uFF21.
To a human, A and Ａ look similar, but to a computer they are entirely different characters.
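This difference is easy to verify with `ord()` and the standard-library `unicodedata` module (the full-width letter below is U+FF21):

```python
import unicodedata

ascii_a = "A"           # U+0041 LATIN CAPITAL LETTER A
fullwidth_a = "\uFF21"  # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A

# Different code points and different UTF-8 byte sequences
print(hex(ord(ascii_a)))              # 0x41
print(hex(ord(fullwidth_a)))          # 0xff21
print(ascii_a.encode("utf-8"))        # b'A'
print(fullwidth_a.encode("utf-8"))    # b'\xef\xbc\xa1'
print(unicodedata.name(fullwidth_a))  # FULLWIDTH LATIN CAPITAL LETTER A
print(ascii_a == fullwidth_a)         # False
```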
With this in mind, I thought the following code:

```python
my_var = 'var 1'       # Name with normal ASCII characters
ｍｙ＿ｖａｒ = 'var 2'  # Name with CJK full-width characters

print(my_var)
print(ｍｙ＿ｖａｒ)
```
would print 'var 1' and 'var 2', but the actual result is:

```
var 2
var 2
```
Judging from the output of locals(), Python seems to have automatically converted the CJK full-width characters to the corresponding ASCII characters.
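For what it's worth, the conversion I observed can be reproduced with NFKC normalization from the standard `unicodedata` module, which maps full-width letters (and the full-width low line ＿) to their ASCII equivalents, so this may be what the interpreter applies to names:

```python
import unicodedata

fullwidth_name = "ｍｙ＿ｖａｒ"  # full-width letters and underscore

# NFKC normalization collapses the full-width spelling to plain ASCII
normalized = unicodedata.normalize("NFKC", fullwidth_name)
print(normalized)              # my_var
print(normalized == "my_var")  # True
```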
My questions are:
- Why does Python convert them automatically? Is there a PEP or issue discussion about this? I searched but couldn't find an answer.
- Does Python automatically convert such characters anywhere else? I've tested that in a dict, 'my_var' and 'ｍｙ＿ｖａｒ' are different keys, but what else?
- Is this normal behavior in programming language design, for example in C, Java, JavaScript, PHP, etc.?
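To illustrate the dict observation above: the conversion apparently applies only to identifiers in source code, not to string values, so the two spellings stay distinct as dict keys while colliding as variable names (the `exec` call here is just a way to run the two assignments against a fresh namespace):

```python
# As string values, the two spellings are different objects and different keys
d = {"my_var": 1, "ｍｙ＿ｖａｒ": 2}
print(len(d))  # 2

# As identifiers, the full-width spelling is normalized at compile time,
# so both assignments write to the same name
ns = {}
exec("my_var = 'var 1'\nｍｙ＿ｖａｒ = 'var 2'", ns)
print(ns["my_var"])            # var 2 - the second assignment won
print("ｍｙ＿ｖａｒ" in ns)   # False - only the normalized name exists
```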
Although I've never used CJK full-width characters in variable names in my daily programming, I want to know how Python deals with such characters: in what circumstances 'my_var' and 'ｍｙ＿ｖａｒ' are considered the same, and in what circumstances they are not.