Since it became possible to use Unicode characters in identifiers for classes, methods, and variables, I have been using them more and more. I don't know whether this is A Good Idea, but it makes the code more readable (e.g. you can now write import numpy as np; π = np.pi; area = r**2 * π!)

Now I noticed the following behaviour (in Python 3.8.5):

I can define a class A the following way:

>>> class A:
...     def x(self):
...         print('x')
...     def ξ(self):
...         print('ξ')
...     def yₓ(self):
...         print('yₓ')

and can access all methods:

>>> a = A()
>>> a.x()
x
>>> a.ξ()
ξ
>>> a.yₓ()
yₓ

The problem arises when I try to use getattr() to access them:

>>> attr = getattr(a, 'x')
>>> attr()
x
>>> attr = getattr(a, 'ξ')
>>> attr()
ξ
>>> attr = getattr(a, 'yₓ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'A' object has no attribute 'yₓ'

  1. Why does getattr(a, 'ξ') work, but getattr(a, 'yₓ') does not?

I noticed

>>> dir(a)
[…, 'x', 'yx', 'ξ']
  2. Why is 'ξ' kept, but 'yₓ' silently converted to 'yx'? Which characters are "safe" to use, so that getattr() succeeds?

  3. Is there a way to use yₓ?

BTW, yₓ can be used, but y₂ gives a SyntaxError: invalid character in identifier

  4. Why can't I use y₂ at all?

I know the workaround is to avoid these fancy characters, but some of them really make the code more readable (at least in my view!) …

sphh
  • Does this answer your question? [Identifier normalization: Why is the micro sign converted into the Greek letter mu?](https://stackoverflow.com/questions/34097193/identifier-normalization-why-is-the-micro-sign-converted-into-the-greek-letter) – Green Cloak Guy Dec 01 '20 at 15:23
  • As a general rule, ask yourself: "Is this character part of the native script for a non-English language?" If the answer is "no", tread carefully. – chepner Dec 01 '20 at 15:28

1 Answer

PEP 3131 defines non-ASCII identifiers. It states:

The entire UTF-8 string is passed to a function to normalize the string to NFKC

You can test this for yourself with unicodedata.normalize:

import unicodedata

unicodedata.normalize("NFKC", 'ξ')   # 'ξ'  (unchanged)
unicodedata.normalize("NFKC", 'yₓ')  # 'yx' (subscript x becomes plain x)
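
As the comments on this answer note, getattr() does not normalize its name argument, so one way to look up yₓ dynamically is to normalize the string yourself before the lookup. A minimal sketch, assuming a hypothetical helper named getattr_nfkc (it is not part of the standard library):

```python
import unicodedata


def getattr_nfkc(obj, name, *default):
    """Hypothetical helper: look up `name` after NFKC-normalizing it,
    mirroring what the parser does to identifiers in source code."""
    return getattr(obj, unicodedata.normalize("NFKC", name), *default)


class A:
    def yₓ(self):  # the parser stores this method under the name 'yx'
        return "called"


a = A()
result = getattr_nfkc(a, "yₓ")()  # looks up 'yx' and calls the method
```

The optional default is forwarded unchanged, so getattr_nfkc(a, "missing", None) behaves like the built-in three-argument getattr().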

NFKC has many rules, but you should be able to find safe characters with a loop that keeps only those that normalize to themselves.
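
Such a loop could look like the following sketch. It treats a name as "safe" when it is a valid identifier and NFKC leaves it unchanged. The function name is_getattr_safe and the sample characters are illustrative assumptions, not something prescribed by PEP 3131.

```python
import unicodedata


def is_getattr_safe(name):
    """Hypothetical check: True if `name` is a valid identifier that
    NFKC normalization leaves unchanged, so the string passed to
    getattr() matches the name the parser stored."""
    return name.isidentifier() and unicodedata.normalize("NFKC", name) == name


# Try a few candidate characters as the second character of an identifier.
candidates = ["ξ", "π", "ₓ", "₂", "\u00b5"]  # the last one is the micro sign
safe = [c for c in candidates if is_getattr_safe("y" + c)]
unsafe = [c for c in candidates if not is_getattr_safe("y" + c)]
```

A name can fail this check for two different reasons: yₓ is accepted by the parser but stored as yx, while y₂ is rejected as an identifier outright.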

Aplet123
  • This answer, together with the link provided by @green-cloak-guy in [the question's comment section](https://stackoverflow.com/questions/65093243/getattr-and-unicode-attributes#comment-115079147), explains what is going on: Python normalizes identifiers when parsing the script. Thus `.yₓ` actually becomes `.yx`, whereas `.ξ` stays `.ξ`. `getattr()` does not normalize the attribute string, resulting in this puzzling behaviour. Hence:

    ```python3
    class A:
        def yₓ(self): return 'yₓ'
        def yx(self): return 'yyxx'

    a = A()
    a.yₓ()
    ```

    returns `yyxx` and not `yₓ`, because both definitions end up under the name `yx` and the second one replaces the first. – sphh Dec 02 '20 at 18:56
  • I wonder why Python normalizes identifiers. Why doesn't it just use the Unicode character(s) as the identifier? – sphh Dec 02 '20 at 18:57