Non-ASCII Python identifiers and reflectivity

Question

I have learnt from PEP 3131 that non-ASCII identifiers are supported in Python, though using them is not considered best practice.

However, I get this strange behaviour, where my identifier 𝜏 (U+1D70F) seems to be automatically converted to τ (U+03C4).

class Base(object):
    def __init__(self):
        self.𝜏 = 5 # defined with U+1D70F

a = Base()
print(a.𝜏)     # 5             # (U+1D70F)
print(a.τ)     # 5 as well     # (U+03C4) ? another way to access it?
d = a.__dict__ # {'τ':  5}     # (U+03C4) ? seems converted
print(d['τ'])  # 5             # (U+03C4) ? consistent with the conversion
print(d['𝜏'])  # KeyError: '𝜏' # (U+1D70F) ?! unexpected!

Is that expected behaviour? Why does this silent conversion occur? Does it have anything to do with NFKC normalization? I thought this was only for canonically ordering Unicode character sequences...

Does [defining an encoding](https://www.python.org/dev/peps/pep-0263/) make a difference? 03C4 is definitely the decomposition of 1D70F, and it looks from [the reference](https://docs.python.org/3/reference/lexical_analysis.html#identifiers) like some normalization happens. – jonrsharpe, Jan 02 '18 at 15:03
Your theory seems to be correct. It seems that the Python interpreter normalises your Unicode identifier as soon as it is assigned. If you put `print(dir(a))` after `a` has been assigned, you can see there is no trace of the U+1D70F character in the class. Your second print statement then works for the same reason (the identifier gets normalised), while your dictionary access fails because dictionaries can take anything as keys, and there would be no reason to normalise or otherwise alter a key that is a string in quotes. – Hannu, Jan 02 '18 at 15:06
@jonrsharpe Nope, defining `# -*- coding: utf-8 -*-` makes no difference. Maybe NFKC is responsible.. but I thought canonisation was just about *reordering*, not changing the actual character.. 8) – iago-lito, Jan 02 '18 at 15:13
@Hannu I guess you're right as well.. but it leads to a quite unexpected behaviour when it comes to indexing `__dict__`, don't you find? – iago-lito, Jan 02 '18 at 15:14
Not at all. As the answer explains, there is no automatic normalisation of string literals, and it would be completely inappropriate to do so anyway. – Hannu, Jan 02 '18 at 15:32

jonrsharpe · Accepted Answer · 2018-01-02T15:15:57.623

11

Per the documentation on identifiers:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

You can see that U+03C4 is the appropriate result using unicodedata:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', '𝜏')
'τ'
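
On the point raised in the question about normalization only reordering characters: as a rough sketch, the canonical form NFC leaves U+1D70F untouched, while the compatibility form NFKC replaces it with its compatibility decomposition. The variable names below are illustrative, and the characters are written as escapes so the snippet does not depend on how the page renders them.

```python
import unicodedata

math_tau = "\U0001D70F"   # MATHEMATICAL ITALIC SMALL TAU
greek_tau = "\u03C4"      # GREEK SMALL LETTER TAU

# The "<font>" tag marks a compatibility (not canonical) decomposition.
print(unicodedata.name(math_tau))
print(unicodedata.decomposition(math_tau))

# Canonical normalization (NFC) keeps the character as it is...
nfc_unchanged = unicodedata.normalize("NFC", math_tau) == math_tau

# ...whereas compatibility normalization (NFKC) maps it to plain tau.
nfkc_is_greek = unicodedata.normalize("NFKC", math_tau) == greek_tau

print(nfc_unchanged, nfkc_is_greek)
```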

However, this conversion doesn't apply to string literals, like the one you're using as a dictionary key, hence it's looking for the unconverted character in a dictionary that only contains the converted character.

self.𝜏 = 5  # implicitly converted to "self.τ = 5"
a.𝜏  # implicitly converted to "a.τ"
d['𝜏']  # not converted
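
The same effect can be reproduced without typing the character at all, which may help if an editor or terminal mangles it. This is only a sketch: it builds the source text with an escape and runs it through `exec`, so the parser sees U+1D70F in identifier position. The names `source` and `namespace` are made up for the illustration.

```python
# Source code containing U+1D70F as an attribute name, built via an escape.
source = (
    "class Base:\n"
    "    def __init__(self):\n"
    "        self.\U0001D70F = 5\n"
)

namespace = {}
exec(source, namespace)  # the parser normalizes identifiers to NFKC here

a = namespace["Base"]()
keys = list(vars(a))

# The stored attribute name is U+03C4, not the U+1D70F that was written.
print([hex(ord(ch)) for ch in keys[0]])
print("\U0001D70F" in vars(a))  # the string key itself is never normalized
```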

You can see similar problems with e.g. string literals used with getattr:

>>> getattr(a, '𝜏')
Traceback (most recent call last):
  File "python", line 1, in <module>
AttributeError: 'Base' object has no attribute '𝜏'
>>> getattr(a, unicodedata.normalize('NFKC', '𝜏'))
5
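
If dynamic attribute access with arbitrary names is needed, one possible approach is to normalize the name yourself before the lookup, mirroring what the parser does for identifiers. `normalized_getattr` below is a hypothetical helper, not something from the standard library.

```python
import unicodedata


def normalized_getattr(obj, name, *default):
    """Hypothetical helper: getattr() after NFKC-normalizing the name."""
    return getattr(obj, unicodedata.normalize("NFKC", name), *default)


class Base:
    def __init__(self):
        self.τ = 5  # written here with U+03C4, the form Python stores anyway


a = Base()

# Looking up by U+1D70F now succeeds because the name is normalized first.
value = normalized_getattr(a, "\U0001D70F")

# An optional default is passed through to getattr unchanged.
missing = normalized_getattr(a, "no_such_attribute", None)

print(value, missing)
```

The same normalization step could be applied to a key before indexing `__dict__`.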

edited Jan 02 '18 at 15:15

answered Jan 02 '18 at 15:10

jonrsharpe


Well, that's interesting. Cheers :) I'll keep thinking that it's a weird behaviour anyway. If `𝜏` was the only character I could access on my keyboard, I couldn't use python reflective `__dict__` or `getattr` features like anybody else.. Should I file this as a bug to python? – iago-lito Jan 02 '18 at 15:17
@iago-lito I'm not sure they'd consider it a bug, given that this is the documented behaviour. It certainly surprised me, though! And it makes dynamic attribute access (see the `getattr` example) a little more complex than initially expected. I guess this is why ASCII identifiers are still recommended; no more `from math import pi as π` for me! – jonrsharpe Jan 02 '18 at 15:18
I'll inform them anyway :) What's the best place to do so? – iago-lito Jan 02 '18 at 15:20
@iago-lito anything like that should go through https://bugs.python.org/; have a look around, there may be a similar issue logged already. – jonrsharpe Jan 02 '18 at 15:20
Great. [Here](https://bugs.python.org/issue32483) it is. Thanks again :) – iago-lito Jan 02 '18 at 16:09
@iago-lito I'd guess it'll get closed against e.g. https://bugs.python.org/issue13793 – jonrsharpe Jan 02 '18 at 16:13
[Crab](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Liocarcinus_marmoreus_2.jpg/290px-Liocarcinus_marmoreus_2.jpg)! Missed that one :\ You're right. – iago-lito Jan 02 '18 at 16:16
