Python - isalpha() returns True on unicode modifiers

Question

Why does u'\u02c7'.isalpha() return True when the symbol ˇ is not alphabetic? Does this method only work properly with ASCII characters?

[Category "Lm"](http://www.fileformat.info/info/unicode/char/02c7/index.htm) — Ignacio Vazquez-Abrams, Apr 01 '18 at 13:50

Martijn Pieters · Accepted Answer · 2018-04-01T14:02:18.040

5

U+02c7 CARON is a codepoint in the Lm (Modifier Letter) category, so according to the Unicode standard, it is alphabetic.

The documentation for str.isalpha() makes it clear what is included:

Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”.
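To see how that rule plays out, here is a minimal sketch that looks up the general category of a few characters with the standard `unicodedata` module. The helper name `describe` and the choice of sample characters are illustrative assumptions, not part of the documentation.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, isalpha result) for one character.

    `describe` is a hypothetical helper written for this illustration.
    """
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isalpha())


# The caron from the question, plus a few reference points.
for ch in ["\u02c7", "a", "\u00e4", "\u030c", "1"]:
    print("U+%04X" % ord(ch), describe(ch))
```

U+02C7 reports category `Lm`, so `isalpha()` is True for it, while U+030C COMBINING CARON reports `Mn` (a mark, not a letter), so `isalpha()` is False.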

You didn't define what you mean by work properly; clearly you have a different definition of what constitutes an alphabetic letter. If you only expected Latin-1 letters, then you also need to test whether the string can be encoded safely to Latin-1. There are exactly zero Lm-category codepoints in the Latin-1 subset of Unicode (and no Lt characters either, and only 2 Lo characters, ª (U+00AA) and º (U+00BA)).
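One way to combine the two tests might look like the sketch below. The function name `is_alpha_in` and its `encoding` parameter are assumptions made for illustration; passing `"ascii"` instead of the default narrows the check to ASCII letters.

```python
def is_alpha_in(text, encoding="latin-1"):
    """Return True if text is non-empty, all letters, and encodable in `encoding`.

    Hypothetical helper: str.isalpha() handles the "letter" part, and the
    encode attempt rejects anything outside the target character set.
    """
    if not text.isalpha():
        return False
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(is_alpha_in("M\u00fcller"))           # Latin-1 letters only
print(is_alpha_in("\u02c7"))                # Lm letter outside Latin-1
print(is_alpha_in("M\u00fcller", "ascii"))  # ü is not ASCII
```

Note that the Latin-1 variant still accepts ª and º, since both are Lo letters inside the Latin-1 range.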

edited Apr 01 '18 at 14:02

answered Apr 01 '18 at 13:50

Martijn Pieters


Yes, I didn't see that modifiers are also alphabetic letters, though it seems a bit strange to me. Was expecting ASCII letters, or at most unicode letters with umlauts and so on, to be defined as "alphabetic letters" – Kostya Apr 01 '18 at 14:18
@Kostya: You are thinking of Latin-1 letters only then. – Martijn Pieters Apr 01 '18 at 14:20
@Kostya: (There are Latin-derived Lm codepoints, but they live in the U+02B0 to U+02B8 range and beyond, outside the ISO-8859-1 / Latin-1 range of the Unicode standard.) – Martijn Pieters Apr 01 '18 at 14:22
The reason: you want to check whether a string is a word (made of alphabetic characters). Modifiers just modify such characters, but the outcome is still an (accented) word. Note: a lone combining character is not a valid Unicode string, so you should not hit such an extreme case. – Giacomo Catenazzi Apr 01 '18 at 16:53
@GiacomoCatenazzi: these are *not combining characters*. There are exactly zero codepoints that are both letters (category `L*`) and have a combining class other than 0 (so they modify other characters). Combining characters are **always** *marks* (category `M*`, *Combining Marks*), never any other category. See [D52 in section 3.6 of the Unicode conformance chapter](http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#G30602). – Martijn Pieters Apr 01 '18 at 17:32
@GiacomoCatenazzi: and just an isolated combining character *is* a valid Unicode string. It may not make much sense textually, but there are no rules *in Unicode* that state that combining characters can't live on their own. See [Can a combining character be used alone in Unicode?](//stackoverflow.com/q/38126512) and the same section D52: *There may be no such base character [...] In such cases, the combining characters are called __isolated combining characters__*. – Martijn Pieters Apr 01 '18 at 17:44
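The distinction drawn in these comments between a modifier letter and a combining mark can be checked with `unicodedata`. The sketch below is only an illustration; the constant names and the `summarize` helper are assumptions, not anything from the thread.

```python
import unicodedata

MODIFIER_CARON = "\u02c7"   # CARON, a spacing modifier letter
COMBINING_CARON = "\u030c"  # COMBINING CARON, attaches to a base character


def summarize(ch):
    """Return (general category, canonical combining class, isalpha result)."""
    return (unicodedata.category(ch), unicodedata.combining(ch), ch.isalpha())


print(summarize(MODIFIER_CARON))   # letter with combining class 0
print(summarize(COMBINING_CARON))  # mark with a non-zero combining class

# A base letter followed by the combining mark composes under NFC;
# the modifier letter never merges with its neighbour.
composed = unicodedata.normalize("NFC", "c" + COMBINING_CARON)
print(len(composed), composed.isalpha())
```

In practice this means a decomposed string such as "c" followed by U+030C fails `isalpha()` until it is normalized to NFC, whereas a string containing U+02C7 passes either way.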
