Matching Unicode word boundaries in Python

Question

To match Unicode word boundaries (as defined in Unicode Standard Annex #29) in Python, I have been using the regex package with the flags regex.WORD | regex.V1 (regex.UNICODE should be the default, since the pattern is a Unicode string) as follows:

>>> import regex
>>> s = "here are some words"
>>> regex.findall(r'\w(?:\B\S)*', s, flags = regex.V1 | regex.WORD)
['here', 'are', 'some', 'words']

It works well in these rather simple cases. However, I was wondering what the expected behavior is when the input string contains certain punctuation. As I read it, WB7 says that, for example, the apostrophe in x'z does not qualify as a word boundary, which does indeed seem to be the case:

>>> regex.findall(r'\w(?:\B\S)*', "x'z", flags = regex.V1 | regex.WORD)
["x'z"]

However, if there is a vowel, the situation changes:

>>> regex.findall(r'\w(?:\B\S)*', "l'avion", flags = regex.V1 | regex.WORD)
["l'", 'avion']

This suggests that the regex module implements rule WB5a, which the standard mentions in its Notes section. However, that rule also says the behavior should be the same with \u2019 (right single quotation mark), which I can't reproduce:

>>> regex.findall(r'\w(?:\B\S)*', "l\u2019avion", flags = regex.V1 | regex.WORD)
['l’avion']

Moreover, even with "normal" apostrophe, a ligature (or y) seems to behave as a "non-vowel":

>>> regex.findall(r'\w(?:\B\S)*', "l'œil", flags = regex.V1 | regex.WORD)
["l'œil"]
>>> regex.findall(r'\w(?:\B\S)*', "J'y suis", flags = regex.V1 | regex.WORD)
["J'y", 'suis']

Is this the expected behavior? (all examples above were executed with regex 2.4.106 and Python 3.5.2)

revo · Accepted Answer · 2016-08-28T08:22:05.330

1- RIGHT SINGLE QUOTATION MARK ’ appears to have simply been overlooked in the source file; the WB5a check only tests for the ASCII apostrophe:

/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
  is_unicode_vowel(char_at(state->text, text_pos)))
    return TRUE;
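To make the effect of that comparison concrete, here is a minimal Python sketch of the same check. It is not the regex module's code: the function name wb5a_break, the apostrophes parameter and the plain_vowel stand-in are all hypothetical, and it assumes pos_m1 in the C snippet is the position just before text_pos.

```python
APOSTROPHE = "'"
RIGHT_SINGLE_QUOTE = "\u2019"


def wb5a_break(text, pos, is_vowel, apostrophes=(APOSTROPHE,)):
    """Return True if a break is forced between text[pos - 1] and text[pos].

    Mirrors the shape of the C check: the previous character must be an
    apostrophe and the current character must be a vowel. The default
    `apostrophes` tuple reproduces the ASCII-only comparison.
    """
    return (
        0 < pos < len(text)
        and text[pos - 1] in apostrophes
        and is_vowel(text[pos])
    )


def plain_vowel(ch):
    # Hypothetical stand-in for the module's vowel test, for illustration only.
    return ch.lower() in "aeiou"


both = (APOSTROPHE, RIGHT_SINGLE_QUOTE)
print(wb5a_break("l'avion", 2, plain_vowel))             # ASCII apostrophe
print(wb5a_break("l\u2019avion", 2, plain_vowel))        # U+2019, ASCII-only check
print(wb5a_break("l\u2019avion", 2, plain_vowel, both))  # U+2019 accepted too
```

With the ASCII-only default, the U+2019 case never forces a break, which matches the ['l’avion'] result in the question; passing both characters makes the two spellings behave alike.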

2- Unicode vowels are determined by the is_unicode_vowel() function, whose case labels correspond to this list (after lowercasing):

a, à, á, â, e, è, é, ê, i, ì, í, î, o, ò, ó, ô, u, ù, ú, û

So LATIN SMALL LIGATURE OE œ is not considered a Unicode vowel (and neither is y):

Py_LOCAL_INLINE(BOOL) is_unicode_vowel(Py_UCS4 ch) {
#if PY_VERSION_HEX >= 0x03030000
    switch (Py_UNICODE_TOLOWER(ch)) {
#else
    switch (Py_UNICODE_TOLOWER((Py_UNICODE)ch)) {
#endif
    case 'a': case 0xE0: case 0xE1: case 0xE2:
    case 'e': case 0xE8: case 0xE9: case 0xEA:
    case 'i': case 0xEC: case 0xED: case 0xEE:
    case 'o': case 0xF2: case 0xF3: case 0xF4:
    case 'u': case 0xF9: case 0xFA: case 0xFB:
        return TRUE;
    default:
        return FALSE;
    }
}
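For readers who want to try the table without building the C extension, a rough Python port might look like the sketch below. It is an approximation under one assumption: str.lower() is used in place of Py_UNICODE_TOLOWER, and the two can differ for a few unusual characters. The name VOWELS is illustrative only.

```python
# Same code points as the case labels in the C switch statement.
VOWELS = frozenset(
    "aeiou"
    + "".join(
        chr(cp)
        for cp in (
            0xE0, 0xE1, 0xE2,  # à á â
            0xE8, 0xE9, 0xEA,  # è é ê
            0xEC, 0xED, 0xEE,  # ì í î
            0xF2, 0xF3, 0xF4,  # ò ó ô
            0xF9, 0xFA, 0xFB,  # ù ú û
        )
    )
)


def is_unicode_vowel(ch):
    """Approximate port of the C helper: lowercase, then look up in the table."""
    return ch.lower() in VOWELS


# Characters that follow the apostrophe in the question's examples.
for ch in ("a", "\u0153", "y", "z"):
    print(repr(ch), is_unicode_vowel(ch))
```

Only "a" passes the test here, which is consistent with "l'avion" being split while "l'œil", "J'y" and "x'z" are kept whole.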

This bug has been fixed in regex 2016.08.27, following a bug report. [_regex.c:#1668]

Point 1 sure looks like a bug, and I suggest you report it, if you haven't already. It's hard to say about point 2. Searching for "vowel" on unicode.org gives a number of hits for various Asian languages, but nothing about French or Italian. The OP's examples certainly seem correct, but I can't see that Annex 29 addresses them specifically. — saulspatz, Aug 24 '16 at 22:38
