To match Unicode word boundaries (as defined in Unicode Standard Annex #29) in Python, I have been using the regex package with the flags regex.WORD | regex.V1 (regex.UNICODE should be the default, since the pattern is a Unicode string) in the following way:
>>> import regex
>>> s = "here are some words"
>>> regex.findall(r'\w(?:\B\S)*', s, flags = regex.V1 | regex.WORD)
['here', 'are', 'some', 'words']
It works well in these rather simple cases. However, I was wondering what the expected behavior is when the input string contains certain punctuation. It seems to me that rule WB7 says that, for example, the apostrophe in x'z does not qualify as a word boundary, which indeed seems to be the case:
>>> regex.findall(r'\w(?:\B\S)*', "x'z", flags = regex.V1 | regex.WORD)
["x'z"]
However, if there is a vowel, the situation changes:
>>> regex.findall(r'\w(?:\B\S)*', "l'avion", flags = regex.V1 | regex.WORD)
["l'", 'avion']
This would suggest that the regex module implements rule WB5a, mentioned in the Notes section of the standard. However, this rule also says that the behavior should be the same with \u2019 (right single quotation mark), which I can't reproduce:
>>> regex.findall(r'\w(?:\B\S)*', "l\u2019avion", flags = regex.V1 | regex.WORD)
['l’avion']
Moreover, even with the "normal" apostrophe, a ligature (or y) seems to behave as a "non-vowel":
>>> regex.findall(r'\w(?:\B\S)*', "l'œil", flags = regex.V1 | regex.WORD)
["l'œil"]
>>> regex.findall(r'\w(?:\B\S)*', "J'y suis", flags = regex.V1 | regex.WORD)
["J'y", 'suis']
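To rule out a simple mix-up of code points, here is a minimal sketch that prints which characters are actually involved in the examples above. It uses only the standard unicodedata module. The names SAMPLES and describe are mine, introduced purely for illustration. It only identifies the characters and does not by itself explain how regex classifies them:

```python
import unicodedata

# Characters that appear in the examples in this question.
SAMPLES = ["'", "\u2019", "\u0153", "y"]

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))

for ch in SAMPLES:
    print(describe(ch))
```

This reports U+0027 APOSTROPHE and U+2019 RIGHT SINGLE QUOTATION MARK as two distinct punctuation characters, and both œ (LATIN SMALL LIGATURE OE) and y as lowercase letters, so the inputs are what I intended them to be.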
Is this the expected behavior? (All examples above were executed with regex 2.4.106 and Python 3.5.2.)