Python re: don't split Unicode characters into surrogate pairs while matching

Question

Does anyone know whether it is possible to stop a regex from splitting code points into surrogate pairs while matching?

See the following example:

Current behavior:

$ te = u'\U0001f600\U0001f600'
$ flags1 = regex.findall(".", te, re.UNICODE)
$ flags1
>>> [u'\ud83d', u'\ude00', u'\ud83d', u'\ude00']

Desired behavior:

$ te = u'\U0001f600\U0001f600'
$ flags1 = regex.findall(".", te, re.UNICODE)
$ flags1
>>> [u'\U0001f600', u'\U0001f600']

The reason I need this is that I want to iterate over a Unicode string and get the next Unicode character on each iteration.

See example:

for char in regex.findall(".", te, re.UNICODE):
    print char

Thanks in advance.

Which version of Python? I suspect it works fine from Python 3.4 on. I just tested in Python 3.6 and yes, it works great. — Mark Ransom, Aug 17 '18 at 00:51
Same, I'm seeing a list containing two strings, each equal to a single smiley face emoji. — BallpointBen, Aug 17 '18 at 06:05

Mark Tolonen · Accepted Answer · 2018-08-17T06:19:02.713

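Use a regular expression that matches either a surrogate pair or any single character. Because the surrogate-pair alternative comes first, a pair is consumed as one match before . gets a chance to split it. This works in both wide and narrow builds of Python 2, but isn't needed in a wide build, since wide builds don't use surrogate pairs.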

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print re.findall(ur'[\ud800-\udbff][\udc00-\udfff]|.', te, re.UNICODE)
[u'A', u'\u5200', u'\U0001f600', u'\U0001f601', u'\u5100', u'Z']
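
The effect of the alternation can also be reproduced on a build that has no surrogate pairs. The following is a Python 3 sketch, not part of the original session: to_code_units is a hypothetical helper that emulates a narrow build by turning each UTF-16 code unit into its own string element, so the pattern has surrogates to work on.

```python
import re

# Same pattern as in the answer: try a surrogate pair first, then any character.
PAIR_OR_ANY = re.compile(r'[\ud800-\udbff][\udc00-\udfff]|.')
ANY = re.compile(r'.')


def to_code_units(text):
    """Hypothetical helper: emulate a narrow build by returning a str whose
    elements are the UTF-16 code units of `text` (surrogates included)."""
    raw = text.encode('utf-16-le')
    return ''.join(
        chr(int.from_bytes(raw[i:i + 2], 'little'))
        for i in range(0, len(raw), 2)
    )


narrow = to_code_units('A\U0001f600Z')

# ascii() is used because lone surrogates cannot be printed directly.
print(len(narrow))                                       # number of code units
print([ascii(m) for m in ANY.findall(narrow)])           # pair split in two
print([ascii(m) for m in PAIR_OR_ANY.findall(narrow)])   # pair kept together
```

With plain . the emoji comes back as two separate surrogates; with the pair-first pattern it comes back as a single two-unit match.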

This still works in Python 3, but it isn't needed there either: since Python 3.3 there are no wide or narrow builds, and Unicode strings no longer use surrogate pairs:

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print(re.findall(r'[\ud800-\udbff][\udc00-\udfff]|.', te))
['A', '刀', '😀', '😁', '儀', 'Z']

Works without the surrogate match:

>>> print(re.findall(r'.', te))
['A', '刀', '😀', '😁', '儀', 'Z']

And then you can just iterate normally in Python 3:

>>> for c in te:
...     print(c)
...
A
刀
😀
😁
儀
Z
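
Why plain iteration is enough in Python 3 can be checked by comparing code points with UTF-16 code units. A small sketch using only built-ins (the variable names are illustrative, not from the original answer):

```python
te = 'A\u5200\U0001f600\U0001f601\u5100Z'

# Iterating a str yields one element per code point, never half a pair.
code_points = ['U+%04X' % ord(c) for c in te]
print(code_points)

# The same text needs more UTF-16 code units than code points, because each
# character above U+FFFF is encoded as a surrogate pair in UTF-16.
utf16_units = len(te.encode('utf-16-le')) // 2
print(len(te), utf16_units)
```

The two emoji account for the difference: each is one code point but two UTF-16 code units.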

Note that there is still an issue with graphemes (combinations of Unicode code points that represent a single character). Here's a bad case:

>>> s = '‍‍‍'
>>> for c in s:
...     print(c)
...     


‍


‍


‍

The third-party regex module can match graphemes:

>>> import regex
>>> s = '‍‍‍'
>>> for c in regex.findall(r'\X', s):
...     print(c)
...     
‍‍‍

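If the regex module is not available, a rough approximation is possible with the standard library alone. The sketch below is not the full grapheme segmentation that \X performs: it only attaches combining marks (as reported by unicodedata.combining) to the preceding character, and it still splits emoji sequences joined with U+200D. The function name rough_clusters is hypothetical.

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    This is a simplification, not real grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


accented = 'e\u0301a'  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(accented), len(rough_clusters(accented)))

# Limitation: ZERO WIDTH JOINER has combining class 0, so a joined emoji
# sequence is still split into its individual code points.
joined = '\U0001f468\u200d\U0001f469'
print(len(rough_clusters(joined)))
```
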
Cool, thanks, looks great. Is there any possibility in Python 2.7 of adding grapheme support to the general pattern you gave at the beginning of your answer, r'[\ud800-\udbff][\udc00-\udfff]|.'? — Egor Savin, Aug 17 '18 at 09:43
And unfortunately your grapheme examples don't work in Python 2.7. — Egor Savin, Aug 17 '18 at 09:45
@Egor Graphemes are complicated and require knowledge of the latest Unicode standard. 2.7 is old and the regex library for it is out of date. — Mark Tolonen, Aug 17 '18 at 14:37
