How do I split a unicode string on code points in Python (e.g. \u00B7 or \u2022)?

Question

I tried everything I could think of...

1. unicode_obj.split('\u2022')
2. re.split(r'\u2022', unicode_object)
3. re.split(r'(?iu)\u2022', unicode_object)

Nothing worked.

The problem is that I want to split on special characters.

example string : u'<special char like middot:\u00b7 or bullet:\u2022> sdfhsdf <repeat special char> sdfjhdgndujhfsgkljng <repeat special char> ... etc'

Please help.

Thanks in advance.

What if you split using `u'\u2022'`? – kennytm Dec 03 '11 at 14:54

score 8 · Accepted Answer · answered Dec 03 '11 at 14:57

Consider:

>>> print '\u2022'
\u2022
>>> print len('\u2022')
6
>>> import unicodedata
>>> map(unicodedata.name, '\u2022'.decode('ascii'))
['REVERSE SOLIDUS', 'LATIN SMALL LETTER U', 'DIGIT TWO', 'DIGIT ZERO', 'DIGIT TWO', 'DIGIT TWO']
>>>

vs:

>>> print u'\u2022'
•
>>> print len(u'\u2022')
1
>>> map(unicodedata.name, u'\u2022')
['BULLET']
>>>

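This should make the difference between text.split('\u2022') and text.split(u'\u2022') clear. In Python 2, a plain string literal does not interpret \u escapes, so '\u2022' is six ASCII characters (a backslash, 'u', and four digits), while u'\u2022' is the single BULLET character. Split on the unicode literal.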
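For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where every str literal is already Unicode, so the u prefix is optional and a raw literal stands in for the Python 2 byte string to show the six-character pitfall. The sample text and the variable names are made up for illustration.

```python
import re
import unicodedata

# Hypothetical sample modelled on the question: fields separated by a
# middle dot (U+00B7) or a bullet (U+2022).
text = "alpha \u00b7 beta \u2022 gamma"

# A raw literal keeps the backslash, so it is six characters, not one.
raw = r"\u2022"
bullet = "\u2022"
print(len(raw), len(bullet))     # 6 1
print(unicodedata.name(bullet))  # BULLET

# str.split with the six-character literal finds nothing to split on.
unsplit = text.split(raw)

# Split on a single code point.
by_bullet = text.split(bullet)

# Split on either separator with a character class, then trim whitespace.
fields = [part.strip() for part in re.split("[\u00b7\u2022]", text)]
print(fields)                    # ['alpha', 'beta', 'gamma']
```

The character class is built from a normal (non-raw) literal so that Python itself turns the escapes into the real separator characters before the pattern reaches re.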

Jean-Paul Calderone

Thank you for having the patience with me... My head was in quite a mess... Thank you very very much – aniketd Dec 03 '11 at 15:21
