Special letter characters
Python 3
If you're using Python 3, you might not have to do anything. For str patterns, \w
already matches many "special" letters by default:
>>> import re
>>> re.findall(r'\w', 'üäößéÅßêèiìí')
['ü', 'ä', 'ö', 'ß', 'é', 'Å', 'ß', 'ê', 'è', 'i', 'ì', 'í']
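Keep in mind that \w also matches digits and the underscore. If only letters are wanted, one common idiom is the class [^\W\d_]. The following is a minimal sketch assuming Python 3; the sample string and variable names are made up for illustration, and re.ASCII is shown only to contrast with the default Unicode behaviour.

```python
import re

sample = 'üß_9Åi'  # hypothetical sample text

# \w matches letters, digits and the underscore
word_chars = re.findall(r'\w', sample)

# [^\W\d_] reads as "a word character that is neither a digit nor an underscore"
letters_only = re.findall(r'[^\W\d_]', sample)

# re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_only = re.findall(r'\w', sample, re.ASCII)

print(word_chars)
print(letters_only)
print(ascii_only)
```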
Python 2.7
In Python 2.7, \w matches only i by default (byte string, no flags):
>>> import re
>>> re.findall('\w', 'üäößéÅßêèiìí')
['i']
You could use the re.UNICODE flag together with a unicode string:
# encoding: utf-8
import re
any_char = re.compile('\w', re.UNICODE)
re.findall(any_char, u'üäößéÅßêèiìí')
# [u'\xfc', u'\xe4', u'\xf6', u'\xdf', u'\xe9', u'\xc5', u'\xdf', u'\xea', u'\xe8', u'i', u'\xec', u'\xed']
for x in re.findall(any_char, u'üäößéÅßêèiìí'):
    print x
# ü
# ä
# ö
# ß
# é
# Å
# ß
# ê
# è
# i
# ì
# í
Any special character
Specifying Unicode ranges might simplify your regex. As an example, this regex matches any character in the Unicode "Arrows" block (U+2190 to U+21FF):
>>> import re
>>> arrows = re.compile(r'[\u2190-\u21FF]')
>>> re.findall(arrows, "a⇸b⇙c↺d↣e↝f")
['⇸', '⇙', '↺', '↣', '↝']
In Python 2, both the pattern and the input need to be unicode strings:
>>> import re
>>> arrows = re.compile(ur'[\u2190-\u21FF]')
>>> re.findall(arrows, u"a⇸b⇙c↺d↣e↝f")
[u'\u21f8', u'\u21d9', u'\u21ba', u'\u21a3', u'\u219d']
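The same range idea extends to other Unicode blocks. As a rough Python 3 sketch, the helper below builds a character class from two code points; range_class, box_drawing and the sample text are hypothetical names chosen for this illustration, not part of the re module.

```python
import re

def range_class(first, last):
    """Compile a character class covering code points first..last (inclusive)."""
    # re.escape guards against endpoints that are special inside [...]
    return re.compile('[{}-{}]'.format(re.escape(chr(first)),
                                       re.escape(chr(last))))

arrows = range_class(0x2190, 0x21FF)       # "Arrows" block
box_drawing = range_class(0x2500, 0x257F)  # "Box Drawing" block

text = 'a⇸b┼c↺d'
print(arrows.findall(text))
print(box_drawing.findall(text))
print(arrows.sub('->', text))
```

Building the class from integers avoids typing \u escapes by hand, at the cost of having to look up the block boundaries yourself.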