
Long story short:

>>> re.compile(r"\w*").match(u"Français")
<_sre.SRE_Match object at 0x1004246b0>
>>> re.compile(r"^\w*$").match(u"Français")
>>> re.compile(r"^\w*$").match(u"Franais")
<_sre.SRE_Match object at 0x100424780>
>>> 

Why doesn't the regex match the string containing Unicode characters once ^ and $ are added? As far as I understand, ^ stands for the beginning of the string (line) and $ for the end of it.

tchrist
ak.

1 Answer


You need to specify the UNICODE flag (re.U); otherwise, in Python 2, \w is equivalent to [a-zA-Z0-9_], which does not include the character 'ç'.

>>> re.compile(r"^\w*$", re.U).match(u"Fran\xe7ais")
<_sre.SRE_Match object at 0x101474168>
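
For readers on Python 3, here is a minimal sketch of the same effect. It assumes Python 3, where str patterns are Unicode-aware by default, so re.ASCII is used to stand in for the Python 2 default behaviour shown in the question. The variable names are only illustrative.

```python
import re

text = "Fran\u00e7ais"  # "Français"

# re.ASCII restricts \w to [a-zA-Z0-9_], which mimics Python 2 without re.U.
# The anchored pattern must consume the whole string, and it cannot get past "ç".
ascii_anchored = re.compile(r"^\w*$", re.ASCII).match(text)

# Without re.ASCII, a Python 3 str pattern treats "ç" as a word character.
unicode_anchored = re.compile(r"^\w*$").match(text)

print(ascii_anchored)                # None
print(unicode_anchored is not None)  # True
```
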
kennytm
  • Why does this work then: `>>> re.compile(r"\w*").match(u"Français")`? – ak. Aug 31 '10 at 08:37
  • @ak: Are you sure the match returns `Français` instead of `Fran` with it? Note that without the `$` the regex won't match until the end. – kennytm Aug 31 '10 at 08:38
  • `\w*` will match absolutely anything. `*` matches 0 or more times. – Turtle Aug 31 '10 at 08:39
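
The points made in these comments can be checked by inspecting what the unanchored match actually captured. This is a small sketch assuming Python 3, with re.ASCII standing in for the Python 2 default used in the question. The names `partial` and `empty` are only illustrative.

```python
import re

word = "Fran\u00e7ais"  # "Français"

# Without ^ and $, match() succeeds as soon as a prefix fits the pattern.
# In ASCII mode the run of word characters stops right before "ç".
partial = re.compile(r"\w*", re.ASCII).match(word)
print(repr(partial.group()))  # 'Fran'

# Because * allows zero repetitions, \w* even "matches" a string
# that starts with no word characters at all, returning an empty match.
empty = re.compile(r"\w*", re.ASCII).match("!!!")
print(repr(empty.group()))  # ''
```

So a match object being returned does not mean the whole string matched; comparing `group()` with the input (or anchoring the pattern) is what tells the two cases apart.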