Regular expression and unicode literals

Question

I'd like to remove some characters from a string (either byte string or unicode string) using a regular expression like this:

pattern = re.compile(ur'\u00AE|\u2122', re.UNICODE)

If the characters are specified as unicode literals, the resulting regexp does not work properly on a byte string.

q = 'Canon\xc2\xae  EOS  7D'
pattern.sub('', q)  # 'Canon\xc2  EOS  7D'

If I convert the argument of the substitution to a unicode string, however, it works as expected...

pattern.sub('', unicode(q))  # u'Canon  EOS  7D'

Can someone please explain to me why this is the case?

thanks,

Peter

score 2 · Accepted Answer · answered Nov 23 '11 at 16:28


Because a standard (byte) string is not a Unicode string. Python does not know what encoding it's in (or if it's even Unicode at all!), and so has no way to determine whether a particular Unicode character matches some character in it. The solution is to tell Python it's Unicode, using the unicode() function, as you have figured out.
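As a minimal sketch of that advice, the snippet below decodes the byte string explicitly before substituting. It assumes the bytes are UTF-8 (which the \xc2\xae sequence in the question suggests, but the question does not state), and strip_marks is a hypothetical helper name, not something from the question. The explicit encoding matters: in Python 2 a bare unicode(q) falls back to the default codec, normally ASCII, and would typically fail on non-ASCII bytes unless the default encoding has been changed. It also hints at why the original output kept a stray \xc2: the byte \xae has the same numeric value as U+00AE, so only that byte was matched.

```python
# -*- coding: utf-8 -*-
import re

# u'' literals with \u escapes are valid on Python 2 and on Python 3.3+.
pattern = re.compile(u'\u00AE|\u2122', re.UNICODE)


def strip_marks(raw, encoding='utf-8'):
    """Hypothetical helper: decode bytes to text, then remove the symbols.

    The 'utf-8' default is an assumption about the input, not a given.
    """
    # On Python 2, `bytes` is an alias for `str`, so this covers byte strings
    # there too; unicode/text input is passed through unchanged.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    return pattern.sub(u'', text)


q = b'Canon\xc2\xae  EOS  7D'
cleaned = strip_marks(q)
```

Decoding at the boundary and working only with text afterwards avoids mixing the two string types inside the regular expression.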


kindall


Thanks for the clarification - so the take away message is "don't mix str and unicode strings (unless you're Dutch)"? – Peter Prettenhofer Nov 24 '11 at 11:35
I think that, if you're Dutch, you just change the language so that all strings are Unicode by default. :-) – kindall Nov 28 '11 at 23:38
