I'd like to remove some characters from a string (either a byte string or a unicode string) using a regular expression like this:
pattern = re.compile(ur'\u00AE|\u2122', re.UNICODE)
If the characters are specified as unicode literals, the resulting regexp does not work properly on byte strings:
q = 'Canon\xc2\xae EOS 7D'
pattern.sub('', q) # 'Canon\xc2 EOS 7D'
If I convert the argument of the substitution to a unicode string first (decoding it as UTF-8), however, it works as expected:
pattern.sub('', unicode(q, 'utf-8')) # u'Canon EOS 7D'
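A minimal sketch of that workaround wrapped in a helper, assuming the byte strings are UTF-8 encoded. The name strip_marks and its encoding parameter are made up for illustration; the pattern is written as a plain u'' literal so the snippet also runs on Python 3:

```python
import re

# Same two characters as above: registered sign and trademark sign.
pattern = re.compile(u'\u00AE|\u2122', re.UNICODE)

def strip_marks(text, encoding='utf-8'):
    """Remove the two signs from a byte string or a unicode string.

    Byte strings are decoded first (assumed to be UTF-8) so the pattern
    is compared against code points rather than individual bytes; the
    result is encoded back so the return type matches the input type.
    """
    if isinstance(text, bytes):  # bytes is an alias for str on Python 2
        return pattern.sub(u'', text.decode(encoding)).encode(encoding)
    return pattern.sub(u'', text)

print(repr(strip_marks(b'Canon\xc2\xae EOS 7D')))
print(repr(strip_marks(u'Canon\u00ae EOS 7D')))
```

With this, both forms of the input lose the whole character, not just the trailing \xae byte.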
Can someone please explain to me why this is the case?
Thanks,
Peter