2

I have a stubborn em dash character that I'm trying to remove with a regex, but for some reason I can't get it to work. This is the code I'm using:

editedSource = re.sub(r'\u2014','',str(source))

What am I doing wrong here? I'm pretty sure I got the right character code. Here's the character:

and it shows up like —.

Thanks!

Nils
  • Is this Python 2 or 3? – David Ehrmann Nov 26 '15 at 04:47
  • str converts something to a byte-string. You probably want to do the substitution on a *unicode* string. – David Ehrmann Nov 26 '15 at 04:48
  • Handling unicode with Python 2 is tricky. Make sure `type(source)` is `unicode`, then don't use `str` on it. In Python 2, str is more like a byte array than a character string, so what you're seeing there are UTF-8 bytes printed as extended ASCII. – David Ehrmann Nov 26 '15 at 04:52
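
To make the comments above concrete, here is a minimal Python 3 sketch of how one em dash turns into several odd-looking characters. It assumes the bytes get misread as Windows-1252 (`cp1252`), which is only one common possibility; the actual encoding in the asker's setup is not known.

```python
# Python 3 sketch: how an em dash becomes "extended ASCII" noise.
em_dash = '\u2014'

# Encoding the single character to UTF-8 yields three bytes.
utf8_bytes = em_dash.encode('utf-8')

# Misreading those bytes as Windows-1252 (an assumption, chosen for
# illustration) produces three unrelated characters.
mojibake = utf8_bytes.decode('cp1252')

# Decoding with the correct codec restores the single character.
restored = utf8_bytes.decode('utf-8')

print(utf8_bytes)
print(len(mojibake), len(restored))
```

A pattern for U+2014 cannot match the misread text, because the em dash is no longer present in it as a single character.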

3 Answers

5

Prefix the regex pattern with u so that the regex engine treats it as unicode, and do not cast the unicode source to str.

>>> import re
>>> source = u'hello\u2014world'
>>> re.sub(ur'\u2014', '', source)
u'helloworld'
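
For anyone on Python 3, where the `ur` prefix is a syntax error, a rough equivalent might look like the sketch below. It assumes `source` is already a `str`; the variable names `plain` and `raw` are only for illustration.

```python
import re

source = 'hello\u2014world'

# In a normal string literal, Python itself turns \u2014 into the em dash
# before the pattern reaches the re module.
plain = re.sub('\u2014', '', source)

# In a raw string the backslash sequence reaches re untouched, and
# re (Python 3.3+) interprets the \u2014 escape itself.
raw = re.sub(r'\u2014', '', source)

print(plain, raw)
```

Both forms should give the same result, so on Python 3 the pattern is rarely the problem; the type of `source` usually is.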
Learner
1
>>> source = u'hello\u2014world'
>>> print source
hello—world
>>> import re
>>> re.sub(u'\u2014','',source)
u'helloworld'

Note that you can remove or replace individual unicode characters more efficiently with translate and a mapping like this:

>>> source.translate({0x2014: None})
u'helloworld'
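
The same mapping idea carries over to Python 3, where every `str` is unicode. The sketch below is only an illustration: the second character (an en dash replaced by a hyphen) and the sample string are made up to show that one table can delete and replace in a single pass.

```python
source = 'hello\u2014world \u2013 again'

# Keys are code points; None deletes the character, a string replaces it.
table = {0x2014: None, 0x2013: '-'}
cleaned = source.translate(table)

# str.maketrans builds an equivalent table from one-character string keys.
table_from_chars = str.maketrans({'\u2014': None, '\u2013': '-'})
cleaned_again = source.translate(table_from_chars)

print(cleaned)
```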
John La Rooy
0

The following code works in Python 3 (the em dash needs no backslash escape in the pattern):

editedSource = re.sub('—', '', str(source))
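
One caveat, sketched below under the assumption that `source` might arrive as `bytes` (for example from a file or socket read): in Python 3, `str()` on a bytes object returns its repr rather than decoded text, so the substitution silently does nothing. The helper `strip_em_dash` and its `encoding` parameter are hypothetical names used only for this example.

```python
import re

def strip_em_dash(source, encoding='utf-8'):
    """Remove em dashes, decoding bytes input first (hypothetical helper)."""
    if isinstance(source, bytes):
        # Decode explicitly; the right encoding is an assumption here.
        source = source.decode(encoding)
    return re.sub('\u2014', '', source)

raw_bytes = 'hello\u2014world'.encode('utf-8')

# str() on bytes gives the repr text, so there is no em dash left to match.
wrong = re.sub('\u2014', '', str(raw_bytes))

# Decoding first gives real text, and the substitution works.
right = strip_em_dash(raw_bytes)

print(wrong)
print(right)
```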
ekhumoro
Terminator17