Trying to remove the EM DASH character "—" (â€”) in Python using regex

Question

I have a stubborn EM DASH character that I'm trying to remove using regex, but for some reason I can't get it to work. This is the code I'm using:

editedSource = re.sub(r'\u2014','',str(source))

What am I doing wrong here? I'm pretty sure I got the right character code. Here's the character:

—

and it shows up like

â€”

Thanks!

str converts something to a byte-string. You probably want to do the substitution on a *unicode* string. — David Ehrmann, Nov 26 '15 at 04:48
Handling unicode with Python 2 is tricky. Make sure `type(source)` is `unicode`, then don't use `str` on it. In Python 2, str is more like a byte array than a character string, so what you're seeing there are UTF-8 bytes printed as extended ASCII. — David Ehrmann, Nov 26 '15 at 04:52
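The effect described in the comments above can be reproduced with a short sketch. It is written for Python 3, which is an assumption since the question appears to use Python 2, and the variable names and the choice of Windows-1252 as the wrong codec are illustrative only.

```python
# Python 3 sketch; all names here are illustrative, not from the question.
em_dash = "\u2014"

# UTF-8 encodes U+2014 as three bytes.
raw = em_dash.encode("utf-8")

# Decoding those bytes with the wrong codec (Windows-1252 is assumed here)
# yields three unrelated characters instead of a single dash.
garbled = raw.decode("cp1252")

# Undoing the wrong decode step recovers the original character.
repaired = garbled.encode("cp1252").decode("utf-8")

print(len(raw), len(garbled), repaired == em_dash)
```

If the text really contains the three garbled characters, a pattern for U+2014 will never match it, which may be why the substitution appears to do nothing.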

Learner · Answer 1 · 2015-11-26T05:15:13.590

5

Prefix the regex pattern with u so that it is a unicode literal, and do not cast the unicode source to str with str().

>>> import re
>>> source = u'hello\u2014world'
>>> re.sub(ur'\u2014', '', source)
u'helloworld'

edited Nov 26 '15 at 05:15

answered Nov 26 '15 at 04:59

Learner

score 1 · Answer 2 · answered Nov 26 '15 at 04:52

>>> source = u'hello\u2014world'
>>> print source
hello—world
>>> import re
>>> re.sub(u'\u2014','',source)
u'helloworld'

Note that you can remove or replace individual unicode characters more efficiently with a translate mapping like this:

>>> source.translate({0x2014: None})
u'helloworld'
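For readers on Python 3, where every str is already unicode, the same approaches might look like the sketch below. The sample string and the variable names are assumptions chosen for illustration.

```python
import re

# Sample input containing U+2014 (EM DASH); illustrative only.
source = "hello\u2014world"

# Regex substitution: no u prefix is needed in Python 3.
via_regex = re.sub("\u2014", "", source)

# Plain string replacement, which avoids the regex engine entirely.
via_replace = source.replace("\u2014", "")

# Translation table mapping the code point to None, which deletes it.
via_translate = source.translate({0x2014: None})

print(via_regex, via_replace, via_translate)
```

For a single fixed character, replace or translate is usually enough. A regex is mainly useful when several dash-like characters need to be matched as a class.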

score 0 · Answer 3 · edited Dec 07 '17 at 03:25


The following code works in Python 3:

editedSource = re.sub('â€”', '', str(source))

edited Dec 07 '17 at 03:25

ekhumoro

answered Dec 07 '17 at 03:11

Terminator17
