Some of them are ASCII, some aren't. You can look up the meanings here for HTML 4 (or similar URLs for HTML5, XHTML, etc.). That table gives you the Unicode code point for each entity; Unicode code points 0-127 correspond to ASCII characters 0-127, and Unicode code points 128+ are non-ASCII.
For the ones that are non-ASCII, you have to decide what to replace them with before you can write code to replace them.
In particular:

- `&mdash;` is `—`, U+2014, non-ASCII, usually replaced by `--`.
- `&ndash;` is `–`, U+2013, non-ASCII, usually replaced by `-`.
- `&sect;` is `§`, U+00A7, non-ASCII; no common replacement, so you'll have to pick something, maybe `'sect. '`?
- `&nbsp;` is a non-breaking space, U+00A0, non-ASCII, usually replaced by a space.
- `&quot;` is `"`, U+0022, already ASCII.
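If you'd rather not look these code points up by hand, the stdlib's `html.entities` module already maps entity names to code points. A quick sketch checking the values listed above:

```python
from html.entities import name2codepoint

# Print the Unicode code point for each entity name from the list above.
for name in ('mdash', 'ndash', 'sect', 'nbsp', 'quot'):
    print(name, hex(name2codepoint[name]))
```

This prints `0x2014`, `0x2013`, `0xa7`, `0xa0`, and `0x22`, matching the table.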
One way to substitute these is by using the `str.replace` method on the entity strings themselves. For example:

```python
h = h.replace('&mdash;', '--').replace('&ndash;', '-')
h = h.replace('&sect;', 'sect. ').replace('&nbsp;', ' ')
```
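On a small made-up sample, that chain would behave like this:

```python
# Hypothetical input containing the entities discussed above.
h = 'A&mdash;B, pp. 10&ndash;12'
h = h.replace('&mdash;', '--').replace('&ndash;', '-')
print(h)  # A--B, pp. 10-12
```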
However, I think you'd be better off unescaping the entities to Unicode characters first, then using `str.translate` (or `unicode.translate`, if this is Python 2.x) to map the characters. A translation table gets a lot simpler (and more efficient, if that matters) than a long chain of `replace` calls once you have more than about four characters to deal with. And that way, you'll also handle things like unescaped em dashes, or other characters you hadn't noticed. For example:
```python
import html  # html.unescape is available in Python 3.4+

h = html.unescape(h)
table = {0x2013: '-', 0x2014: '--', 0x00A7: 'sect. ', 0x00A0: ' '}
h = h.translate(table)
h.encode('ascii')  # raises UnicodeEncodeError if you missed any non-ASCII chars
```
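Putting it together on a made-up snippet: note that the second em dash below is a literal U+2014 character, not an entity, and the translate approach still catches it (the helper name `to_ascii` is just for illustration):

```python
import html

def to_ascii(h):
    # Unescape entities first, then map the troublesome code points to ASCII.
    h = html.unescape(h)
    table = {0x2013: '-', 0x2014: '--', 0x00A7: 'sect. ', 0x00A0: ' '}
    h = h.translate(table)
    h.encode('ascii')  # raises UnicodeEncodeError if anything slipped through
    return h

print(to_ascii('see &sect;&nbsp;4 &mdash; or \u2014 pp. 3&ndash;5'))
```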