
I have a project where I need to "replace all non-ASCII characters (in a html) with ASCII equivalents wherever it is possible".

I am just wondering: are characters in the title non-ascii or ascii?

If they are non-ASCII, how do I convert them into ASCII using Python? Thanks!

sobolevn
Jay
  • They're non-ascii. Just look at an ascii-table - none of these characters are listed there – Eric May 24 '15 at 18:09
  • Try searching Google for ASCII and paying attention to character codes (numeric values). There's your answer. – elixenide May 24 '15 at 18:11
  • It's not clear to me that your title correctly represents your task. Since html can contain the *string* `&mdash;` and all seven of those characters are ascii, are you sure you need to replace anything there? – Stefan Pochmann May 24 '15 at 18:27

1 Answer


Some of them are ASCII, some aren't. You can look up the meanings here for HTML 4 (or similar URLs for HTML5, XHTML 4, etc.). That table gives you the Unicode code point for each entity; Unicode code points 0-127 correspond to ASCII characters 0-127, and Unicode code points 128+ are non-ASCII.
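The code point rule above can also be checked from Python instead of a table. This is only a sketch: it relies on the standard library's `html.entities.name2codepoint` mapping (HTML 4 entity names), and `entity_is_ascii` is a hypothetical helper name, not something from the question.

```python
from html.entities import name2codepoint

def entity_is_ascii(name):
    """Return True if the named HTML entity maps to a code point in 0-127."""
    return name2codepoint[name] < 128

# Print the code point and the ASCII verdict for a few entity names.
for name in ("mdash", "ndash", "sect", "nbsp", "quot"):
    print(name, hex(name2codepoint[name]), entity_is_ascii(name))
```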

For the ones that are non-ASCII, you have to decide what to replace them with before you can write code to replace them.

In particular:

  • &mdash; is —, U+2014, non-ASCII, usually replaced by --.
  • &ndash; is –, U+2013, non-ASCII, usually replaced by -.
  • &sect; is §, U+00A7, non-ASCII; no common replacement, so you'll have to pick something, maybe "sect. "?
  • &nbsp; is a non-breaking space, U+00A0, non-ASCII, usually replaced by a space.
  • &quot; is ", U+0022, already ASCII.

One way to substitute these is by using the str.replace method. For example:

h = h.replace('&mdash;', '--').replace('&ndash;', '-')
h = h.replace('&sect;', 'sect. ').replace('&nbsp;', ' ')
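As a runnable illustration of the replace approach, here is a small sketch on a made-up input string, with the entity names spelled out literally. The sample text and the `replacements` dict are assumptions for the demo; the replacement values are just the ones suggested in the list above.

```python
# Hypothetical sample input containing entity references as plain text.
h = 'Intro&nbsp;&mdash; see &sect;2, pages 3&ndash;5, &quot;Notes&quot;'

# Entity reference -> chosen ASCII replacement.
replacements = {
    '&mdash;': '--',
    '&ndash;': '-',
    '&sect;': 'sect. ',
    '&nbsp;': ' ',
}
for entity, ascii_text in replacements.items():
    h = h.replace(entity, ascii_text)

print(h)  # &quot; is left alone because it is already ASCII
```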

However, I think you'd be better off converting to unescaped Unicode, then using str.translate (or unicode.translate, if this is Python 2.x) to map the characters. A translation table gets a lot simpler (and more efficient, if that matters) than a long chain of replace calls once you have more than about 4 characters to deal with. And that way, you'll also handle things like unescaped em-dashes, or other characters you hadn't noticed. For example:

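import html  # html.unescape needs Python 3.4+

# Turn entity references into the characters they name
h = html.unescape(h)
# Map each non-ASCII code point to its chosen ASCII replacement
table = {0x2013: '-', 0x2014: '--', 0x00A7: 'sect. ', 0x00A0: ' '}
h = h.translate(table)
h.encode('ascii')  # forces an exception if you missed any non-ASCII chars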
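Wrapped up as a function, the same idea might look like the sketch below. `to_ascii` and `ASCII_TABLE` are hypothetical names, the sample string is invented, and the replacement choices are only the ones from the list above; anything else non-ASCII makes it raise rather than guess.

```python
import html

# Assumed replacement choices; extend this table for other characters.
ASCII_TABLE = {0x2013: '-', 0x2014: '--', 0x00A7: 'sect. ', 0x00A0: ' '}

def to_ascii(text):
    """Unescape entities, map known characters, and fail on anything left over."""
    text = html.unescape(text).translate(ASCII_TABLE)
    text.encode('ascii')  # raises UnicodeEncodeError if a non-ASCII char remains
    return text

# Mix of entity references and a literal en dash (U+2013).
sample = 'A&mdash;B \u2013 &sect;3&nbsp;&quot;x&quot;'
print(to_ascii(sample))
```

Note that `html.unescape` also turns `&lt;`, `&amp;` and `&quot;` into the raw characters, so on a full HTML document you would probably want to apply this to text content only, or re-escape afterwards.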
abarnert
  • I think you just did the assignment for the OP – Padraic Cunningham May 24 '15 at 18:43
  • @PadraicCunningham: Fortunately, if this really is a homework assignment and he's too clueless or lazy to even start on the problem himself or try to understand the answer, I'd say there's a good chance he'll fail because he tries to run it on bytes rather than unicode and it doesn't work, or because he doesn't know how to extend from "such characters as…" to all of the characters the assignment includes… (While if he's really trying to solve the problem, he should hopefully have no trouble.) – abarnert May 24 '15 at 18:53
  • Another possible interpretation is "which of these characters should be converted into entities", and another, are the entity codes themselves ASCII (which of course they are). Nominating to close as unclear, but definitely upvote this attempted answer. – tripleee May 24 '15 at 19:49
  • @tripleee: Your first alternate interpretation doesn't seem very likely (why not just call `html.escape`, and who cares if it entity-ifies a few things that didn't need to be? and besides, things like `"` often _do_ need to be entity-ified even though they're already ASCII…), but your second one, yeah, you're right, that could definitely be a way to read this question. – abarnert May 24 '15 at 19:52