I have a list of HTML pages that may contain certain encoded characters. Some examples:

<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>

I would like to decode (or unescape; I'm unsure of the correct terminology) these strings to:

<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>

Note that the HTML pages are held as strings. Also, I do NOT want to use any external library such as BeautifulSoup or lxml; only the Python standard library is OK.

Edit:

The solution below isn't perfect. Unescaping with HTMLParser after unquoting with urllib2 throws the following error in some cases:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)

Dexter

1 Answer

You need to do two things: unescape the HTML entities and URL-unquote the percent-encoded characters.
The Python 2 standard library has the HTMLParser and urllib2 modules to help with those tasks.

import HTMLParser, urllib2

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''

# URL-unquote first (%20 -> space), then unescape HTML entities (&#x40; -> @)
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"):
    print(line)

Result:

<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
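
For readers on Python 3, where the HTMLParser and urllib2 modules no longer exist under those names, the same two steps can be sketched roughly as follows. This assumes `html.unescape` and `urllib.parse.unquote` as the standard-library equivalents; the helper name `decode_markup` is made up for illustration.

```python
from html import unescape
from urllib.parse import unquote


def decode_markup(markup):
    """URL-unquote first, then unescape HTML entities (same order as above)."""
    return unescape(unquote(markup))


markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''

result = decode_markup(markup)
for line in result.split("\n"):
    print(line)
```

Keep in mind that unquoting the whole page also decodes any `%xx` sequence that happens to appear in ordinary text, not only inside `href` values.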

Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded declares its charset as cp-1252 (Windows-1252), so let's try decoding from that to Unicode:

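import codecs
import HTMLParser, urllib2

# Decode on input: read the cp1252 bytes into a Unicode string
with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
# Encode on output: write the Unicode result back out as cp1252
with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)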
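
The decode-on-input, encode-on-output principle can be sketched in Python 3 as below. This is only an illustration under stated assumptions: the sample bytes and the temporary file names are invented here, and the page is assumed to be cp1252, where the bytes 0x93 and 0x94 from the error messages are curly quotes.

```python
import os
import tempfile
from html import unescape
from urllib.parse import unquote

# Hypothetical sample page: 0x93 and 0x94 are curly quotes in cp1252.
raw = b'<em>\x93ada&#x40;graphics.maestro.com\x94</em>'

# Write the sample bytes to a temporary input file (stand-in for a real page).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.html")
dst = os.path.join(tmpdir, "out.html")
with open(src, "wb") as f:
    f.write(raw)

# Decode on input: bytes -> str using the page's declared charset.
with open(src, encoding="cp1252") as fin:
    decoded = fin.read()

result = unescape(unquote(decoded))

# Encode on output: str -> bytes in the same charset.
with open(dst, "w", encoding="cp1252") as fout:
    fout.write(result)

# Read the output back as raw bytes to inspect what was written.
with open(dst, "rb") as f:
    written = f.read()
```

The curly quotes survive the round trip because they are decoded with the page's own charset before any unescaping happens, rather than being run through the ASCII codec.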

Edit 2:
If you don't care about the non-ASCII characters, you can simplify a bit by dropping them on input:

with open(filename) as fin:
    # 'ignore' silently drops every byte that is not valid ASCII
    decoded = fin.read().decode('ascii', 'ignore')
...
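
To see what the 'ignore' error handler actually does, here is a small Python 3 sketch. The sample bytes are invented for illustration and again assume a cp1252 page with curly quotes (0x93 and 0x94).

```python
# Hypothetical sample: 0x93 and 0x94 are not valid ASCII bytes.
raw = b'<em>\x93ada&#x40;graphics.maestro.com\x94</em>'

# Strict ASCII decoding fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With 'ignore', the offending bytes are dropped instead of raising.
decoded = raw.decode("ascii", "ignore")
print(strict_failed)
print(decoded)
```

The trade-off is that the dropped characters are gone for good, so this only makes sense when the non-ASCII content really does not matter.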
mechanical_meat
  • While this solution looks good, it does throw a UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128) error in some places. – Dexter Mar 25 '12 at 01:04
  • Try using .encode('ascii') on the markup string before feeding it in. – Niall Byrne Mar 25 '12 at 01:15
  • @mcenley: if you post more detail about how you're getting your data we can provide encoding assistance. – mechanical_meat Mar 25 '12 at 01:15
  • @bernie I have a list of html pages downloaded. How should I send them to you? – Dexter Mar 25 '12 at 01:19
  • No no, I believe you. What we'd need is the encoding used for those pages, and how you're reading them. The principle is decode (to Unicode) on input, and encode on output. – mechanical_meat Mar 25 '12 at 01:22
  • @bernie I can upload my Python program on a gist. Can we take this discussion further on chat? – Dexter Mar 25 '12 at 01:26
  • @mcenley: perfect, thanks. I'll update my answer. Give me just a moment. – mechanical_meat Mar 25 '12 at 01:29
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/9266/discussion-between-mcenley-and-bernie) – Dexter Mar 25 '12 at 01:30
  • @NiallByrne The same issue persists - UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 149: ordinal not in range(128) – Dexter Mar 25 '12 at 01:38
  • Not much help to you mcenley- +1 to bernie for taking a look at the encoding of the example file. It's worth the pounding my rep took here to learn a bit about encoding :) +1 on what turned out to be a pretty interesting question. – Niall Byrne Mar 25 '12 at 01:48
  • @bernie Thanks Bernie! I sure must look into encoding a bit more. It's really confusing. – Dexter Mar 25 '12 at 01:58
  • @NiallByrne No issues mate. I hope to keep some more such interesting questions posted. – Dexter Mar 25 '12 at 01:59
  • mcenly: you're most welcome. @NiallByrne: you probably know this: I think you can vote to delete your answer and get the rep back if you're so inclined. Happy coding. – mechanical_meat Mar 25 '12 at 02:01
  • @mcenley: there are various ways to try to do that; however, due to subtle differences between encodings guessing is not typically recommended. – mechanical_meat Mar 25 '12 at 13:25
  • @bernie Thanks! Apologies to cross post but I'm facing an issue with this method of decoding for quotes. I have posted a new question here - http://stackoverflow.com/questions/9860400/accomodate-two-types-of-quotes-in-a-regex – Dexter Mar 25 '12 at 13:32