I am extracting HTML from some webpage with Unicode characters as follows:

import urllib.request

def extract(url):
    """Adapted from Python3_Google_Search.py"""
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    # Decode the raw bytes so the function returns a str
    html = response.read().decode("utf8")
    return html

As you can see, I decode the response as UTF-8, so html is now a Unicode string. When I print html, I can see the Unicode characters.

I am using html.parser to parse the HTML and subclassed it:

from html.parser import HTMLParser
class Parser(HTMLParser):
  def __init__(self):
    ## some init stuff
  #### rest of class

When I parse the HTML through the class's handle_data, the Unicode characters seem to be removed: they simply disappear from the output. The docs do not mention anything about encodings. Why does HTMLParser remove non-ASCII characters, and how can I fix this?

darksky
  • what program/tool are you using to view the output? – mechanical_meat May 03 '13 at 16:59
  • 1. Are you 100% certain that the data your script receives has the characters there, and 2. how are you verifying that the non-ascii characters have 'disappeared'? – Martijn Pieters May 03 '13 at 16:59
  • I used Emacs in Terminal (has Unicode encoding on) and then again Mac TextEdit. – darksky May 03 '13 at 16:59
  • @MartijnPieters, when I print `html` before returning in the `extract` function, I see this: `&Ouml;sterreich`. So yes, I'm 100% certain my script received the right unicode characters. I am verifying that the unicode characters have disappeared by opening the text file I wrote out to and seeing them not there. – darksky May 03 '13 at 17:02
  • @Darksky: Those are HTML escape codes, using *only* ASCII characters. Something else is removing those, this has nothing to do with Python so far. `&Ouml;` is 6 characters, an ampersand, a capital `O`, lowercase `u`, `m` and `l`, then a semicolon. – Martijn Pieters May 03 '13 at 17:04
  • @MartijnPieters I found out the problem. I am using this html to feed into a `html.parser` subclass, using: `parser.feed(html)`. They are disappearing there. – darksky May 03 '13 at 17:05
  • @Darksky: Then update your question to be about *that subclass*. – Martijn Pieters May 03 '13 at 17:05
  • Do you call `super().__init__()` at all in your custom subclass? – Martijn Pieters May 03 '13 at 17:18
  • Nope. Do I have to? I think I figured it out. I'm writing a response. Python's documentation is so, so bad it's ridiculous. – darksky May 03 '13 at 17:21
  • @Darksky: Also, the *rest* of the custom parser might very well be relevant here. – Martijn Pieters May 03 '13 at 17:28

1 Answer

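Apparently, html.parser calls handle_entityref, not handle_data, whenever it encounters a named character reference such as &Ouml;. The page contained these ASCII escapes rather than raw non-ASCII characters, which is why they never showed up in handle_data. The method receives the name of the reference, and to convert that to the Unicode character, I used: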

html.entities.html5[name]

Python's documentation does not mention that. I've never seen worse documentation than Python's.
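
To make the lookup above concrete, here is a minimal sketch of a subclass that collects text and resolves character references itself. The class name TextCollector, its chunks list and its text() method are made up for illustration and are not from the original parser. Two assumptions to flag: the parser is created with convert_charrefs=False (available since Python 3.4) so that handle_entityref and handle_charref are actually called, and the lookup appends ";" to the name because most keys in html.entities.html5 include the trailing semicolon while the name passed to handle_entityref does not.

```python
from html.entities import html5
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass: gathers text and resolves character references."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref/handle_charref in play.
        # With convert_charrefs=True (the default since Python 3.5) the parser
        # converts references itself and passes the result to handle_data.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # name arrives without "&" and ";", e.g. "Ouml".
        # Unknown names are put back unchanged instead of being dropped.
        self.chunks.append(html5.get(name + ";", "&" + name + ";"))

    def handle_charref(self, name):
        # Numeric reference: decimal ("214") or hexadecimal ("xD6").
        if name[0] in "xX":
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.chunks.append(chr(code))

    def text(self):
        return "".join(self.chunks)


parser = TextCollector()
parser.feed("<p>&Ouml;sterreich &#214;l &hellip;</p>")
parser.close()
result = parser.text()
print(result)
```

On Python 3.5 and later, leaving convert_charrefs at its default should make the two extra handlers unnecessary, since the converted characters then arrive through handle_data. For a plain string outside the parser, html.unescape performs the same conversion.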

darksky