I am extracting HTML from some webpage with Unicode characters as follows:
import urllib.request

def extract(url):
    """Adapted from Python3_Google_Search.py"""
    # Adjacent string literals are concatenated, so each piece ends with a space.
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    # Decode the raw bytes into a str, assuming the page is UTF-8 encoded.
    html = response.read().decode("utf8")
    return html
As you can see, I decode the response as UTF-8, so html is now a Unicode string (str). When I print html, I can see the Unicode characters.
I am using html.parser to parse the HTML, and I have subclassed HTMLParser:
from html.parser import HTMLParser

class Parser(HTMLParser):
    def __init__(self):
        super().__init__()  # let HTMLParser set up its internal state
        ## some init stuff

    #### rest of class
When I parse the HTML using the class's handle_data method, the Unicode characters appear to be removed. The docs do not mention anything about encodings. Why does HTMLParser remove non-ASCII characters, and how can I fix this?
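
To narrow this down, a minimal sketch along the following lines could check whether handle_data itself drops non-ASCII text. The DataCollector class, the collect_text helper, and the sample markup are hypothetical names made up for illustration; they are not taken from the code above.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records every text chunk passed to handle_data."""

    def __init__(self):
        super().__init__()  # sets up HTMLParser's internal buffers
        self.chunks = []

    def handle_data(self, data):
        # Store the text exactly as the parser hands it over.
        self.chunks.append(data)


def collect_text(markup):
    """Feed a str of markup to the parser and return all collected text."""
    parser = DataCollector()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)


# Made-up sample containing accented, symbol, and CJK characters.
sample = "<p>caf\u00e9 \u2603 \u65e5\u672c\u8a9e</p>"
text = collect_text(sample)

# ascii() escapes non-ASCII characters, so the output does not depend
# on what the console encoding is able to display.
print(ascii(text))
```

If the collected text still contains the non-ASCII characters, the parser is passing them through and the loss is probably happening somewhere else, for example when the result is printed or written out.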