I'm using BeautifulSoup to parse some web pages.
Occasionally I run into a "unicode hell" problem like the following:
Looking at the source of this article on TheAtlantic.com [ http://www.theatlantic.com/education/archive/2013/10/why-are-hundreds-of-harvard-students-studying-ancient-chinese-philosophy/280356/ ], we see this in the og:description meta property:
<meta property="og:description" content="The professor who teaches&nbsp;Classical Chinese Ethical and Political Theory claims, &quot;This course will change your life.&quot;" />
When BeautifulSoup parses it, I see this:
>>> print repr(description)
u'The professor who teaches\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'
If I try encoding it to UTF-8, as this SO answer suggests ( https://stackoverflow.com/a/10996267/442650 ), I get:
>>> print repr(description.encode('utf8'))
'The professor who teaches\xc2\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'
I thought I had all my Unicode issues under control, but I still don't quite understand what's going on here, so I'm going to lay out a few questions:
1- Why would BeautifulSoup convert the &nbsp; to \xa0 [a Latin-1 charset space character]? The charset and headers on this page are UTF-8, and I thought BeautifulSoup pulls that data for the encoding. Why wasn't it replaced with a <space>?
2- Is there a common way to normalize whitespace for conversion?
3- When I encoded to UTF-8, where did \xa0 become the sequence \xc2\xa0?
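To make questions 1 and 3 concrete, here is a minimal sketch in Python 3 (the session above is Python 2). It assumes the entity in the page source is `&nbsp;`, and it uses the standard library's `html.unescape` as a stand-in for the entity decoding a parser performs. That substitution is my assumption, not a description of BeautifulSoup's internals.

```python
from html import unescape

# Assumption: the page source contains the HTML entity &nbsp; at this spot.
raw = "teaches&nbsp;Classical"

# Stand-in for the parser's entity decoding (not BeautifulSoup itself).
decoded = unescape(raw)
print(repr(decoded))

# The entity names a single code point, U+00A0 NO-BREAK SPACE,
# which is a different character from the ordinary space U+0020.
nbsp = decoded[7]
print(hex(ord(nbsp)))

# UTF-8 represents every code point above U+007F with two or more bytes,
# so the one-character string becomes a two-byte sequence when encoded.
encoded = nbsp.encode("utf-8")
print(repr(encoded))

# Decoding the bytes gives back the same single character.
assert encoded.decode("utf-8") == nbsp
```

If this sketch reflects what is happening, the \xa0 in the repr is a code point rather than a Latin-1 byte, and \xc2\xa0 is simply that code point's UTF-8 byte representation.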
I can pipe everything through unicodedata.normalize('NFKD', string) to get where I want to be, but I'd love to understand what's wrong and avoid problems like this in the future.
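For reference, this is roughly the workaround I mean, written as a Python 3 sketch next to one alternative for question 2. The helper name `collapse_whitespace` is hypothetical, and the sample string is a shortened version of the description above.

```python
import unicodedata

# Shortened sample taken from the parsed description above.
description = "The professor who teaches\xa0Classical Chinese"

# NFKD applies compatibility decomposition, which maps U+00A0 to U+0020.
nfkd = unicodedata.normalize("NFKD", description)
print(repr(nfkd))

def collapse_whitespace(text):
    """Hypothetical helper: split on any Unicode whitespace, rejoin with ' '."""
    return " ".join(text.split())

# str.split() with no arguments treats U+00A0 as whitespace.
collapsed = collapse_whitespace(description)
print(repr(collapsed))

# Caveat: NFKD changes more than whitespace, e.g. it splits accented letters
# into a base letter plus a combining mark.
accented = unicodedata.normalize("NFKD", "caf\xe9")
print(repr(accented), len(accented))
```

Both approaches turn the no-break space into a plain space for this sample, but NFKD also rewrites other characters, so the split-and-join version may be the safer choice when only whitespace should change.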