BeautifulSoup findAll with class attribute: UnicodeEncodeError

Question

I am using BeautifulSoup to extract news stories (just the titles) from Hacker News, and this is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html =[]

    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")

    return titles_html

print get_stories(get_page())

When I run the code, however, it gives an error:

Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)

How do I get this to work?

score 6 · Accepted Answer · answered Apr 21 '11 at 16:24

BeautifulSoup works internally with unicode strings. Printing a unicode string to the console makes Python convert it to its default encoding, which is usually ASCII. That conversion fails as soon as the page contains a non-ASCII character, which is common on real web sites. Searching for "python unicode" will turn up the basics of how Python handles Unicode. In the meantime, encode your unicode strings to UTF-8 before printing:

print some_unicode_string.encode('utf-8')

You want `.encode('utf-8')` to convert from a Unicode string to a UTF-8 encoded string. — Alex, Apr 21 '11 at 16:27
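
The difference between the failing ASCII conversion and an explicit UTF-8 encode can be reproduced without any network access. This is only a minimal sketch: the sample title and the helper name `to_utf8` are made up for illustration, and it assumes a terminal that displays UTF-8.

```python
from __future__ import print_function

# Hypothetical sample title containing the character from the traceback.
title = u"Ch\xe2teau opens to visitors"


def to_utf8(text):
    """Encode a unicode string to UTF-8 bytes (illustrative helper)."""
    return text.encode("utf-8")


# The ASCII codec cannot represent u'\xe2', so this raises the same
# exception type as in the question.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# An explicit UTF-8 encode succeeds and yields a byte string.
encoded = to_utf8(title)
print(ascii_failed)
print(repr(encoded))
```

On Python 2, `print to_utf8(title)` writes those bytes to the terminal unchanged, so the title displays correctly if the terminal itself expects UTF-8.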

score 1 · Answer 2 · answered Apr 21 '11 at 16:35

One thing to note about your code is that findAll returns a list (in this case, a list of BeautifulSoup tag objects), whereas you say you only want the titles. You might want to use find instead and collect the text of each link rather than printing the list of objects. The following works fine, for example:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles = []

    for td in soup.findAll("td", { "class":"title" }):
        a_element = td.find("a")
        if a_element:
            titles.append(a_element.string)

    return titles

print get_stories(get_page())

So now get_stories() returns a list of unicode objects, which prints out as you'd expect.
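
If one title per line is preferred over printing the list itself, the titles can be joined and encoded once before output. A minimal sketch, assuming `titles` is a list like the one returned by `get_stories()` above; `format_titles` is a hypothetical helper, and the `None` check is there on the assumption that `.string` can be `None` for a link that contains nested tags.

```python
def format_titles(titles, encoding="utf-8"):
    """Join unicode titles one per line and encode them for output."""
    # Skip entries that are None (no single text child was found).
    lines = [t for t in titles if t is not None]
    return u"\n".join(lines).encode(encoding)


# Stand-in data instead of a live request to Hacker News.
sample = [u"Show HN: caf\xe9 finder", None, u"Plain ASCII title"]
output = format_titles(sample)
```

With the code above, `print format_titles(get_stories(get_page()))` would then replace the final print statement on Python 2.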

score 0 · Answer 3 · answered Apr 21 '11 at 16:20

The code itself works fine; what's broken is the output. Either explicitly encode to your console's charset, or find a different way to run your code (e.g., from within IDLE).

Ignacio Vazquez-Abrams
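
Encoding to the console's charset could look roughly like the sketch below. The helper name `encode_for_console` and the UTF-8 fallback are assumptions made for illustration; `sys.stdout.encoding` is whatever the interpreter reports for the current terminal and may be missing when output is piped.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text for the terminal, replacing unsupported characters."""
    # sys.stdout.encoding can be None (for example when output is piped),
    # so fall back to UTF-8 in that case.
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(target, "replace")


# Forcing ASCII shows the replacement behaviour: no exception is raised,
# and the unsupported character becomes "?".
ascii_safe = encode_for_console(u"Ch\xe2teau", "ascii")
```

Passing `"replace"` trades fidelity for robustness: characters the console cannot show are substituted instead of aborting the script.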