I'm scraping this URL: http://www.rajtamil.com/category/vijay-tv-shows/

I'm stuck on this error:

    movTitle = str(link['title'])
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 41: ordinal not in range(128)

Here's my code snippet

    import urllib2
    from bs4 import BeautifulSoup

    rajTamilurl='http://www.rajtamil.com/category/vijay-tv-shows/'
    req = urllib2.Request(rajTamilurl)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
    response = urllib2.urlopen(req)
    link=response.read()
    response.close()

    # Here's what I've tried so far
    #link=link.decode('utf-8')
    #link=link.encode('utf-8','ignore')
    #link=link.decode('ascii', 'ignore')
    #soup = BeautifulSoup(link, from_encoding="utf-8")
    #soup = BeautifulSoup(link.decode('utf-8','ignore'))
    #soup = BeautifulSoup(link, 'html5lib')
    #print soup.prettify()

    soup = BeautifulSoup(link)
    for eachItem in soup.findAll('li'):
        for coveritem in eachItem.findAll("div", { "class":"cover" }):
            links = coveritem.find_all('a')
            for link in links:
                print link['title']
                movTitle = str(link['title'])

Any pointers?

gbzygil
  • Why do you need to convert to `str`? It should already be `unicode`, which is a better way to process strings. – Paulo Bu Feb 21 '14 at 22:26
  • Thanks for your reply. This is a small part of a bigger piece of code which is filled with `str`s all the way (so are the libs that I'm importing). It would be a pain in the bottom to de-str those, plus I'm not very comfortable with char encodings. – gbzygil Feb 21 '14 at 22:39
  • Did you have a `# coding: utf-8` line at the beginning of your script? – Hai Vu Feb 21 '14 at 22:45
  • Nope, I did not have that at the beginning of the script. The rest of the code is available here for reference: http://pastebin.com/cW7HF8Va – gbzygil Feb 21 '14 at 22:48
  • @gbzygil `# coding: utf-8` is not the problem here, take a look at the answer. – Paulo Bu Feb 21 '14 at 22:52
  • *"i'm not very comfortable with char encodings.."* -- you can fix it. Most of what you need to know when working with text is: 1. Decode early 2. Unicode everywhere 3. Encode late. That's it. [Explain it like I'm five: Python and Unicode?](http://www.reddit.com/r/Python/comments/1g62eh/explain_it_like_im_five_python_and_unicode/) – jfs Feb 21 '14 at 23:26
  • I'll also recommend Joel's http://www.joelonsoftware.com/articles/Unicode.html. It is also a very pleasant read. – Paulo Bu Feb 21 '14 at 23:33
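
The "decode early, Unicode everywhere, encode late" rule from the comments can be sketched roughly as follows. The byte string and variable names are made up for illustration, and the page is assumed to be UTF-8; the snippet is written to run on both Python 2 and Python 3.

```python
# Hypothetical raw bytes, standing in for what response.read() returns.
# Assumption: the source is UTF-8 encoded.
raw = b"Super Singer \xe2\x80\x93 Episode 1"

# 1. Decode early: turn bytes into Unicode text right at the input boundary.
text = raw.decode("utf-8")

# 2. Unicode everywhere: all processing happens on Unicode text.
shouted = text.upper()

# 3. Encode late: convert back to bytes only when writing to a file or socket.
out = shouted.encode("utf-8")
```

With this layout, only the input and output boundaries need to know about a character encoding.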

1 Answer

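Although I strongly recommend against working with `str`, I understand you have some constraints. Try replacing this line:

    movTitle = str(link['title'])

with this one:

    movTitle = link['title'].encode('utf8')

In Python 2, calling `str()` on a `unicode` object implicitly encodes it with the `ascii` codec, which fails on characters such as `u'\u2013'` (an en dash). When you explicitly encode a `unicode` string, you get the corresponding encoded `str` (a bytestring) back.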
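
As a small sketch of the difference, the snippet below uses a made-up title containing the same `u'\u2013'` character as in the traceback. Python 2's `str(title)` behaves like `title.encode('ascii')`, so the explicit call is used here to reproduce the failure on both Python 2 and Python 3.

```python
# Hypothetical title containing U+2013 (EN DASH), as in the traceback.
title = u"Super Singer \u2013 Episode 1"

# The ASCII codec cannot represent U+2013, so this raises UnicodeEncodeError.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every Unicode character, so this succeeds
# and yields a bytestring.
mov_title = title.encode("utf8")
```

Decoding `mov_title` with `'utf8'` gives back the original Unicode title.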

Hope this helps!

Paulo Bu
  • Why do you think `utf8` is appropriate here? We don't know what character encoding or even *multiple* character encodings are used by other parts of the code. Using bytestrings in this case eventually leads to silent data corruption. – jfs Feb 21 '14 at 23:08
  • @J.F.Sebastian I went to the web page and checked; that's why I suggested it. I know about that, but the OP has decided to work with `str` :( I'm just trying to explain the error to him somehow. – Paulo Bu Feb 21 '14 at 23:10
  • The only part of the code that needs to know and use the html source character encoding is `BeautifulSoup()`. `link['title']` is a Unicode string (no associated character encoding at this step). How would other parts of the code know about the html source character encoding? – jfs Feb 21 '14 at 23:20
  • @J.F.Sebastian I know my answer ain't brilliant. I just tried to meet the OP's expectations. I asked all these before answering and believe me, I know what you're trying to tell me. If you read the comments the first one was _why convert to str_? – Paulo Bu Feb 21 '14 at 23:23
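
The silent data corruption mentioned in the comments above can be illustrated with a short sketch. The title is made up, and the choice of Latin-1 as the "other" encoding is only an assumption for the example; the point is that nothing raises an error when a bytestring is decoded with the wrong codec.

```python
# Hypothetical title, encoded as UTF-8 in one part of the program.
title = u"Super Singer \u2013 Episode 1"
utf8_bytes = title.encode("utf-8")

# Assumption: another part of the program wrongly treats the bytes as Latin-1.
# Latin-1 maps every byte to a character, so no exception is raised here.
misread = utf8_bytes.decode("latin-1")

# The en dash has silently turned into three unrelated characters.
corrupted = misread != title
```

Keeping text as Unicode between the input and output boundaries avoids this, because each bytestring is decoded exactly once with a known encoding.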