
I've been trying to make a quote scraper and Reddit bot in Python 3.4 with Beautiful Soup 4. My code to scrape quotes from Goodreads is as follows: http://pastebin.com/1EZHPmym The problem is that it prints sequences such as "\xe2\x80\x9c" between the quotes and the authors' names. I am a complete beginner at programming; I researched the problem and it appears to be an encoding issue. So I checked the charset declared in the source of the Goodreads quotes page, then looked through the source to find the exact div containing the text I wanted, and I saw:

“Don't cry because it's over, smile because it happened.”
Dr. Seuss

It shows some odd character references such as "&#8213", "&ldquo", etc. I'm currently working on a brute-force method to find all such characters and simply remove them from the results. But I watched this video: BeautifulSoup Tutorial, and the presenter didn't seem to run into the same thing, which makes sense, as the source of the Yellow Pages listing for Los Angeles coffee shops didn't contain these characters.

The same code adjusted for Python 2.7 (where print takes no parentheses) yields text without these escape sequences. Is there a reason why?

Note that my present solution is to use Python's .replace in Python 3 to strip these sequences, but is there a better solution?

Note that Beautiful Soup and Unicode Problems explains what is happening very well, but I don't understand why this problem doesn't occur in Python 2.7.

Caius

1 Answer


Use

b'\xe2\x80\x9c'.decode()

It returns the left double quotation mark (U+201C) as a str:

'“'
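To show the same idea on a whole quote, here is a minimal sketch. It assumes the script ends up printing a bytes object (for example after calling .encode('utf-8')); the pastebin code is not reproduced here, so that is a guess. The variable names are made up for illustration. In Python 2, str is itself a byte string and print writes those bytes straight to the terminal, which is why the escapes never show up there.

```python
# UTF-8 bytes for the quote, as they might look if the text was encoded
# before printing (an assumption about the original script).
raw = b"\xe2\x80\x9cDon't cry because it's over, smile because it happened.\xe2\x80\x9d"

# In Python 3, print(raw) shows this repr, escapes included.
bytes_repr = repr(raw)

# Decoding gives a str; print(text) would show real curly quotes
# on a terminal that can display them.
text = raw.decode('utf-8')

# If the curly quotes are unwanted, strip them from the ends
# instead of chaining .replace() calls.
plain = text.strip('\u201c\u201d')
print(plain)
```

If the bytes come from Beautiful Soup's get_text(), there is normally nothing to decode, since it already returns a str; in that case removing the .encode() call is probably all that is needed.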
tjati
Sarit Adhikari