
I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as utf-8, gb2312, and gbk. I used urllib2 to get sohu's home page, which is encoded with gbk, but in my code I also decode it this way:

self.html_doc = self.html_doc.decode('gb2312','ignore')

But how can I know the encoding a page uses before I use BeautifulSoup to decode it to unicode? On most Chinese websites, there is no charset in the Content-Type field of the HTTP header.
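
For what it's worth, this is how I check the header first (a minimal Python 2 sketch; on these sites the charset is usually just missing):

import urllib2

# Look for a charset declared in the Content-Type response header.
response = urllib2.urlopen('http://www.sohu.com')
charset = response.info().getparam('charset')
print charset  # prints None when the server does not declare one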

3 Answers

Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)

>>> soup.original_encoding
u'gbk'

And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:

<meta http-equiv="content-type" content="text/html; charset=GBK" />

>>> soup.meta['content']
u'text/html; charset=GBK'
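
If you ever need to pull the charset out of that content value yourself, cgi.parse_header does the splitting (just a side sketch, not something this answer depends on):

import cgi

# parse_header splits 'text/html; charset=GBK' into a type and a params dict
content_type, params = cgi.parse_header(soup.meta['content'])
print params.get('charset')  # GBK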

Now you can decode the HTML:

decoded_html = html.decode(soup.original_encoding)

but there's not much point, since BeautifulSoup already makes the document available as unicode:

>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐

It is also possible to attempt to detect it using the chardet module (although it is a bit slow):

>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
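
A short follow-up sketch, if you want to decode with whatever chardet reports (the 'replace' error handler is a defensive choice, not a requirement):

import chardet

detected = chardet.detect(html)
if detected['encoding']:
    # Use the guessed encoding; 'replace' avoids raising on stray bad bytes
    decoded_html = html.decode(detected['encoding'], 'replace')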
mhawke
  • Thank you very much! I am a beginner. – Zhang Yongsheng Jan 28 '15 at 06:54
  • Well, I also have a problem; maybe it is easy for you. I try to download the html web page of [link](http://www.sina.com.cn), and after I unzip it, I get a UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128). So I try to decode it from gb2312 to unicode and then use BeautifulSoup; if I do not, I get a UnicodeDecodeError. – Zhang Yongsheng Jan 28 '15 at 07:15
  • @ZhangYongsheng: there's an encoding issue with that page - it claims to be GB2312 in the `<meta>` tag (and `chardet` agrees); however, there is at least one non-GB2312 byte sequence (bytes 143328-143329 == `\x87N` == 嘚). You can decode that page with GBK (a superset of GB2312), i.e. `soup = BeautifulSoup(urllib2.urlopen('http://www.sina.com.cn/'), from_encoding='gbk')`. You can also decode manually using `html = urllib2.urlopen('http://www.sina.com.cn/').read().decode('gbk')`. – mhawke Jan 28 '15 at 10:39
  • You are really an expert! Thanks a lot. :) – Zhang Yongsheng Jan 28 '15 at 12:19

Another solution:

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
doc = SimplifiedDoc(html)
print(doc.title.text)
dabingsou

I know this is an old question, but I spent a while today puzzling over a particularly problematic website, so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html

Requests has a feature that will guess the encoding from the content of the response itself, meaning you don't have to wrestle with encoding/decoding it yourself (before I found this, I was getting all sorts of errors trying to encode/decode strings/bytes and never getting any readable output). This feature is called apparent_encoding. Here's how it worked for me:

from bs4 import BeautifulSoup
import requests

url = 'http://url_youre_using_here.html'
readOut = requests.get(url)
readOut.encoding = readOut.apparent_encoding  # sets the encoding properly before you hand it off to BeautifulSoup
soup = BeautifulSoup(readOut.text, "lxml")
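
A quick sanity check after that (assuming the page actually has a <title> tag):

print(readOut.apparent_encoding)  # e.g. 'GB2312' for many Chinese portals
print(soup.title.string)          # should now print readable characters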
jbat