
I have an HTML page containing this charset declaration:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">

Now, when I try to parse this file using BeautifulSoup, it always returns a `None` object. I can fix it by converting the file first:

import codecs
from bs4 import BeautifulSoup

page = codecs.open('file_name', 'r', 'cp1251')
soup = BeautifulSoup(page.read())

That works fine. But my collection contains pages in both UTF-8 and windows-1251, so I want to know how to determine the charset of a particular HTML page, and convert it accordingly if it's windows-1251.

I found this:

soup.originalEncoding

But to use that I first need to load the page into `soup`, and that is exactly where it returns a `NoneType` object. Any help would be highly appreciated.

I am using Python 2.7

EDIT:

Here is an example of what I mean.

This is my code:

from bs4 import BeautifulSoup
import urllib2

page=urllib2.urlopen(Page_link)
soup = BeautifulSoup(page.read())

print soup.html.head.title

A page containing

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

correctly displays the title of the page.

Now if a page has

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">

then the output is

AttributeError: 'NoneType' object has no attribute 'head'

I can fix this using the codecs library as mentioned above. What I am trying to find out is how to determine the encoding, so I can apply the right one.

These are the two sites I am trying to crawl to gather certain information:

http://www.orderapx.com/ and http://www.prpoakland.com/

Koustav
    You decoded the page to Unicode, not UTF8. UTF8 is another encoding, just like Windows-1251.. Can you show us sample source code? Does it really say `"charset=charset=..."` in the source? – Martijn Pieters Sep 30 '13 at 07:06
  • Sorry, my bad. I copied wrongly. I edited the line. Thanks for pointing it out. – Koustav Sep 30 '13 at 07:08
  • Also, is this BeautifulSoup 3 or 4? *Normally*, I'd expect BeautifulSoup to find the `<meta>` tag and detect the encoding for you. – Martijn Pieters Sep 30 '13 at 07:09
  • BeautifulSoup 4.3.1 ! – Koustav Sep 30 '13 at 07:12
  • Right. BS4 can use [different HTML parsers](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser); have you got `lxml` installed at all? BS4 will use that as the default if available. You can also try each of the other parsers to see if any works better for your specific content; see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser – Martijn Pieters Sep 30 '13 at 07:15
  • Yes, I have lxml installed. In fact, I first tried to do the entire thing in lxml only, but since it seemed a bit more complicated to me, I switched to BeautifulSoup. – Koustav Sep 30 '13 at 07:22
  • Then try the other two parsers and see if any of those work better; switch between `'html.parser'` and `'html5lib'` (which is a separate install). If the HTML is malformed, different parsers handle the repairing differently. – Martijn Pieters Sep 30 '13 at 07:24
  • Ah, you are opening this with `urllib2`? Is there a `Content-Type` header at all for these pages? – Martijn Pieters Sep 30 '13 at 07:25
  • Actually the entire dataset consists of 2000 pages collected from all over the web, different marketing websites. Hence I can't say whether all of them contain Content-Type header or not, but at least the few pages whose source codes I have actually seen, contains it. Assuming it is, for all the pages, what do you suggest ? – Koustav Sep 30 '13 at 07:28
    I'd look for a `Content-Type` header, and if present and containing a `charset=..` parameter, use that parameter to tell BeautifulSoup what codec to use. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings. – Martijn Pieters Sep 30 '13 at 07:30
  • Yes! It worked. Thank you very much for your valuable suggestions. Actually I was trying to determine the encoding using soup.originalEncoding (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Why can't Beautiful Soup print out the non-ASCII characters I gave it?) and it seems to always return a None type object. – Koustav Sep 30 '13 at 07:36
  • Glad to have been of help! I've summarized my advice into an answer below. – Martijn Pieters Sep 30 '13 at 07:52

1 Answer


You are loading your pages from the web; look for a `Content-Type` header with a `charset` parameter to see if the webserver already told you about the encoding:

from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen(Page_link)
# The server's Content-Type header may carry a charset parameter
charset = page.headers.getparam('charset')
soup = BeautifulSoup(page.read(), from_encoding=charset)

If no such parameter is present, charset is set to None and BeautifulSoup will fall back to guessing.
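If the server sends no charset and the guess turns out wrong, one last-resort fallback is to sniff the `<meta>` declaration from the raw bytes yourself before handing them to BeautifulSoup. A minimal sketch (written in Python 3 syntax for illustration; `sniff_charset` is a hypothetical helper, not part of BeautifulSoup):

```python
import re

def sniff_charset(raw_bytes, header_charset=None):
    """Prefer the HTTP header charset; otherwise scan the first 2 KB
    of the document for a charset=... declaration in a <meta> tag."""
    if header_charset:
        return header_charset
    m = re.search(br'charset=["\']?([\w-]+)', raw_bytes[:2048], re.I)
    return m.group(1).decode('ascii') if m else None

html = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">'
print(sniff_charset(html))  # -> windows-1251
```

This is only a heuristic; the ASCII-compatible prefix of the page is enough to find the declaration in both UTF-8 and windows-1251 documents, which is why scanning the raw bytes works here.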

You can also try out different parsers; if the HTML is malformed, different parsers will repair the HTML in different ways, perhaps allowing BeautifulSoup to detect the encoding better.
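That suggestion can be wrapped in a small helper that tries each parser in turn until one yields a usable tree (a sketch assuming BeautifulSoup 4 is installed; `parse_with_fallback` is a hypothetical name, and `html.parser` from the standard library is always available as the final fallback):

```python
from bs4 import BeautifulSoup

def parse_with_fallback(markup, parsers=('lxml', 'html.parser', 'html5lib')):
    """Try each parser in turn, skipping any that isn't installed,
    and return (soup, parser_name) for the first usable result."""
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except Exception:
            # bs4 raises FeatureNotFound when a parser isn't installed
            continue
        if soup.find('html') is not None:
            return soup, name
    return None, None

soup, used = parse_with_fallback('<html><head><title>t</title></head></html>')
```

The "usable" check here is deliberately simple (does an `<html>` element exist?); for your crawl you might tighten it to whatever elements you actually need, such as `soup.html.head`.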

Martijn Pieters