
I've been struggling with this for a while now. The following code snippet returns None for some websites even when the charset is present in the meta tag or the headers, so it doesn't seem to be a reliable way to get the proper charset of a webpage.

import urllib2

conn = urllib2.urlopen(req)                 # req: the url or Request object to fetch
charset = conn.headers.getparam('charset')  # None when the header has no charset parameter

I've read several threads here on SO and some of them mention using chardet, but I don't want to import an additional module if possible. Instead I'm thinking of downloading only the header and getting the charset info with some string functions.

Does anybody have a better idea?

g0m3z

2 Answers


conn.headers.getparam('charset') doesn't parse the html content (the <meta> tag); it only looks at the http headers (e.g., Content-Type).
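Concretely (a quick sketch; the url and the header value in the comments are only illustrative), getparam reads nothing but the parameter of the Content-Type response header:

import urllib2

conn = urllib2.urlopen('http://example.com/')  # placeholder url
# If the server sent "Content-Type: text/html; charset=UTF-8" this prints
# 'UTF-8'; if the header has no charset parameter it prints None, no matter
# what any <meta charset=...> tag in the document body says.
print conn.headers.gettype()             # e.g. 'text/html'
print conn.headers.getparam('charset')   # e.g. 'UTF-8' or None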

You could use [an html parser](http://stackoverflow.com/a/15305248/4279) to get the character encoding if it is not specified in the http headers.
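For illustration, a minimal sketch (assuming lxml is installed; `url` is a placeholder, and this is not the code from the linked answer) that tries the http header first and then falls back to the <meta> declarations in the document itself:

import urllib2
from lxml import html

conn = urllib2.urlopen(url)                  # url: placeholder for the page to fetch
charset = conn.headers.getparam('charset')   # from the http Content-Type header
if charset is None:
    doc = html.parse(conn).getroot()
    # html5-style declaration: <meta charset="...">
    declared = doc.xpath('//meta/@charset')
    # html4-style declaration: <meta http-equiv="Content-Type" content="...; charset=...">
    declared += [c.split('charset=', 1)[1]
                 for c in doc.xpath('//meta[@http-equiv]/@content')
                 if 'charset=' in c]
    charset = declared[0] if declared else None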

jfs
  • Got it and thanks! I just checked the header of the page and it doesn't contain charset at all. – g0m3z Sep 02 '14 at 13:39
  • If anyone is interested in my solution, here it is. I use `cssselect` of the `lxml` module to get the information: `charset = site.cssselect('meta[http-equiv="Content-Type"]')[0].get('content').split("charset=",1)[1]` – g0m3z Sep 03 '14 at 08:41
  • @g0m3z: you should post it as your own answer. Why do you need the character encoding *after* you already completely parsed the html? Follow [the link](http://stackoverflow.com/a/15305248/4279) that I've provided in the answer and see how the encoding is handled. – jfs Sep 03 '14 at 10:02
  • @Sebastian Maybe you are right and I'm doing something wrong. I use the `parse().getroot()` function of `lxml.html`, which returns an `HtmlElement` object, and I get the relevant information from this `HtmlElement` object by applying some xpath expressions to it, which return some specific characters that need to be encoded properly. Is this an incorrect approach? – g0m3z Sep 03 '14 at 11:43
  • @g0m3z: it is an incorrect approach if your strings are already Unicode (look at the strings returned by lxml; see the sketch after these comments). Note: on Python 2, lxml may use the `str` type if a string is ascii-only (an optimization of some kind). – jfs Sep 03 '14 at 20:16
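A minimal sketch of that last point, assuming lxml on Python 2: text pulled out of a parsed document usually comes back already decoded, so there is normally nothing left to decode yourself.

from lxml import html

doc = html.fromstring(u'<p>caf\xe9</p>')
text = doc.text_content()
print isinstance(text, unicode)   # True: lxml hands back decoded text
# ascii-only content may come back as a plain str instead (the optimization
# mentioned above), but plain ascii needs no charset-specific decoding.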

Moving my comment here and posting it as an answer.

Thanks to @J.F. Sebastian, I could get the charset from the meta tag using the code snippet below:

import urllib2
from lxml.html import parse

conn = urllib2.urlopen(url)   # url defined earlier
site = parse(conn).getroot()
charset = site.cssselect('meta[http-equiv="Content-Type"]')[0].get('content').split("charset=", 1)[1]
g0m3z
  • It is *not* what I suggested; [follow the link](http://stackoverflow.com/a/15305248/4279). (I don't want to copy-paste the answer verbatim.) – jfs Sep 03 '14 at 20:17
  • Maybe my understanding of encoding/decoding is not correct. What I'm trying to achieve is to grab data from 3 different websites/encodings and store it in an SQLite DB. The reason I posted my original question was that I print all the data to screen for debugging, and the result was a hex stream for those characters which are in unicode (e.g.: \xef\xb6\x9b). Maybe it's not an issue at all, or only an issue with how the data is represented on screen, and I could store the data in the DB as it appears on the website. Do I need to decode the data at all? Thanks for your help so far! – g0m3z Sep 06 '14 at 14:08
  • If you follow the code in the link then you should be able to get your data as a Unicode string. If you can't, provide a complete minimal example that shows the issue. If you have problems with displaying unicode or saving it into a database, ask another question. – jfs Sep 06 '14 at 14:46