How to parse a binary-encoded RSS feed

Question

Hi, I am having a problem downloading and reading an RSS feed from a particular site. The downloaded feed appears to be in a binary format. Can anybody tell me how to get it back into a readable format that I can then pass to BeautifulSoup for parsing?

Here is my code so far:

import urllib2
from BeautifulSoup import BeautifulSoup

rss_feed = urllib2.urlopen("http://kat.ph/usearch/ubuntu/?rss=1", timeout=5.0).read()
print rss_feed  # prints binary data instead of the expected XML
rss_feed_soup = BeautifulSoup(rss_feed)

To clarify: I cannot get the XML when reading the feed with urllib2, yet any modern web browser displays the feed correctly. What am I missing? Is the RSS feed binary encoded, and if so, how do I decode it correctly?

Thanks for any replies.

Martijn Pieters · Accepted Answer · 2012-12-20T22:22:29.583

The feed is gzipped by the server for efficient downloading; it has a Content-Encoding: gzip header set.

Use feedparser to download and parse it instead of using urllib2 and BeautifulSoup.

If you have to use urllib2, also use the gzip module to decompress the content first:

import gzip
from cStringIO import StringIO

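# rss_feed holds the raw bytes returned by urllib2; wrap them in a file-like object so GzipFile can decompress them
rss_feed = gzip.GzipFile(fileobj=StringIO(rss_feed)).read()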
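
The snippet above always decompresses, which fails if the server ever sends the feed uncompressed. A minimal sketch of a more defensive version follows, written for Python 3 (where `io.BytesIO` replaces `cStringIO.StringIO`). The helper name `decode_body` is made up for illustration, and the gzipped response is simulated locally rather than fetched from the site:

```python
import gzip
import io


def decode_body(raw, content_encoding):
    """Return the body as bytes, gunzipping it only when the header says gzip.

    `decode_body` is a hypothetical helper, not part of any library.
    """
    if (content_encoding or "").strip().lower() == "gzip":
        # Same technique as above: wrap the bytes in a file-like object.
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw


# Simulate a gzipped response locally instead of hitting the network.
xml = b"<?xml version='1.0'?><rss><channel><title>demo</title></channel></rss>"
compressed = gzip.compress(xml)

print(decode_body(compressed, "gzip") == xml)  # compressed body is restored
print(decode_body(xml, None) == xml)           # plain body passes through
```

With a real response, the header value would come from `response.info().getheader("Content-Encoding")` in Python 2's urllib2, or `response.headers.get("Content-Encoding")` in Python 3's urllib.request.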

edited Dec 20 '12 at 22:22

answered Dec 20 '12 at 22:13

Martijn Pieters


Thanks very much for your quick reply. I will take a look at this first thing tomorrow morning and report back. – binhex Dec 20 '12 at 22:27
