
I'm using feedparser to fetch RSS feed data. For most RSS feeds that works perfectly fine. However, I have now stumbled upon a website where fetching the RSS feed fails (example feed). The returned result does not contain the expected keys, and the values are HTML fragments.

I tried simply reading the feed URL with urllib2.Request(url). This fails with an HTTP Error 405: Not Allowed error. If I add a custom header like

headers = {
    'Content-type': 'text/xml',
    'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
}

request = urllib2.Request(url, headers=headers)

I don't get the 405 error anymore, but the returned content is an HTML document with some HEAD tags and an essentially empty BODY. In the browser everything looks fine, and the same goes for "View Page Source". feedparser.parse also allows setting agent and request_headers; I tried various agents. I'm still not able to correctly read the XML, let alone get the parsed feed from feedparser.

What am I missing here?

Christian
  • @JulienGenestoux: Turns out the site creates the RSS feed dynamically via JavaScript -- the first time I came across that. Can your system handle this? I currently have a kind of working solution using PhantomJS. I will probably provide an answer soon based on this. – Christian May 21 '15 at 02:50
  • No they don't generate the feed via JS :) (I'm not even sure that's possible!). See below for an answer... and yes, Superfeedr is able to handle that pretty well! – Julien Genestoux May 21 '15 at 15:51

1 Answer


So, this feed yields a 405 error when the client making the request does not send a User-Agent. Try this:

$ curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -o /dev/null -D- -s
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 21 May 2015 15:48:44 GMT
Content-Type: application/xml; charset=utf-8
Content-Length: 24616
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding

While without the UA, you get:

$ curl 'http://www.propertyguru.com.sg/rss' -o /dev/null -D- -s
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Thu, 21 May 2015 15:49:20 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
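The same check can be sketched in Python with urllib2, as in the question (the urlopen call is left commented out since it hits the network; urllib.request is the Python 3 equivalent):

```python
try:
    import urllib2                    # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent

url = 'http://www.propertyguru.com.sg/rss'

# Any non-empty User-Agent avoids the 405 Not Allowed response.
request = urllib2.Request(url, headers={'User-Agent': 'hum'})

# response = urllib2.urlopen(request)  # performs the actual HTTP request
# print(response.info())               # expect Content-Type: application/xml
```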
Julien Genestoux
  • As I wrote in my question, I got around the 405 error by setting the header. But neither with cURL nor with Python requests do I get an XML document as I would expect from an RSS feed, only an almost empty HTML document. Could you post the cURL command that actually outputs the XML? – Christian May 22 '15 at 00:00
  • That's what I did! Remove -o /dev/null and you'll see the full XML! – Julien Genestoux May 22 '15 at 09:27
  • Uh, doesn't work for me. If I do `curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -D- -s` I get a minimal HTML document with an essentially empty body. – Christian May 22 '15 at 10:07