
I'm using feedparser to fetch RSS feed data. For most RSS feeds that works perfectly fine. However, I have now stumbled upon a website where fetching the RSS feed fails (example feed). The returned result does not contain the expected keys, and the values are HTML fragments.

I tried simply reading the feed URL with urllib2.Request(url). This fails with an HTTP Error 405: Not Allowed error. If I add a custom header like

headers = {
    'Content-type': 'text/xml',
    'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
}

request = urllib2.Request(url, headers=headers)

I don't get the 405 error anymore, but the returned content is an HTML document with some HEAD tags and an essentially empty BODY. In the browser everything looks fine, and the same goes for "View Page Source". feedparser.parse also allows setting agent and request_headers; I tried various agents. I'm still not able to correctly read the XML, let alone get the parsed feed from feedparser.

What am I missing here?

Christian
  • @JulienGenestoux: Turns out the site creates the RSS feed dynamically via JavaScript -- the first time I came across that. Can your system handle this? I currently have a kind of working solution using PhantomJS. I will probably provide an answer soon based on this. – Christian May 21 '15 at 02:50
  • No they don't generate the feed via JS :) (I'm not even sure that's possible!). See below for an answer... and yes, Superfeedr is able to handle that pretty well! – Julien Genestoux May 21 '15 at 15:51

1 Answer


So, this feed yields a 405 error when the client making the request does not send a User-Agent. Try this:

$ curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -o /dev/null -D- -s
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 21 May 2015 15:48:44 GMT
Content-Type: application/xml; charset=utf-8
Content-Length: 24616
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding

While without the UA, you get:

$ curl 'http://www.propertyguru.com.sg/rss' -o /dev/null -D- -s
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Thu, 21 May 2015 15:49:20 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
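The same check can be sketched in Python with urllib2, as in the question (the urlopen call is left commented out since it hits the network; urllib.request is the Python 3 equivalent):

```python
try:
    import urllib2                    # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent

url = 'http://www.propertyguru.com.sg/rss'

# Any non-empty User-Agent avoids the 405 Not Allowed response.
request = urllib2.Request(url, headers={'User-Agent': 'hum'})

# response = urllib2.urlopen(request)  # performs the actual HTTP request
# print(response.info())               # expect Content-Type: application/xml
```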
Julien Genestoux
  • As I wrote in my question, I got around the 405 error by setting the header. But neither with cURL nor with Python requests do I get an XML document as I would expect from an RSS feed, only an almost empty HTML document. Could you post the cURL command that actually outputs the XML? – Christian May 22 '15 at 00:00
  • That's what I did! Remove -o /dev/null and you'll see the full XML! – Julien Genestoux May 22 '15 at 09:27
  • Uh, doesn't work for me. If I do `curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -D- -s` I get a minimal HTML document with an essentially empty body. – Christian May 22 '15 at 10:07