
If I add a feed URL to Google Reader or to a desktop feed aggregator, I get good results. The URL is:

http://estaticos03.marca.com/rss/futbol_1adivision.xml

But when I fetch the same URL from a script (a Python script using the feedparser library), I get slightly different content for the same entries (the title of each entry, for example, is different and all in uppercase).

I believe something is done on the server side to discourage people like me from parsing the content for their own projects (the feed is from a popular football newspaper), but I am not sure about it. I tried passing some user agents (like the Google Reader one) but still no luck, so maybe they check the IP as well? I am really confused.

Any idea why this is happening?

Thanks!

Dan Lowe
nabucosound
  • Maybe ask them? And how can they check the IP? Your browser and your Python script have the same IP. :) – Lennart Regebro Jan 09 '11 at 22:31
  • If I asked them I don't think they would answer me anyway. And for the IP you are right but maybe they first check the user agent and if it is, let's say, google reader, then they could check the IP. But I don't think they are so sophisticated... – nabucosound Jan 09 '11 at 23:50
  • Could you provide the RSS URL you're trying to access? I'd be interested in seeing what's going on. – smilbandit Jan 10 '11 at 02:41
  • Sure, I have updated the question with the URL – nabucosound Jan 12 '11 at 11:04
  • Next improvement to question would be to paste your feedparser script. Also, how about the raw text Python is fetching? Does it look more complete than what you get from feedparser? – TryPyPy Jan 12 '11 at 12:04

3 Answers


AFAIK Google Reader does some "magic" in the content to beautify it. They strip some tags and styles to avoid breaking their interface.

Can you provide more details on the differences?

  • No, it's not related to beautifying content. The title I get from Google Reader or NewsFire (a desktop reader) is totally different from the one I get from my Python script (which is shorter and in uppercase). This is the only difference I see so far, but it is driving me nuts. The title Google Reader displays doesn't exist in my Python script results, so I guess the RSS server is giving Google Reader and NewsFire a nicer feed source than it gives me... – nabucosound Jan 09 '11 at 23:44

Did you change the user agent of your script? Try to mimic Firefox and see what happens.

Esteban Feldman
  • Yes I did, I tried for example "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" or "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8) Gecko/20100101 Firefox/4.0b8" but no success... – nabucosound Jan 12 '11 at 11:53
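
To rule out the user agent, it can help to fetch the raw bytes yourself and look at them before any feed library touches them. The following is only a minimal sketch using the Python 3 standard library; `build_request` and `fetch_raw` are hypothetical helper names, and the user agent string is one of those quoted in the comment above.

```python
import urllib.request

FEED_URL = "http://estaticos03.marca.com/rss/futbol_1adivision.xml"
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8) "
    "Gecko/20100101 Firefox/4.0b8"
)


def build_request(url, user_agent):
    """Build a request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_raw(url, user_agent, timeout=10):
    """Return the undecoded response body, exactly as the server sent it."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Needs network access; the feed may no longer exist at this URL.
    raw = fetch_raw(FEED_URL, USER_AGENT)
    print(raw[:500])
```

If the raw XML is the same for every user agent, the server is not the one changing the titles, and the difference has to come from the parsing step.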

All right folks, I found it. I analyzed the source XML received (as @TryPyPy suggested). I had been trusting the feedparser library too much. The latest official version (4.1) has a bug: it mistakenly takes the title tag from the media namespace instead of the item's original title tag:

http://code.google.com/p/feedparser/issues/detail?id=76
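
A way to see the two titles side by side, independently of feedparser, is to read them straight from the XML. This is only a sketch with the standard library: the sample item below is made up for illustration (it is not the real feed content, and the real feed may nest its media elements differently), and `item_titles` is a hypothetical helper name. The namespace URI is the usual Media RSS one.

```python
import xml.etree.ElementTree as ET

# Namespace URI commonly used for Media RSS elements such as media:title.
MEDIA_NS = "http://search.yahoo.com/mrss/"

# Hypothetical sample data, shaped like an item that has both kinds of title.
SAMPLE_RSS = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Full headline as shown in a feed reader</title>
      <media:content url="http://example.com/photo.jpg" type="image/jpeg">
        <media:title>SHORT UPPERCASE TITLE</media:title>
      </media:content>
    </item>
  </channel>
</rss>
"""


def item_titles(xml_text):
    """Return (plain_title, media_title) for every item in the feed."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("item"):
        # Direct child <title> with no namespace: the item's own title.
        plain = item.findtext("title")
        # Any descendant <media:title>: the one that got picked up by mistake.
        media = item.findtext(".//{%s}title" % MEDIA_NS)
        results.append((plain, media))
    return results


for plain, media in item_titles(SAMPLE_RSS):
    print("title:      ", plain)
    print("media:title:", media)
```

If a parser reports the second value as the entry title while readers show the first, the parser is confusing the two namespaces rather than the server sending different content.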

So I reinstalled from trunk and now everything is OK. Thanks for helping anyway!

nabucosound