3

I want to run a Python script every so many minutes. The script starts by fetching the newest article from a rss-feed (using feedparser). What I want, is when the newest article is the same as the last time it ran, the script just ends. How do I accomplish this?

typo.pl
  • 8,812
  • 2
  • 28
  • 29
HankSmackHood
  • 4,673
  • 7
  • 29
  • 30

3 Answers3

5

Since you mention feedparser in a python question, I'll assume you mean feedparser.org

If yes, then the easiest way to do this is make the server do most of the work for you and only request updates since the last change you fetched: See ETag and Last-Modified headers for RSS feeds

typo.pl
  • 8,812
  • 2
  • 28
  • 29
  • 1
    That seems like a way better idea yeah, thanks. On base of the documentation I can't really work out to use it though, could you provide an example? – HankSmackHood Feb 07 '11 at 18:33
  • The examples from the above link specify how it's supposed to be used. First time you call, you pass just the url to the parse method. After the data has been returned, you remember/store the etag and modified data for that RSS feed. For subsequent retrievals of the RSS feed, you pass the etag and modified data as part of the parse method call. – typo.pl Feb 07 '11 at 18:53
  • +1 to Last-modified -- don't move bits you don't need to move. – bgporter Feb 07 '11 at 19:00
3

You could store the state in a temporary file. E.g. write the title into a temporary file if there isn't a temporary file already, and next time read from the file and compare the read title with the new title.

Enno Shioji
  • 26,542
  • 13
  • 70
  • 109
  • 3
    I would suggest to also store the `updated` date in case the article was edited and contains different content. – Reiner Gerecke Feb 07 '11 at 18:12
  • Yup, this is the way to do it. Just write it out to a structure. You can consider using JSON or Pickle to write it out as a data structure that can be more easily read back in. – chmullig Feb 07 '11 at 19:47
0

There are a number of different approaches you could take. The easiest is probably to keep a unique key or hash of the most recent article acquired for each feed that your program deals with. This could be a combination of the article title & date, or even an md5sum of the entire contents of the article.

You could then write this data to an XML state file for your script, or use something like cpickle to save the data. Then, every time your program runs, only retrieve the feed articles that are the most recent (checking each against your latest article hash from the last run).

And, of course, don't forget to update your most recent article feed hash before your script exits.

If your script deals with multiple feeds, you'll have to store one of these "latest article hash" items per feed.

Mike Cialowicz
  • 9,892
  • 9
  • 47
  • 76