
In my models.py, I have the following items:

from django.db import models

class Feed(models.Model):
    rss_url = models.URLField()
    updated = models.DateTimeField(blank=True, null=True)

    def save(self, *args, **kwargs):
        get_last_updated_date(self) # Get updated date
        super(Feed, self).save(*args, **kwargs)
        generate_content_from_feed(self) # Generating the content

class Content(models.Model):
    title = models.CharField(max_length=500)
    link = models.URLField()
    feed = models.ForeignKey(Feed, related_name='content_feed')

In the RSS 2.0 specification, lastBuildDate is optional for a channel (feed), which is why my updated field can be blank and null.

As you probably noticed, I call two functions in my save method.

The get_last_updated_date method uses the feed's url to see if a lastBuildDate exists or not, and if it does, it sets it to my updated field after correcting for timezones using pytz.

The generate_content_from_feed function uses the feed's URL to get the items from that channel and create Content objects. I need to call it after super(Feed, self).save(*args, **kwargs), since each Content object's feed field has to point to the feed that calls the function. So this cannot be done before saving the feed itself.

Now my problem is this: if the updated date matches the lastBuildDate from the RSS feed, I don't want to call generate_content_from_feed, since skipping it would be more efficient. However, I am setting the updated date before content generation, so if a lastBuildDate exists, it will always match the updated field, and no content will be generated. If I call get_last_updated_date after super(Feed, self).save(*args, **kwargs), then save has to be called again to store the date, causing an endless loop.

I was thinking about doing something like this:

if self.updated == None:
    generate content
elif self.updated < [[lastBuildDate]]:
    generate content
else:
    dont generate content

I am not sure how to implement this though, as I am having difficulties understanding the flow of the program.
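One way to picture the flow is to read the stored date before overwriting it, decide from that, and only then assign the new date. The sketch below is plain Python rather than a Django model: FeedSketch, should_generate and the last_build_date argument are hypothetical stand-ins, and treating a feed with no lastBuildDate as "always generate" is an assumption, not something the feed library requires.

```python
from datetime import datetime, timezone


def should_generate(stored_updated, last_build_date):
    """Return True when content should be (re)generated."""
    if last_build_date is None:
        # Assumption: a feed without lastBuildDate gives nothing to compare,
        # so content is always regenerated.
        return True
    if stored_updated is None:
        # The feed has never been processed before.
        return True
    # Regenerate only when the feed reports a newer build date.
    return stored_updated < last_build_date


class FeedSketch:
    """Plain-Python stand-in for the Feed model (not a Django model)."""

    def __init__(self, updated=None):
        self.updated = updated
        self.generated = 0  # counts how often content generation ran

    def save(self, last_build_date):
        # Decide BEFORE overwriting self.updated, while the old value exists.
        needs_content = should_generate(self.updated, last_build_date)
        self.updated = last_build_date
        # In the real model, super(Feed, self).save(*args, **kwargs) goes here.
        if needs_content:
            # Stands in for generate_content_from_feed(self).
            self.generated += 1
```

Because the decision is stored in a local variable before self.updated changes, save only runs once per call and no second save is needed.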

So can anyone help me with my problem? I know this is complicated, but hopefully I have been able to articulate it properly.

Also, I am not sure whether self.updated == None is a valid way to check that the date is blank, or whether the < operator is a valid way to compare two dates. If any of you can tell me, I would really appreciate it.
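For reference, here is a small standalone check of both comparisons using only the standard library. The dates are made-up sample values; a null DateTimeField comes back from Django as None, which is what the first check relies on.

```python
from datetime import datetime, timezone

stored = None  # what a null `updated` field looks like in Python
last_build = datetime(2016, 4, 28, 16, 0, tzinfo=timezone.utc)
earlier = datetime(2016, 4, 28, 14, 0, tzinfo=timezone.utc)

# `is None` is the idiomatic test for a missing value; `== None` also works here.
is_blank = stored is None

# Two timezone-aware datetimes compare chronologically with <.
is_older = earlier < last_build

# Mixing a naive and an aware datetime in an ordering comparison raises TypeError.
naive = datetime(2016, 4, 28, 14, 0)
try:
    naive < last_build
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

So the < comparison is fine as long as both sides are either timezone-aware or both naive.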

Thanks. Also, I am using feedparser for parsing, if anyone is interested.

darkhorse
  • just out of curiosity, does the RSS feed have an IDENTIFIER in the XML that separates one article from another? Like ARTICLE_ID or something? – arcee123 Apr 28 '16 at 14:44
  • Yeah, all the articles are generated as a list which can be iterated. Usually it's called items. – darkhorse Apr 28 '16 at 16:41
  • if the `items` can be serialized, you can perform a check against that list to determine if it's already there. – arcee123 Apr 28 '16 at 16:43
  • You are right. I could do that; in fact, I am doing something like that to make sure copies of the same article are not made. However, iterating through the list makes the process inefficient. I want to put another check before the iteration to save time. – darkhorse Apr 28 '16 at 16:51
  • It depends on the granularity of the check. If you want RSS-level granularity, then you can just use `Feed.updated` as a check. If you want article-level granularity, then you have to check each article. However, what you can do is move the work to the db level, by inserting all articles, then running a deduplication `delete` SQL script on the table. Risky, but effective. – arcee123 Apr 28 '16 at 16:56
  • The other option is to write the SQL statement so you can UPSERT on a condition. That will still do the checks, but at the db layer. – arcee123 Apr 28 '16 at 16:58
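If the per-article check stays in Python, it can at least be made cheap by loading the already-stored links into a set once, so each lookup is constant time rather than a scan of a list. This is only a sketch: new_items is a hypothetical helper, and each entry is assumed to be a mapping with a "link" key.

```python
def new_items(entries, existing_links):
    """Return entries whose link is not already stored, preserving order."""
    seen = set(existing_links)
    fresh = []
    for entry in entries:
        link = entry["link"]
        if link not in seen:
            seen.add(link)  # also guards against duplicates within one fetch
            fresh.append(entry)
    return fresh
```

The existing_links argument could come from a single query for the links already stored for that feed, fetched once before the loop.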
