Getting RSS feeds in exactly the same format

Question

Let me get straight to the point: I'm trying to build a reader web app like Google Reader, Feedly, etc., so I'm fetching RSS feeds in Python with the feedparser library. The problem is that not every website's RSS is in the same format: some feeds have no title, and some have no publish date. However, I found that digg.com/reader handles this well: Digg's reader gets feeds with a publish date and title too. How does that work? Any clue or a little help would be appreciated.

Mir Ilias · Answer 1 · 2015-04-24T19:41:00.770

0

You can use feedparser to find out whether a website serves Atom or RSS, and then deal with each type. If a feed has no publish date or title, you can extract them from the article page itself with another library such as goose-extractor or newspaper. For example, with newspaper:

from newspaper import Article
import feedparser

def extract_date(url):
    # Download the article page and let newspaper parse its metadata
    article = Article(url)
    article.download()
    article.parse()
    return article.publish_date  # may be None if the page exposes no date

d = feedparser.parse("http://feeds.feedburner.com/webnewsit")  # an Italian website
entry = d.entries[0]  # the first entry in the feed (usually the latest)
try:
    publish_date = entry.published
except AttributeError:
    # The feed has no publish date, so fall back to the article page
    publish_date = extract_date(entry.link)

Let me know if you still don't get the publication date.
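To get every feed into the same shape, one option is to map each entry onto a fixed set of keys and fill the gaps with None. The following is only a minimal sketch, assuming Python 3 and dict-like entries such as the items of feedparser's `d.entries`. The names `to_iso`, `normalize_entry` and `fallback_date` are made up for this illustration; `published_parsed` and `updated_parsed` are the parsed-date fields feedparser provides when a feed carries a date.

```python
import calendar
from datetime import datetime, timezone

def to_iso(struct_time):
    """Convert a UTC time.struct_time to an ISO 8601 string (None stays None)."""
    if struct_time is None:
        return None
    seconds = calendar.timegm(struct_time)  # interpret the struct as UTC
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()

def normalize_entry(entry, fallback_date=None):
    """Return title, link and published date under fixed keys.

    entry is any dict-like object, e.g. an item of feedparser's d.entries.
    fallback_date is a hypothetical hook: a function that takes the entry
    link and returns a datetime or None (for example extract_date above,
    assuming it returns a datetime).
    """
    # Prefer the publish date, then the update date, else nothing
    parsed = entry.get("published_parsed") or entry.get("updated_parsed")
    published = to_iso(parsed)
    link = entry.get("link")
    if published is None and fallback_date is not None and link:
        found = fallback_date(link)
        published = found.isoformat() if found is not None else None
    return {"title": entry.get("title"), "link": link, "published": published}

# Usage with feedparser (not run here):
#   d = feedparser.parse(url)
#   rows = [normalize_entry(e, fallback_date=extract_date) for e in d.entries]
```

Every row then has the same three keys, so the rest of the app only has to handle a None value rather than a missing field.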

edited Apr 24 '15 at 19:41

answered Apr 24 '15 at 15:30

Mir Ilias


Thanks for answering. For example, I'm trying to get the publish date from [this rss](http://www.news.mn/rss/news.rss), and there is no publish date in it. Digg, however, gets the publish date with no problem. – Zorig Apr 24 '15 at 15:57
Yes, you can do it with goose: replace article.title in the function with article.publish_date. – Mir Ilias Apr 24 '15 at 16:35
OK, I did exactly what you said with `article.publish_date`, and it gives me back nothing (without an error). So I gather there is a publish_date, but I don't get why it returns nothing. – Zorig Apr 24 '15 at 17:25
Ah, the date extractor is not implemented in goose yet. I'm going to edit my script so you can extract the **publish_date** using another good library. – Mir Ilias Apr 24 '15 at 19:31
I tried the updated code and still don't get the publication date. – Zorig Apr 25 '15 at 01:48
It doesn't work with news.mn. I think that's because they don't have a publication date attribute in their HTML page. But it will work on the majority of websites that don't have a **publish** attribute in their RSS. – Mir Ilias Apr 25 '15 at 02:11

score 0 · Answer 2 · answered Apr 24 '15 at 15:32

0

I've recently done some projects with the feedparser library, and it can be very frustrating because many RSS feeds differ. What works best for me is something like this:

# Get posts from hackaday.com
import feedparser
feed = feedparser.parse("http://www.hackaday.com/blog/feed/")  # get feed from hackaday
feed = feed['items']  # get items in feed (this is the best way I've found)
print(feed[0]['title'])  # print post title
print(feed[0]['summary'])  # print post summary
print(feed[0]['published'])  # print date published

These are just a few of the different "fields" that feedparser has. To find the one you want, just run these commands in the Python shell and see what fits your needs.
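Instead of probing fields one by one in the shell, you could also count which fields each entry actually carries. This is just a rough sketch, assuming the entries are dict-like (as feedparser's are); `field_coverage` and `missing_fields` are names invented for this example, not part of feedparser.

```python
from collections import Counter

def field_coverage(entries):
    """Count how many entries contain each field name."""
    counts = Counter()
    for entry in entries:
        counts.update(entry.keys())
    return counts

def missing_fields(entries, wanted=("title", "link", "published")):
    """Return the wanted fields that at least one entry lacks."""
    counts = field_coverage(entries)
    total = len(entries)
    return [name for name in wanted if counts[name] < total]

# Usage with feedparser (not run here):
#   entries = feedparser.parse(url).entries
#   print(missing_fields(entries))
```

If a field such as `published` shows up in the result, indexing it directly with `feed[0]['published']` will raise a KeyError for at least one entry, so `entry.get('published')` is the safer lookup.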

answered Apr 24 '15 at 15:32

Njord


Thanks for answering. OK, I tried your snippets and here is what I got: `print feed[0]['published'] Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python2.7/dist-packages/feedparser.py", line 356, in __getitem__ return dict.__getitem__(self, key) KeyError: 'published' ` – Zorig Apr 24 '15 at 15:53
I tried to get the publish date from [this rss](http://www.news.mn/rss/news.rss); apparently there is no **published** field in it. And I'm stuck :) – Zorig Apr 24 '15 at 15:58
Hmm, try running just the Python shell (`python` in a terminal if you're on \*nix; what OS are you on?). Then try the `import feedparser` etc., and try `print feed[0]` to see all the other items in the object you can access. I tried your RSS link and it seems there *is* no `published` field. You might want to try using Goose like @MirIlias suggested. Also, Python might complain about the non-ASCII characters. – Njord Apr 24 '15 at 16:14
Yes, I'm on a *nix-based OS. Thank you for the suggestion; I tried it on my side. – Zorig Apr 24 '15 at 17:13
