Python FeedParser format Reddit Nicely

Question

I am trying to create a program that prints out the first 5 jokes from /r/Jokes, but I am having some trouble formatting the output to look nice. I want it laid out like this:

Post Title: Post Content

For example, here is one of the jokes directly from the RSS feed:

<item>

    <title>What do you call a stack of pancakes?</title>

    <link>https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/</link>

    <guid isPermaLink="true">https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/</guid>

    <pubDate>Sun, 30 Aug 2015 03:18:00 +0000</pubDate>

    <description><!-- SC_OFF --><div class="md"><p>A balanced breakfast</p> </div><!-- SC_ON --> submitted by <a href="http://www.reddit.com/user/TheRealCreamytoast"> TheRealCreamytoast </a> <br/> <a href="http://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[link]</a> <a href="https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[2 comments]</a></description>

</item>

I am currently printing the title, followed by a colon and a space, and then the description. However, this prints all of the text, including the links, the author, and all the HTML tags. How would I get just the text inside the paragraph tags?

Thanks,

EDIT: This is my code:

import time

import feedparser

d = feedparser.parse('https://www.reddit.com/r/cleanjokes/.rss')
print("")
print("Pulling latest jokes from Reddit. https://www.reddit.com/r/cleanjokes")
print("")
time.sleep(0.8)
print("Displaying First 5 Jokes:")
print("")
# Print the first five entries as "title: description" (the description is raw HTML).
for entry in d['entries'][:5]:
    print(entry['title'] + ": " + entry['description'])

This just gets the first 5 entries. What I need to do is format the description string after the colon to only include the text inside the paragraph tags.

How do you get the title? (I want to see some code, it will help me help you (maybe)). — David Mašek, Aug 30 '15 at 17:55

score 2 · Accepted Answer · edited May 23 '17 at 11:52

Oren is right about using BeautifulSoup, but I'll try to provide a more complete answer.

d['entries'][0]['description'] returns HTML, which you need to parse. Beautiful Soup (bs4) is a great library for that.

You can install it using:

pip install beautifulsoup4

from bs4 import BeautifulSoup
# d is the parsed feed from the question; the description is an HTML string.
soup = BeautifulSoup(d['entries'][0]['description'], 'html.parser')
print(soup.div.get_text())

This gets the text from the div part of the entry.
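To produce the "Post Title: Post Content" layout asked for in the question, the same idea can be wrapped in a small helper. This is only a sketch: `format_joke` and `SAMPLE_DESCRIPTION` are illustrative names, not part of feedparser or bs4, and the sample HTML is the description from the question's RSS item.

```python
from bs4 import BeautifulSoup

# Description copied from the RSS item shown in the question.
SAMPLE_DESCRIPTION = (
    '<!-- SC_OFF --><div class="md"><p>A balanced breakfast</p> </div>'
    '<!-- SC_ON --> submitted by '
    '<a href="http://www.reddit.com/user/TheRealCreamytoast"> TheRealCreamytoast </a>'
    ' <br/> '
    '<a href="http://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[link]</a> '
    '<a href="https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[2 comments]</a>'
)


def format_joke(title, description_html):
    """Return 'title: text', using only the text inside the first <div>."""
    soup = BeautifulSoup(description_html, 'html.parser')
    # Assumption: if an entry has no <div>, fall back to the whole description.
    container = soup.div if soup.div is not None else soup
    # strip() removes the stray whitespace left around the paragraph text.
    return title + ": " + container.get_text().strip()


print(format_joke("What do you call a stack of pancakes?", SAMPLE_DESCRIPTION))
```

With the feed from the question, the helper would be called once per entry, for example `for entry in d['entries'][:5]: print(format_joke(entry['title'], entry['description']))`.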

answered Aug 30 '15 at 18:50

David Mašek

I need to get the text between the paragraph tags, so I changed div.get to p.get, but I get this error: AttributeError: 'NoneType' object has no attribute 'get_text' – FeaturedEpic Aug 30 '15 at 19:17
@FeaturedEpic `soup.p.get_text()` works fine for me. But the original gets you the text you want too. `<p>` is a subset of `<div>` (in this case!).
– David Mašek Aug 30 '15 at 19:22
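
The AttributeError quoted above is what Python raises when `soup.p` is None, which is what Beautiful Soup returns when the parsed HTML contains no <p> tag. Assuming some entries simply have no paragraph in their description (the thread does not confirm this), a defensive sketch could look like the following. `paragraph_text` and its `default` parameter are hypothetical names chosen for illustration.

```python
from bs4 import BeautifulSoup


def paragraph_text(description_html, default=""):
    """Return the text of every <p> in the HTML, or `default` if there is none."""
    soup = BeautifulSoup(description_html, 'html.parser')
    paragraphs = soup.find_all('p')
    if not paragraphs:
        # No <p> tag at all: soup.p would be None here, so avoid calling get_text on it.
        return default
    # Join multi-paragraph posts with newlines, trimming stray whitespace.
    return "\n".join(p.get_text().strip() for p in paragraphs)


print(paragraph_text('<div class="md"><p>A balanced breakfast</p> </div>'))
```

Using `find_all` rather than `soup.p` also keeps every paragraph of a multi-paragraph post instead of only the first.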

score 0 · Answer 2 · answered Aug 30 '15 at 18:31

You can use the Beautiful Soup package, which does exactly that.

Link to documentation

from bs4 import BeautifulSoup
# html_doc is the HTML string to clean, e.g. an entry's description.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())

answered Aug 30 '15 at 18:31

Oren Haliva
