I am trying to write a program that prints the first 5 jokes from /r/Jokes, but I am having trouble formatting the output nicely. I want it laid out like this:

Post Title: Post Content

For example, here is one of the jokes directly from the RSS feed:

<item>

    <title>What do you call a stack of pancakes?</title>

    <link>https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/</link>

    <guid isPermaLink="true">https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/</guid>

    <pubDate>Sun, 30 Aug 2015 03:18:00 +0000</pubDate>

    <description><!-- SC_OFF --><div class="md"><p>A balanced breakfast</p> </div><!-- SC_ON --> submitted by <a href="http://www.reddit.com/user/TheRealCreamytoast"> TheRealCreamytoast </a> <br/> <a href="http://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[link]</a> <a href="https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[2 comments]</a></description>

</item>

I am currently printing the title, followed by a colon and a space, and then the description. However, this prints all of the text, including the links, the author, and all the HTML tags. How would I get just the text inside the paragraph tags?

Thanks,

EDIT: This is my code:

import time

import feedparser

d = feedparser.parse('https://www.reddit.com/r/cleanjokes/.rss')
print("")
print("Pulling latest jokes from Reddit. https://www.reddit.com/r/cleanjokes")
print("")
time.sleep(0.8)
print("Displaying First 5 Jokes:")
print("")
# Print the first 5 entries as "title: description" (the description is still raw HTML)
for entry in d['entries'][:5]:
    print(entry['title'] + ": " + entry['description'])

This just gets the first 5 entries. What I need to do is format the description string after the colon to only include the text inside the paragraph tags.

FeaturedEpic

2 Answers

Oren is right about using BeautifulSoup, but I'll try to provide a more complete answer.

d['entries'][0]['description'] returns HTML, which you need to parse. bs4 (Beautiful Soup) is a great library for that.

You can install it using:

pip install beautifulsoup4

from bs4 import BeautifulSoup 
soup = BeautifulSoup(d['entries'][0]['description'], 'html.parser') 
print(soup.div.get_text())

This gets the text from the div part of the entry.
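
To format several entries as "Post Title: Post Content", the same idea could be wrapped in a small helper. This is only a sketch: the name `joke_text` is hypothetical, the sample string is a shortened copy of the description from the question, and falling back to the whole text when there is no `<p>` tag is an assumption (it avoids calling `get_text()` on `None`).

```python
from bs4 import BeautifulSoup


def joke_text(description_html):
    """Return the text of every <p> in the description, joined by spaces."""
    soup = BeautifulSoup(description_html, 'html.parser')
    paragraphs = soup.find_all('p')
    if not paragraphs:
        # Assumption: some descriptions may have no <p>; fall back to all text.
        return soup.get_text(" ", strip=True)
    return " ".join(p.get_text(strip=True) for p in paragraphs)


# Shortened sample of the <description> content shown in the question.
sample = (
    '<!-- SC_OFF --><div class="md"><p>A balanced breakfast</p> </div>'
    '<!-- SC_ON --> submitted by <a href="http://www.reddit.com/user/'
    'TheRealCreamytoast"> TheRealCreamytoast </a>'
)
print("What do you call a stack of pancakes?: " + joke_text(sample))
```

With feedparser, the helper would be applied to each of `d['entries'][:5]`, printing `entry['title'] + ": " + joke_text(entry['description'])`.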

David Mašek
  • I need to get the text between the paragraph tags, so I changed div.get_text to p.get_text, but I get this error: AttributeError: 'NoneType' object has no attribute 'get_text' – FeaturedEpic Aug 30 '15 at 19:17
  • @FeaturedEpic `soup.p.get_text()` works fine for me. But the original gets you the text you want too. `<p>` is a subset of `<div>` (in this case!). – David Mašek Aug 30 '15 at 19:22

You can use the Beautiful Soup package, which does exactly that.

Link to documentation

from bs4 import BeautifulSoup 
soup = BeautifulSoup(html_doc, 'html.parser') 
print(soup.get_text())
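
Note that `get_text()` on the whole document also returns text outside the paragraph, such as the "submitted by" line. The sketch below compares the two on a shortened copy of the description from the question; the variable names `whole` and `only_p` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample of the <description> content shown in the question.
html_doc = (
    '<div class="md"><p>A balanced breakfast</p> </div> submitted by '
    '<a href="http://www.reddit.com/user/TheRealCreamytoast"> TheRealCreamytoast </a>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# All text in the document, including the author line.
whole = soup.get_text(" ", strip=True)
# Only the text of the first <p> tag.
only_p = soup.p.get_text()

print(whole)
print(only_p)
```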
Oren Haliva