RSS/Python - Parsing Single Image URL

Question

I'm learning to parse XML and RSS feeds correctly and have run into a small problem. I'm using feedparser in Python to parse a specific entry from an RSS feed, but I can't figure out how to extract just a single img src from the content section.

Here's what I have so far.

import dirFeedparser.feedparser as feedparser

feedurl = feedparser.parse('http://dustinheroin.chompblog.com/index.php?cat=22&feed=rss2')
statusupdate = feedurl.entries[0].content

print statusupdate

Now, when I print the content I get this:

[{'base': u'http://dustinheroin.chompblog.com/index.php?cat=22&feed=rss2', 'type': u'text/html', 'value': u'<p><a href="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg"><img alt="20120129-154945.jpg" class="alignnone size-full" src="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg" /></a></p>', 'language': None}]

What method would be best to get the IMG SRC from that? Any help is appreciated, thanks!

Is the value you have shown us `content` or `statusupdate`? – RanRag, Jan 29 '12 at 22:03

Gareth Latty · Answer 1 · 2012-01-31T01:21:52.620

You then want to use a separate HTML parser to parse the HTML and get the img's src attribute. You might want to look into Beautiful Soup.

e.g:

from BeautifulSoup import BeautifulSoup
import feedparser

feedurl = feedparser.parse('http://dustinheroin.chompblog.com/index.php?cat=22&feed=rss2')
statusupdate = feedurl.entries[0].content[0]

soup = BeautifulSoup(statusupdate["value"])
print(soup.find("img")["src"])

Note that this simply uses the first img tag it finds. If you need to be more selective, look at findAll, which returns every matching tag.
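For readers without BeautifulSoup installed, the same idea can be sketched with Python 3's standard-library html.parser. This is a swapped-in technique, not the BeautifulSoup API; the class name ImgSrcCollector is made up for illustration, and the sample HTML is the 'value' string printed in the question. It collects every img src rather than only the first:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collect the src of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Sample input: the HTML fragment shown in the question's output.
html = (
    '<p><a href="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg">'
    '<img alt="20120129-154945.jpg" class="alignnone size-full" '
    'src="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg" /></a></p>'
)

collector = ImgSrcCollector()
collector.feed(html)
collector.close()

# One entry per <img>; take collector.sources[0] for just the first image.
print(collector.sources)
```

With a real feed, the string passed to feed() would be the entry's content[0]["value"] instead of the hard-coded sample.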

Blender · Answer 2 · 2012-01-30T01:44:18.853

score 3

If you want to get a good HTML parser, try BeautifulSoup.

It's easy to parse with it:

from BeautifulSoup import BeautifulSoup

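soup = BeautifulSoup(statusupdate[0]['value'])  # content is a list; take its first item
url = soup.find('img')['src']  # not .src, which looks for a child <src> tag and yields None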

edited Jan 30 '12 at 01:44

answered Jan 29 '12 at 21:45

The element is a dictionary, so you need to access the attribute with ["src"] not .src - as in my answer. – Gareth Latty Jan 29 '12 at 21:48
BeautifulSoup works with either approach. The element *acts* like a dictionary, but is a `BeautifulSoup.Tag` object. – Blender Jan 30 '12 at 01:44
I tried this, and it didn't work for me. Accessing it as a dictionary-like object worked; accessing it as an attribute gave None. I just tried again with the same result. – Gareth Latty Jan 31 '12 at 01:23

RanRag · Answer 3 · 2012-01-30T00:55:34.013

score 3

You can also try lxml. With lxml you can use XPath expressions.

Here x is your statusupdate.

from lxml import etree
st = x[0]["value"]
doc = etree.fromstring(st)
value = doc.xpath("//img/@src") #xpath expr = //img/@src
"".join(value)

Output = 'http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg'
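If lxml is not available, a rough standard-library equivalent can be sketched with xml.etree.ElementTree (Python 3). It is a substitute for the lxml call above, not the same API: ElementTree supports only a limited XPath subset, so the sketch walks the tree with iter() instead of using //img/@src. It also assumes the fragment is well-formed XML, as the one in the question is; typical real-world HTML may raise a ParseError here. The variable names are illustrative:

```python
import xml.etree.ElementTree as ET

# Sample input: the HTML fragment from the question, which happens to be
# well-formed XML (the <img> tag is self-closed).
snippet = (
    '<p><a href="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg">'
    '<img alt="20120129-154945.jpg" class="alignnone size-full" '
    'src="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg" /></a></p>'
)

root = ET.fromstring(snippet)

# iter("img") visits every <img> element at any depth below the root.
srcs = [img.get("src") for img in root.iter("img")]

# Guard against fragments that contain no image at all.
first_src = srcs[0] if srcs else None
print(first_src)
```

Checking for an empty list before indexing avoids an IndexError on entries that contain no image.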

edited Jan 30 '12 at 00:55

answered Jan 29 '12 at 21:57

score 2 · Accepted Answer · answered Jan 29 '12 at 23:22

@Lattyware, there is a problem with how you set up soup.

@user1130601, you can check the following code:

#!/usr/bin/python

from BeautifulSoup import BeautifulSoup
import feedparser

feedurl = feedparser.parse('http://dustinheroin.chompblog.com/index.php?cat=22&feed=rss2')
statusupdate = feedurl.entries[0].content


soup = BeautifulSoup(statusupdate[0]['value'])
print(soup.find("img")["src"])

Output:

http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-171134.jpg

After making a couple of modifications to feedparser.py and to this, I managed to get exactly the results I wanted. Thanks! – user1130601, Jan 30 '12 at 03:48
