5

The RSS file is shown below. I want to get the content of the media:group section. I checked the feedparser documentation, but it does not seem to mention this. How can I do it? Any help is appreciated.

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:ymusic="http://music.yahoo.com/rss/1.0/ymusic/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel>
        <title>XYZ InfoX:  Special hello  </title>
        <link>http://www1.XYZInfoX.com/learninghello/home</link>
        <description>hello</description>
        <language>en</language>         <copyright />
        <pubDate>Wed, 17 Mar 2010 08:50:06 GMT</pubDate>
        <dc:creator />
        <dc:date>2010-03-17T08:50:06Z</dc:date>
        <dc:language>en</dc:language> <dc:rights />
        <image>
            <title>Voice of America</title>
            <link>http://www1.XYZInfoX.com/learninghello</link>
            <url>http://media.XYZInfoX.com/designimages/XYZRSSIcon.gif</url>
        </image>

        <item>
                <title>Who Were the Deadliest Gunmen of the Wild West?</title>
                <link>http://www1.XYZInfoX.com/learninghello/home/Deadliest-Gunmen-of-the-Wild-West-87826807.html</link>
                <description> The story of two of them: "Killin'" Jim Miller was an outlaw, "Texas" John Slaughter was a lawman | EXPLORATIONS  </description>
                <pubDate>Wed, 17 Mar 2010 00:38:48 GMT</pubDate>
                <guid isPermaLink="false">87826807</guid>
                <dc:creator></dc:creator>
                <dc:date>2010-03-17T00:38:48Z</dc:date>                                                                                                                                     
                <media:group>
                    <media:content url="http://media.XYZInfoX.com/images/archives_peace_comm_480_16mar_se.jpg" medium="image" isDefault="true" height="300" width="480" />
                    <media:content url="http://media.XYZInfoX.com/images/archives_peace_comm_230_16mar_se_edited-1.jpg" medium="image" isDefault="false" height="230" width="230" />
                    <media:content url="http://media.XYZInfoX.com/images/tex_trans_lawmans_230_16mar10_se.jpg" medium="image" isDefault="false" height="230" width="230" />
                    <media:content url="http://www.XYZInfoX.com/MediaAssets2/learninghello/dalet/se-exp-outlaws-part2-17mar2010.Mp3" type="audio/mpeg" medium="audio" isDefault="false" />
                </media:group>
     </item>
Mingo
  • I can tell you how to extract data from an XML document, but I'm not familiar with `feedparser` or the way it presents a feed. If you re-phrase the question in a `I have this input data`, `I want this output data`, it would be easier to help you. – MattH Mar 17 '10 at 12:32
  • Thanks, but I just want to make the code sample. I understand it can be parsed as XML or with a regular expression. – Mingo Mar 17 '10 at 12:56
  • C: I do not understand what you mean by `Thanks, but I just want to make the code sample`. It makes even less sense in the context of a reply to "please specify an example of your input and desired output data". – MattH Mar 17 '10 at 13:50
  • @MattH: Sorry for my spelling error. I meant that I want my code to be simple. It seems that feedparser does not parse media:group, so for now I am doing the job with a regex. Thanks for your kind comment. – Mingo Mar 17 '10 at 14:42
  • C: You want your code to be simple, so you're parsing XML with a regexp. I didn't realise until now that it was possible, but you are making both more and less sense at the same time! :) Good Luck. – MattH Mar 17 '10 at 14:54

2 Answers

7

feedparser 4.1, as available from PyPI, has this bug.

The solution for me was to get the latest feedparser.py (4.2 pre) from the repository:

svn checkout http://feedparser.googlecode.com/svn/trunk/ feedparser-readonly
cd feedparser-readonly
python setup.py install

Now you can access all MRSS items:

>>> import feedparser  # the new version!
>>> d = feedparser.parse(MY_XML_URL)
>>> for content in d.entries[0].media_content: print(content['url'])

That should do the job for you.
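If upgrading feedparser is not an option, the same data can also be read with the standard library instead of a regex. The following is only a minimal sketch, not part of feedparser: `SAMPLE_RSS` is a trimmed-down, hypothetical stand-in for the feed in the question, and `media_group_contents` is an illustrative helper name. It assumes the feed declares the `media` prefix with the namespace URI shown in the question.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:media declaration in the question's feed.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}

# Hypothetical, shortened sample based on the feed in the question.
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0"><channel>
  <item>
    <title>Who Were the Deadliest Gunmen of the Wild West?</title>
    <media:group>
      <media:content url="http://media.XYZInfoX.com/images/archives_peace_comm_480_16mar_se.jpg" medium="image" isDefault="true" height="300" width="480" />
      <media:content url="http://www.XYZInfoX.com/MediaAssets2/learninghello/dalet/se-exp-outlaws-part2-17mar2010.Mp3" type="audio/mpeg" medium="audio" isDefault="false" />
    </media:group>
  </item>
</channel></rss>"""


def media_group_contents(rss_text):
    """Return one list per <item>; each list holds the attribute
    dicts of the media:content elements inside its media:group."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(rss_text.encode("utf-8"))
    result = []
    for item in root.iter("item"):
        contents = item.findall("media:group/media:content", MEDIA_NS)
        result.append([dict(c.attrib) for c in contents])
    return result


# Print the medium and URL of every media:content in the first item.
for attrs in media_group_contents(SAMPLE_RSS)[0]:
    print(attrs["medium"], attrs["url"])
```

Each entry is a plain dict of the element's attributes, so optional ones such as `type` or `height` should be read with `attrs.get(...)`.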

captnswing
-1

You can parse the feed using

feed = feedparser.parse(your_feeds_url)

and then access your XML elements using either Python's attribute access or dictionary-like access on feed and its subelements. The former method won't work for an element name like media:content, so use the latter method.

The rest should become clear after studying the examples at http://www.feedparser.org

Johannes Charra
  • I printed the content of the feed, and it does not contain the media:content information. I think feedparser skips it when parsing. This is the RSS URL: http://www1.voanews.com/templates/Articles.rss?sectionPath=/learningenglish/home – Mingo Mar 17 '10 at 14:48