
I currently have this code in Python using feedparser:

import feedparser

RSS_FEEDS = {'cnn': 'http://rss.cnn.com/rss/edition.rss'}    

def get_news_test(publication="cnn"):
    feed = feedparser.parse(RSS_FEEDS[publication])
    articles_cnn = feed['entries']

    for article in articles_cnn:
        print(article)


get_news_test()

The above code returns all the current articles. Here is a sample of one of the articles it returned:


{'title': "China's internet shutdowns tactics are spreading worldwide",
 'title_detail': {'type': 'text/plain',
                  'language': None,
                  'base': 'http://rss.cnn.com/rss/edition.rss',
                  'value': "China's internet shutdowns tactics are spreading worldwide"},
 'summary': 'When Hong Kong police fired tear gas at peaceful pro-democracy protesters in 2014, the news moved swiftly through social media. Photos and videos of mostly student demonstrators being gassed helped fuel the outrage that ultimately drove hundreds of thousands of people into the streets.',
 'summary_detail': {'type': 'text/html',
                    'language': None,
                    'base': 'http://rss.cnn.com/rss/edition.rss',
                    'value': 'When Hong Kong police fired tear gas at peaceful pro-democracy protesters in 2014, the news moved swiftly through social media. Photos and videos of mostly student demonstrators being gassed helped fuel the outrage that ultimately drove hundreds of thousands of people into the streets.'},
 'links': [{'rel': 'alternate',
            'type': 'text/html',
            'href': 'https://www.cnn.com/2019/01/17/africa/internet-shutdown-zimbabwe-censorship-intl/index.html'}],
 'link': 'https://www.cnn.com/2019/01/17/africa/internet-shutdown-zimbabwe-censorship-intl/index.html',
 'id': 'https://www.cnn.com/2019/01/17/africa/internet-shutdown-zimbabwe-censorship-intl/index.html',
 'guidislink': False,
 'published': 'Fri, 18 Jan 2019 07:40:48 GMT',
 'published_parsed': time.struct_time(tm_year=2019, tm_mon=1, tm_mday=18, tm_hour=7, tm_min=40, tm_sec=48, tm_wday=4, tm_yday=18, tm_isdst=0),
 'media_content': [{'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-super-169.jpg', 'height': '619', 'width': '1100'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-large-11.jpg', 'height': '300', 'width': '300'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-vertical-large-gallery.jpg', 'height': '552', 'width': '414'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-video-synd-2.jpg', 'height': '480', 'width': '640'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-live-video.jpg', 'height': '324', 'width': '576'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-t1-main.jpg', 'height': '250', 'width': '250'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-vertical-gallery.jpg', 'height': '360', 'width': '270'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-story-body.jpg', 'height': '169', 'width': '300'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-t1-main.jpg', 'height': '250', 'width': '250'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-assign.jpg', 'height': '186', 'width': '248'},
                   {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-hp-video.jpg', 'height': '144', 'width': '256'}]}

Now, I know I can access portions of this entry, for instance the title, by calling:

print(article.title)

But I am stumped as to how to get the image data out of the feed.

  • The returned sample is not JSON but XML, and BeautifulSoup does work with an XML parser. Now to figure out how to access the data within the tags. – Obie Jan 20 '19 at 09:38

1 Answer


Each article entry has a list of assets in media_content. Each asset node contains the media type (I only saw 'image'), size, url, etc.

To simply list the media type and url for each asset, you can use the following:

import feedparser

feed = feedparser.parse("http://rss.cnn.com/rss/edition.rss")

for article in feed["entries"]:
    for media in article.media_content:
        print(f"medium: {media['medium']}")
        print(f"   url: {media['url']}")

Output:

medium: image
   url: https://cdn.cnn.com/cnnnext/dam/assets/190107112254-01-game-of-thrones-spain-castle-of-zafra-t1-main.jpg
medium: image
   url: https://cdn.cnn.com/cnnnext/dam/assets/190107112254-01-game-of-thrones-spain-castle-of-zafra-assign.jpg
medium: image
   url: https://cdn.cnn.com/cnnnext/dam/assets/190107112254-01-game-of-thrones-spain-castle-of-zafra-hp-video.jpg
...

If you want to request and save assets of type 'image', you can use requests:

import feedparser
import os
import requests

feed = feedparser.parse("http://rss.cnn.com/rss/edition.rss")

for article in feed["entries"]:
    for media in article.media_content:
        if media["medium"] == "image":
            img_data = requests.get(media["url"]).content
            with open(os.path.basename(media["url"]), "wb") as handler:
                handler.write(img_data)
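The loop above works, but a bare requests.get() has no timeout and will happily write a 404 error page to disk as if it were an image. A hedged refinement (the helper names image_filename and save_image are my own, not part of feedparser or requests):

```python
import os
import requests

def image_filename(url):
    # Derive a local filename from the URL path,
    # dropping any query string (".../a.jpg?w=100" -> "a.jpg").
    return os.path.basename(url.split("?", 1)[0])

def save_image(url, dest_dir=".", timeout=10):
    # Unlike the bare requests.get() above, this sets a timeout
    # and checks the HTTP status before writing anything to disk.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    path = os.path.join(dest_dir, image_filename(url))
    with open(path, "wb") as handler:
        handler.write(response.content)
    return path
```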