  • I am downloading a feed using aiohttp and asyncio.
  • Feedparser is supposed to sanitize input text with its _HTMLSanitizer class so that only specific tags are accepted.

The sanitizer has no effect on the parsed entries. Any suggestions?

import aiohttp
import asyncio
import feedparser

feedparser._HTMLSanitizer.acceptable_elements = ['a', 'img']
feedparser._HTMLSanitizer.acceptable_css_keywords = []
feedparser._HTMLSanitizer.acceptable_css_properties = []
feedparser._HTMLSanitizer.acceptable_svg_properties = []
feedparser._HTMLSanitizer.acceptable_attributes = ['href', 'src']

async def load_feed():
  async with aiohttp.ClientSession() as session:
    async with session.get('http://zycrypto.com/feed') as response:
      text = await response.text()
      parsed = feedparser.parse(text)
      for entry in parsed.entries:
        print(entry.title)
        print(entry.summary, '\n\n')


asyncio.run(load_feed())

Current Output

Pantera Capital CEO: XRP Is One Of The Few Cryptocurrencies That’ll Be Really Important Ten Years From Now
<div><img width="696" height="406" src="https://zycrypto.com/wp-content/uploads/2020/08/Pantera-Capital-CEO_-XRP-Is-One-Of-The-Few-Cryptocurrencies-That’ll-Be-Really-Important-Ten-Years-From-Now-1024x597.jpg" class="attachment-large size-large wp-post-image" alt="Pantera Capital CEO: XRP Is One Of The Few Cryptocurrencies That’ll Be Really Important Ten Years From Now" style="margin-bottom: 15px;" srcset="https://zycrypto.com/wp-content/uploads/2020/08/Pantera-Capital-CEO_-XRP-Is-One-Of-The-Few-Cryptocurrencies-That’ll-Be-Really-Important-Ten-Years-From-Now-1024x597.jpg 1024w, https://zycrypto.com/wp-content/uploads/2020/08/Pantera-Capital-CEO_-XRP-Is-One-Of-The-Few-Cryptocurrencies-That’ll-Be-Really-Important-Ten-Years-From-Now-300x175.jpg 300w, https://zycrypto.com/wp-content/uploads/2020/08/Pantera-Capital-CEO_-XRP-Is-One-Of-The-Few-Cryptocurrencies-That’ll-Be-Really-Important-Ten-Years-From-Now-768x448.jpg 768w, https://zycrypto.com/wp-content/uploads/2020/08/Pantera-Capital-CEO_-XRP-Is-One-Of-The-Few-Cryptocurrencies-That’ll-Be-Really-Important-Ten-Years-From-Now.jpg 1200w" sizes="(max-width: 696px) 100vw, 696px" /></div>As we all know, the future is quite unpredictable. The only thing that remains crystal clear is the fact that only the strongest will survive in the long-term. In the crypto space, Ripple’s XRP will be one of the very few cryptocurrencies that will stand the test of time. This is according to Dan Morehead, [&#8230;] 


Erik Voorhees: Bitcoin And Stablecoins Will Eventually Take The Place Of Gold And Bank Notes
<div><img width="696" height="406" src="https://zycrypto.com/wp-content/uploads/2020/04/Robert-Kiyosaki_-‘Save-Gold-Silver-Bitcoin’-Instead-of-the-Dollar-1024x597.jpg" class="attachment-large size-large wp-post-image" alt="Erik Voorhees: Bitcoin And Stablecoins Will Eventually Take The Place Of Gold And Bank Notes" style="margin-bottom: 15px;" srcset="https://zycrypto.com/wp-content/uploads/2020/04/Robert-Kiyosaki_-‘Save-Gold-Silver-Bitcoin’-Instead-of-the-Dollar-1024x597.jpg 1024w, https://zycrypto.com/wp-content/uploads/2020/04/Robert-Kiyosaki_-‘Save-Gold-Silver-Bitcoin’-Instead-of-the-Dollar-300x175.jpg 300w, https://zycrypto.com/wp-content/uploads/2020/04/Robert-Kiyosaki_-‘Save-Gold-Silver-Bitcoin’-Instead-of-the-Dollar-768x448.jpg 768w, https://zycrypto.com/wp-content/uploads/2020/04/Robert-Kiyosaki_-‘Save-Gold-Silver-Bitcoin’-Instead-of-the-Dollar-610x356.jpg 610w, https://zycrypto.com/wp-content/uploads/2020/04/Robert-Kiyosaki_-‘Save-Gold-Silver-Bitcoin’-Instead-of-the-Dollar.jpg 1200w" sizes="(max-width: 696px) 100vw, 696px" /></div>There have been talks of how Bitcoin could soon take over the financial world in a grand measure, and this idea seems to be winning over more people as time goes. So far, Bitcoin has been proven to be a pretty serious contestant against Gold when it comes to the future of reserve currency. Many [&#8230;] 

None of the other tags have been removed. Can someone please explain what I am doing wrong?

Expected output

The output should contain only a and img tags; all other tags should have been removed.
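To make the expected output concrete, here is a minimal sketch of the filtering I have in mind, written with the standard library's html.parser instead of feedparser. It is not feedparser's sanitizer, and the names AllowlistFilter and strip_to_allowlist are hypothetical, introduced only for this illustration. The input string is a shortened, made-up stand-in for an entry summary.

```python
from html import escape
from html.parser import HTMLParser

# Allowlists mirroring the settings in the question (assumed intent).
ALLOWED_TAGS = {"a", "img"}
ALLOWED_ATTRS = {"href", "src"}
VOID_TAGS = {"img"}  # elements that never get a closing tag


class AllowlistFilter(HTMLParser):
    """Hypothetical helper: keeps allowlisted tags/attributes, drops other markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _open_tag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still kept
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <img ... />
        self._open_tag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so decoded entities cannot turn into markup.
        self.parts.append(escape(data, quote=False))


def strip_to_allowlist(html_text):
    parser = AllowlistFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<div><img width="696" src="https://example.com/a.jpg" class="x" /></div>'
    'Some <b>text</b> <a href="https://example.com" style="c">link</a>'
)
print(strip_to_allowlist(sample))
```

For the sample above, the div and b tags and the width, class, and style attributes are dropped, while the text, the img src, and the a href survive. This sketch only illustrates the shape of the result; it keeps the text inside dropped elements (including script or style) and should not be treated as a security-grade sanitizer.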

PirateApp
  • Your question is unclear. Please give us an example of the input and the expected output. I would also suggest removing asyncio from your question because it is not related to the problem. Your question is about feedparser parsing a string (XML content) downloaded or read with another package/library. – buhtz Aug 10 '20 at 20:51
  • @buhtz I have updated the question to include the output. – PirateApp Aug 11 '20 at 07:12
  • No, you did not include the output. Please include the expected output itself instead of just describing it. And by the way: please wrap the lines in your code blocks at 72 characters. Even your "Current Output" is unreadable because readers have to scroll sideways. – buhtz Aug 11 '20 at 15:42
  • This was reported in [feedparser issue 222](https://github.com/kurtmckee/feedparser/issues/222) and is now fixed in the develop branch. The fix will be released in feedparser 6.0.0. In the future, please report bugs on GitHub. Thanks! – Kurt McKee Aug 30 '20 at 22:57
