For a project, I want to use feedparser. Basically, I have it working.

The sanitization section of the documentation says that not all content types are sanitized. How can I force feedparser to sanitize all content types?

Lightness Races in Orbit
Martin
  • Are you sure you want to? Feedparser is pretty strict on what it allows. It whitelists, not blacklists, to be sure that only safe things are allowed. What are you worried will get through? – fitzgeraldsteele Feb 20 '12 at 04:57
  • The documentation says that content of type 'text/plain' is not sanitized, so I have to do it on my own if I want safe content. But it would be nice if feedparser could do this. – Martin Feb 20 '12 at 11:06

1 Answer

I think the feedparser doc page you referenced gives good advice:

*It is recommended that you check the content type in e.g. entries[i].summary_detail.type. If it is text/plain then it has not been sanitized (and you should perform HTML escaping before rendering the content).*

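import html  # cgi.escape was removed in Python 3.8; html.escape replaces it
import feedparser

d = feedparser.parse('http://rss.slashdot.org/Slashdot/slashdot')

# Iterate through the entries. Anything not marked text/html is treated as
# unsanitized here, so it is HTML-escaped before being rendered.
for entry in d.entries:
    if entry.summary_detail.type != 'text/html':
        print(html.escape(entry.summary))
    else:
        # text/html summaries have already been sanitized by feedparser
        print(entry.summary)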

Of course, there are dozens of ways you can iterate through the entries depending on what you want to do with them once they are clean.
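As one possible variation, the same check can be wrapped in a small helper and applied to the title, summary, and content of an entry in one pass. This is only a sketch: `safe_html` and `safe_entry_fields` are hypothetical names, not part of feedparser, and the code assumes each detail is a dict-like object with `type` and `value` keys, as feedparser's `*_detail` dictionaries are. It deliberately trusts only text/html, so any other type is escaped, which may be stricter than necessary.

```python
import html


def safe_html(detail):
    """Return render-safe markup for a feedparser-style detail dict.

    Hypothetical helper: only text/html is assumed to have been sanitized
    by feedparser; every other content type is HTML-escaped.
    """
    value = detail.get('value', '')
    if detail.get('type') == 'text/html':
        return value
    return html.escape(value)


def safe_entry_fields(entry):
    """Collect the title, summary, and content of one entry in safe form."""
    fields = {}
    for key in ('title_detail', 'summary_detail'):
        detail = entry.get(key)
        if detail is not None:
            # 'title_detail' -> 'title', 'summary_detail' -> 'summary'
            fields[key.replace('_detail', '')] = safe_html(detail)
    # 'content' is a list of detail dicts; it may be missing entirely
    fields['content'] = [safe_html(c) for c in entry.get('content', [])]
    return fields
```

With a parsed feed `d`, something like `[safe_entry_fields(e) for e in d.entries]` would then give one dictionary of escaped fields per entry.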

FogleBird
fitzgeraldsteele
  • To make it even safer, I had a look at the feedparser code. It seems that only text/html is really sanitized, so I test whether the type is not text/html and then sanitize it myself. Apart from that detail, your answer is totally correct. – Martin Feb 24 '12 at 08:43