Force feedparser to sanitize on all content types

Question

For a project, I want to use feedparser. Basically, I got it working.

The sanitization section of the documentation says that not all content types are sanitized. How can I force feedparser to sanitize all content types?

Are you sure you want to? Feedparser is pretty strict on what it allows. It whitelists, not blacklists, to be sure that only safe things are allowed. What are you worried will get through? — fitzgeraldsteele, Feb 20 '12 at 04:57
The documentation says that content of type 'text/plain' is not sanitized, so I have to do it on my own if I want safe content. But it would be nice if feedparser could do this. — Martin, Feb 20 '12 at 11:06

score 1 · Accepted Answer · edited Feb 28 '12 at 21:30

I think the feedparser doc page you referenced gives good advice:

*It is recommended that you check the content type in e.g. entries[i].summary_detail.type. If it is text/plain then it has not been sanitized (and you should perform HTML escaping before rendering the content).*

import html
import feedparser

d = feedparser.parse('http://rss.slashdot.org/Slashdot/slashdot')

# Iterate through the entries. Only text/html is sanitized by feedparser,
# so HTML-escape any summary of another type before printing it.
for entry in d.entries:
    if entry.summary_detail.type != 'text/html':
        # html.escape replaces cgi.escape, which was removed in Python 3.8
        print(html.escape(entry.summary))
    else:
        print(entry.summary)

Of course, there are dozens of ways you can iterate through the entries depending on what you want to do with them once they are clean.
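One of those ways, as a rough sketch: wrap the check in a small helper so that every field goes through the same rule. The name `safe_value` and the `SANITIZED_TYPES` set are made up for illustration and are not part of feedparser. The helper assumes a detail object that exposes `type` and `value` keys, as `entry.summary_detail` does, and it assumes that only `text/html` has already been sanitized.

```python
import html

# Assumption: only 'text/html' values have already been sanitized by
# feedparser; every other content type is escaped here.
SANITIZED_TYPES = {"text/html"}


def safe_value(detail):
    """Return display-safe text for a feedparser-style detail mapping.

    Hypothetical helper, not feedparser API. `detail` is expected to
    carry 'type' and 'value' keys, like entry.summary_detail.
    """
    value = detail.get("value", "")
    if detail.get("type") in SANITIZED_TYPES:
        return value  # trusted as already sanitized
    # Escape &, <, > and quotes for anything else (or a missing type).
    return html.escape(value)


if __name__ == "__main__":
    plain = {"type": "text/plain", "value": "<b>bold</b> & more"}
    print(safe_value(plain))
```

With a parsed feed, this would be called as `safe_value(entry.summary_detail)`. A missing or unknown type is escaped as well, which errs on the side of showing markup as literal text instead of rendering it.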

To make it even safer, I had a look at the feedparser code. It seems that only text/html is really sanitized, so I test whether the type is not text/html and then sanitize it myself. Apart from that detail, your answer is totally correct. — Martin, Feb 24 '12 at 08:43
