How do I pass raw untrusted text to feedparser.parse method in Python?

Question

I am trying to use feedparser to parse feed text that I download with the asyncio-based aiohttp library.
The feed text is available HERE (it is a large document, so I am not pasting it here).
The feedparser documentation (HERE on GitHub) says that you should not pass an untrusted string directly to feedparser.parse.

So here is my code, where I try to wrap the text in a StringIO object:

import feedparser
import io

def read():
    import os
    name = os.path.join(os.getcwd(), 'extras', 'feeds',
                        'zycrypto.com_1596955288219')
    f = open(name, "r")
    text = f.read()
    f.close()
    return text

text = read()
parsed = feedparser.parse(io.StringIO(text))
for i in parsed.entries:
    print(i.summary, '\n')

However, I keep getting this error:

Traceback (most recent call last):
  File "./server/python/test.py", line 14, in <module>
    parsed = feedparser.parse(io.StringIO(text))
  File "/Users/zup/.local/share/virtualenvs/myapp_v3-kUGnE3_O/lib/python3.7/site-packages/feedparser.py", line 3922, in parse
    data, result['encoding'], error = convert_to_utf8(http_headers, data)
  File "/Users/zup/.local/share/virtualenvs/myapp_v3-kUGnE3_O/lib/python3.7/site-packages/feedparser.py", line 3574, in convert_to_utf8
    xml_encoding_match = RE_XML_PI_ENCODING.match(tempdata)
TypeError: cannot use a bytes pattern on a string-like object

How do I pass untrusted text to feedparser.parse so that the sanitizer works on it? My feed contains script tags, and they have not been removed. Thank you in advance.

Why do you use `StringIO` instead of passing `text`, or even `f`, to `feedparser.parse`? – mkrieger1, Aug 09 '20 at 08:41
@mkrieger1 Because the documentation says not to pass an untrusted string; I have highlighted that in the question. In my actual application I get the data from the asyncio aiohttp library, and I wanted to create a simple, reproducible case here. You will notice that with the test data I have included, the HTML is not sanitized at all: the script tags are still present in the final output. – PirateApp, Aug 09 '20 at 08:48
@mkrieger1 So how do you sanitize the output? The current output is not sanitized at all; it still has those script tags when you loop over the entries. – PirateApp, Aug 09 '20 at 08:52

score 1 · Accepted Answer · answered Aug 09 '20 at 09:01

Apparently feedparser.parse internally expects a bytes object where it is currently receiving a string, because it passes that object to a regex matching function where it uses a bytes pattern, and the object to match and the pattern need to have the same type.
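The type mismatch can be reproduced without feedparser at all. The following is a minimal sketch: `BYTES_PATTERN` and `try_match` are hypothetical names, and the pattern is only a simplified stand-in for the one feedparser uses internally, which I have not copied here.

```python
import re

# Hypothetical, simplified stand-in for a bytes regex such as the one
# feedparser applies to the start of the document.
BYTES_PATTERN = re.compile(rb'<\?xml[^>]*encoding="([^"]+)"')


def try_match(data):
    """Return the declared encoding, None if absent, or the TypeError text."""
    try:
        match = BYTES_PATTERN.match(data)
        return match.group(1) if match else None
    except TypeError as exc:
        # Raised when a bytes pattern is applied to a str object.
        return str(exc)


header = '<?xml version="1.0" encoding="UTF-8"?>'
print(try_match(header))                   # str input: prints the TypeError message
print(try_match(header.encode("utf-8")))   # bytes input: the pattern matches
```

Passing the `str` gives the same kind of error as in the traceback, while the encoded `bytes` version matches normally.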

You can get a bytes object by changing open(..., 'r') to open(..., 'rb') and using BytesIO instead of StringIO.
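As a rough sketch of that suggestion, the helper below wraps either `str` or `bytes` in a `BytesIO`. The name `as_byte_stream` and the UTF-8 default are my own assumptions, not part of feedparser; if the feed declares a different encoding, it is safer to keep the original bytes than to re-encode decoded text.

```python
import io


def as_byte_stream(data, encoding="utf-8"):
    """Wrap feed data in a BytesIO; str input is encoded first.

    Hypothetical helper for illustration. The default encoding is an
    assumption and should match how the text was originally decoded.
    """
    if isinstance(data, str):
        data = data.encode(encoding)
    return io.BytesIO(data)


sample = '<rss version="2.0"><channel><title>Demo</title></channel></rss>'
stream = as_byte_stream(sample)
print(type(stream.read()))  # the stream yields bytes, not str

# With feedparser installed, the stream would then be passed on like this:
#     parsed = feedparser.parse(as_byte_stream(text))
```

With aiohttp, the re-encoding step can probably be skipped altogether, since `await response.read()` returns the raw body as `bytes`, which can go straight into `io.BytesIO`.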

mkrieger1


This works, but it still does not sanitize. As per SO rules, I'll have to ask a separate question :( – PirateApp Aug 09 '20 at 09:43

score 0 · Answer 2 · answered Aug 09 '20 at 09:44

As per @mkrieger1's answer, this is the code that works:

import feedparser
import io

def read():
    import os
    name = os.path.join(os.getcwd(), 'extras', 'feeds',
                        'zycrypto.com_1596955288219')
    f = open(name, "r")
    text = f.read()
    f.close()
    return text

text = read()
parsed = feedparser.parse(io.BytesIO(bytes(text, 'utf-8')))
for i in parsed.entries:
    print(i.summary, '\n')
