  • I am trying to use feedparser to parse feed text that I download with the aiohttp library (asyncio).
  • The feed text is available HERE (it is a large document, so I am not pasting it here).
  • The documentation of the feedparser.parse method (HERE on GitHub) says that you should not pass an untrusted string directly to it.

So here is my code, where I try to wrap the text in a StringIO object:

import feedparser
import io

def read():
    import os
    name = os.path.join(os.getcwd(), 'extras', 'feeds',
                        'zycrypto.com_1596955288219')
    f = open(name, "r")
    text = f.read()
    f.close()
    return text

text = read()
parsed = feedparser.parse(io.StringIO(text))
for i in parsed.entries:
    print(i.summary, '\n')

However, I keep getting this error:

Traceback (most recent call last):
  File "./server/python/test.py", line 14, in <module>
    parsed = feedparser.parse(io.StringIO(text))
  File "/Users/zup/.local/share/virtualenvs/myapp_v3-kUGnE3_O/lib/python3.7/site-packages/feedparser.py", line 3922, in parse
    data, result['encoding'], error = convert_to_utf8(http_headers, data)
  File "/Users/zup/.local/share/virtualenvs/myapp_v3-kUGnE3_O/lib/python3.7/site-packages/feedparser.py", line 3574, in convert_to_utf8
    xml_encoding_match = RE_XML_PI_ENCODING.match(tempdata)
TypeError: cannot use a bytes pattern on a string-like object
  • How do I pass untrusted text to the Python feedparser.parse method so that the sanitizer works on it? My feed has script tags which are not being removed. Thank you in advance.
PirateApp
  • Why do you use `StringIO`, instead of passing `text`, or even `f`, to `feedparser.parse`? – mkrieger1 Aug 09 '20 at 08:41
  • @mkrieger1 Because the documentation says not to pass an untrusted string; I have highlighted that in the question. In my actual application I get the data from the aiohttp library, and I wanted to create a simple, reproducible case here. You will notice that with the test data I have included, the HTML is not sanitized at all: the script tags are still present in the final output. – PirateApp Aug 09 '20 at 08:48
  • @mkrieger1 So how do you sanitize the output? The current output is not sanitized at all; it still has those script tags when you loop over the entries. – PirateApp Aug 09 '20 at 08:52

2 Answers


Apparently feedparser.parse internally expects a bytes object where it is currently receiving a str. It passes that object to a regex matching function that uses a bytes pattern, and the pattern and the object being matched must have the same type.

You can get a bytes object by changing open(..., 'r') to open(..., 'rb') and using BytesIO instead of StringIO.
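The type mismatch can be reproduced with the standard library alone. The following is a minimal sketch, not feedparser's actual code: `XML_DECL_ENCODING` is a hypothetical stand-in for the internal `RE_XML_PI_ENCODING` pattern seen in the traceback, and the sample feed text is made up.

```python
import io
import re

# Hypothetical stand-in for feedparser's internal bytes pattern
# (the real RE_XML_PI_ENCODING may look different).
XML_DECL_ENCODING = re.compile(rb'<\?xml[^>]*encoding="([^"]+)"')

# Made-up sample text; in the question this comes from a file or aiohttp.
text = '<?xml version="1.0" encoding="UTF-8"?><rss></rss>'

# Reading from StringIO yields str, which a bytes pattern rejects.
error = None
try:
    XML_DECL_ENCODING.match(io.StringIO(text).read())
except TypeError as exc:
    error = str(exc)

# Encoding the text first yields bytes, which the same pattern accepts.
data = io.BytesIO(text.encode("utf-8")).read()
match = XML_DECL_ENCODING.match(data)
encoding = match.group(1)
```

Encoding with `text.encode("utf-8")` is equivalent to `bytes(text, 'utf-8')`; opening the file with `'rb'` skips the decode and re-encode round trip entirely.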

mkrieger1

As per @mkrieger1's suggestion, this would be the answer:

import feedparser
import io

def read():
    import os
    name = os.path.join(os.getcwd(), 'extras', 'feeds',
                        'zycrypto.com_1596955288219')
    # The context manager closes the file even if read() raises
    with open(name, "r") as f:
        return f.read()

text = read()
parsed = feedparser.parse(io.BytesIO(bytes(text, 'utf-8')))
for i in parsed.entries:
    print(i.summary, '\n')
PirateApp