
I have a Flask app that retrieves an XML document from a URL and processes it. I'm using requests_cache with redis to avoid extra requests, and ElementTree.iterparse to iterate over the streamed content. Here's an example of my code (the same result occurs in both the development server and the interactive interpreter):

>>> import requests, requests_cache
>>> import xml.etree.ElementTree as ET
>>> requests_cache.install_cache('test', backend='redis', expire_after=300)
>>> url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
>>> response = requests.get(url, stream=True)
>>> for event, node in ET.iterparse(response.raw):
...     print(node.tag)

Running the above code once throws a ParseError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1301, in __next__
    self._root = self._parser._close_and_return_root()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1236, in _close_and_return_root
    root = self._parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

However, running the exact same code again before the cache expires actually prints the expected result! How come the XML parsing fails the first time only, and how can I fix it?


Edit: If it's helpful, I've noticed that running the same code without the cache results in a different ParseError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1289, in __next__
    for event in self._parser.read_events():
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1272, in read_events
    raise event
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1230, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
Noah
  • Interesting: if you do `ET.iterparse(StringIO(response.text))` instead, it works every time (see the sketch after these comments), though I guess you have a reason to use `.raw` in this case. – alecxe Jul 11 '16 at 03:44
  • @alecxe Hm, that seems to imply that the problem is that ET is trying to parse a document that hasn't fully loaded... I'm pretty sure it's possible to do this, though: http://stackoverflow.com/questions/18308529/python-requests-package-handling-xml-response – Noah Jul 11 '16 at 04:09
  • @alecxe On the first run the caching consumes the data; without caching you are passing gzipped data, which etree cannot parse. – Padraic Cunningham Jul 11 '16 at 17:55
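
For reference, the in-memory approach from the first comment would look something like this (a quick sketch of what alecxe describes; it works on every run because `response.text` is fully downloaded and decompressed by requests, at the cost of holding the whole document in memory):

import requests, requests_cache
from io import StringIO
import xml.etree.ElementTree as ET

requests_cache.install_cache('test', backend='redis', expire_after=300)
url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'

# .text is the fully read, decoded body, so the parser sees plain XML
# whether the response came from the network or from the cache
response = requests.get(url)
for event, node in ET.iterparse(StringIO(response.text)):
    print(node.tag)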

1 Answer


I can tell you why it fails in both scenarios. For the latter (no cache), it is because the data is still gzipped the first time you read raw; whatever the cache does internally, when you read the second time the data comes back decompressed:

If you print the lines:

for line in response.raw:
    print(line)

You see:

�=V���H�������mqn˫+i�������UȣT����F,�-§�ߓ+���G�o~�����7�C�M{�3D����೺C����ݣ�i�����SD�݌.N�&�HF�I�֎�9���J�ķ����s�*H�@$p�o���Ĕ�Y��v�����8}I,��`�cy�����gE�� �!��B�  &|(^���jo�?�^,���H���^~p��a���׫��j�

����a۱Yk<qba�RN6�����l�/�W����{/��߸�G

X�LxH��哫 .���g(�MQ ����Y�q��:&��>s�M�d4�v|��ܓ��k��A17�

And then decompressing:

import zlib
def decomp(raw):
    # wbits = MAX_WBITS | 16 makes zlib expect a gzip header and trailer
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    for line in raw:
        yield decompressor.decompress(line)
    yield decompressor.flush()  # emit anything still buffered at end of stream

for line in decomp(response.raw):
    print(line)

You see the decompression works:

<?xml version="1.0" encoding="UTF-8"?>
<myanimelist><myinfo><user_id>4731313</user_id><user_name>Doomcat55</user_name><user_watching>3</user_watching><user_completed>120</user_completed><user_onhold>8</user_onhold><user_dropped>41</user_dropped><user_plantowatch>2</user_plantowatch><user_days_spent_watching>27.83</user_days_spent_watching></myinfo><anime><series_animedb_id>64</series_animedb_id><series_title>Rozen Maiden</series_title><series_synonyms>; Rozen Maiden</series_synonyms><series_type>1</series_type><series_episodes>12</series_episodes><series_status>2</series_status><series_start>2004-10-08</series_start><series_end>2004-12-24</series_end><series_image>http://cdn.myanimelist.net/images/anime/2/15728.jpg</series_image>
..................................

Now, once the response has been cached, if we read a few bytes:

response.raw.read(39)

You see we get decompressed data:

<?xml version="1.0" encoding="UTF-8"?>

Without caching, passing response.raw to iterparse gives:

    raise e
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

That is because the parser cannot handle gzipped data.
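
For the no-cache case, a minimal workaround (a sketch, not from the original answer) is to ask urllib3, which backs `response.raw`, to decompress the body as it is read; `read()` falls back to the `decode_content` attribute when the argument isn't passed explicitly:

import requests
import xml.etree.ElementTree as ET

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
response = requests.get(url, stream=True)

# with no explicit decode_content argument, urllib3's read() falls back
# to this attribute, so the parser receives plain XML rather than gzip bytes
response.raw.decode_content = True

for event, node in ET.iterparse(response.raw):
    print(node.tag)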

Also, using the following on the first run with caching enabled:

for line in response.raw:
    print(line)

Gives me:

    ValueError: I/O operation on closed file.

That is because the caching has already consumed the data, so there is in fact nothing left to read. I'm not sure using raw with caching is actually possible, as the data is consumed and the file handle is closed by the time you get to parse it.

If you use lxml.fromstringlist:

import requests, requests_cache
import lxml.etree as et
requests_cache.install_cache()

def lazy(resp):
    # yield the body piece by piece; on a cache hit the pieces are
    # replayed from the cache rather than fetched from the network
    for chunk in resp.iter_content(chunk_size=1024):  # the default chunk_size is 1 byte
        yield chunk

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'

response = requests.get(url, stream=True)

# fromstringlist() feeds the fragments to the parser one at a time and
# returns the root element; iterating the root yields its children
for node in et.fromstringlist(lazy(response)):
    print(node)
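
Note that although fromstringlist consumes the generator lazily, it still builds the entire tree in memory before returning the root, so this sidesteps the gzip and cache-consumption problems but not the memory cost of a full parse.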
Padraic Cunningham
  • Thanks! So on the first request, do you think it's possible to stream the response and decompress/parse it without loading the entire response at once? – Noah Jul 11 '16 at 20:45
  • You want to cache and iterate using the stream? – Padraic Cunningham Jul 11 '16 at 20:50
  • Yes. The XML can get quite large, so I'm trying to avoid having it all in memory at once. – Noah Jul 11 '16 at 20:53
  • Yeah, I'm actually trying it out atm. – Noah Jul 11 '16 at 21:09
  • @Noah, I added an example using iter_content. Unless you write the data yourself you won't be able to use raw; the issue is that behind the scenes, on the first run, the cache is being filled, so there is nothing left by the time you parse. You would have to tee the output yourself, which would not be too hard, but it does not seem to be available as-is. – Padraic Cunningham Jul 11 '16 at 21:17
  • Thanks, though this seems to be slower than just downloading the file all at once (w/o streaming) and then using iterparse. You've definitely given me enough to work with though, I should be able to figure out a solution on my own. – Noah Jul 11 '16 at 21:42
  • Streaming will definitely be slower. Like I said, I don't think the raw caching works out of the box, but it would only take a little bit of code to handle that: basically write the raw data and read it back in tee-like fashion (see the sketch below). – Padraic Cunningham Jul 11 '16 at 21:44
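
The tee approach from the last two comments could look something like the following rough sketch (not from the original answer; `tee_chunks`, `ChunkReader` and the cache path are illustrative names, and it is meant to run without requests_cache installed, since the cache would otherwise consume `raw`):

import os
import requests
import xml.etree.ElementTree as ET

CACHE_PATH = 'feed.xml'  # hypothetical local cache file

def tee_chunks(url, path, chunk_size=8192):
    # Replay from the local file if we already have it.
    if os.path.exists(path):
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                yield chunk
        return
    response = requests.get(url, stream=True)
    response.raw.decode_content = True  # let urllib3 gunzip as we read
    with open(path, 'wb') as f:
        for chunk in iter(lambda: response.raw.read(chunk_size), b''):
            f.write(chunk)  # tee: one copy goes to disk...
            yield chunk     # ...and one to the parser

class ChunkReader:
    """Minimal file-like wrapper so iterparse can pull from a generator."""
    def __init__(self, chunks):
        self._chunks = chunks
        self._buf = b''
    def read(self, size=-1):
        # pull chunks until the request can be satisfied or the stream ends
        while size < 0 or len(self._buf) < size:
            try:
                self._buf += next(self._chunks)
            except StopIteration:
                break
        if size < 0:
            data, self._buf = self._buf, b''
        else:
            data, self._buf = self._buf[:size], self._buf[size:]
        return data

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
for event, node in ET.iterparse(ChunkReader(tee_chunks(url, CACHE_PATH))):
    print(node.tag)
    node.clear()  # drop finished elements to keep memory usage flat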