I'm trying to fetch a gzipped XML file from an FTP server, parse the XML, and pull out data using XPath expressions, all without storing any files on disk. The code I've got is:

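import gzip
import io
from ftplib import FTP
from lxml import etree

# hostname, user and password are defined elsewhere
ftp = FTP()
ftp.connect(hostname)
ftp.login(user, password)

# Download the compressed file into an in-memory buffer
flo = io.BytesIO()
ftp.retrbinary('RETR myfile.xml.gz', flo.write)
flo.seek(0, 0)
uncompressed = gzip.decompress(flo.read())

# This is the call that fails
tree = etree.parse(uncompressed, etree.XMLParser(encoding='utf-8', ns_clean=True, recover=True))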

Everything works until the etree.parse() call, which fails and prints the contents of the XML file to the screen. The output starts with: OSError: Error reading file 'b'<?xml version="1.0" ... and ends with: failed to load external entity "b'<?xml version="1.0" encoding="UTF-8"?><merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNam

If I write the uncompressed file to disk first and then load it back in, the parse call works. I've also tried a parser created with resolve_entities=False, but the output doesn't change.

I've seen posts such as "Error 'failed to load external entity' when using Python lxml", but they deal with passing a string to etree.parse(), whereas I'm dealing with a bytes object:

type(uncompressed)
<class 'bytes'> 

Any help is much appreciated. Thanks

splotsh
  • It should be possible to use `parse()` on the compressed .gz file by the way. – mzjn Nov 26 '20 at 10:52
  • @mzjn The whole point was not to write anything to disk during this process :) I solved this yesterday, passing io.BytesIO(uncompressed) to parse. Parse does not appear to accept an object of type 'bytes'. I did miss in the documentation that parse() accepts gzipped XML files as well, that will save me a few lines of code. Thanks – splotsh Nov 27 '20 at 10:18
  • Instead of `parse()`, try `fromstring()`. It can take both `str` and `bytes` objects. – mzjn Nov 27 '20 at 17:17
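
The options mentioned in the comments above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the XML payload, the `product/name` path and all variable names are made up, and the FTP download is replaced by bytes built in memory. The `try`/`except` import is also an assumption added so that the sketch runs where lxml is not installed; the three calls used here exist in both lxml and the standard library's ElementTree.

```python
import gzip
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption for illustration: fall back to the standard library,
    # which provides the same parse() and fromstring() calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the gzipped bytes downloaded over FTP.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<merchandiser><product><name>Widget</name></product></merchandiser>'
)
compressed = gzip.compress(xml_bytes)

# Same step as in the question: decompress entirely in memory.
uncompressed = gzip.decompress(compressed)

# Option 1: wrap the bytes in a file-like object before calling parse().
root_from_parse = etree.parse(io.BytesIO(uncompressed)).getroot()

# Option 2: fromstring() takes the bytes directly and returns the root element.
root_from_string = etree.fromstring(uncompressed)

# Option 3: let GzipFile decompress on the fly and hand it to parse().
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
    root_from_gzip = etree.parse(gz).getroot()

roots = (root_from_parse, root_from_string, root_from_gzip)
names = [root.findtext('product/name') for root in roots]
print(names)
```

With the code from the question, the `flo` buffer (after `flo.seek(0, 0)`) could presumably be passed to `gzip.GzipFile(fileobj=flo)` in the same way as in option 3, which would skip the separate `gzip.decompress()` step.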

0 Answers