I am trying to parse a ~200 MB XML file with lxml. At first I called etree.parse(xml_path) without specifying any encoding and then used iterwalk() to iterate over some child nodes, thinking that this would lower memory consumption. It worked, and I could parse the entire file, albeit very slowly. Then I realized that etree.parse(xml_path) loads the whole file into memory, so calling iterparse() or iterwalk() after that doesn't make sense.
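For reference, the streaming pattern I am aiming for looks roughly like the sketch below. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed; lxml's etree.iterparse() is called in a similar way. The tag name "item", the helper count_items() and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

def count_items(stream, tag="item"):
    """Stream-parse `stream` and count `tag` elements, clearing each one after use."""
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text, attributes and children so that
            # already-processed data does not pile up in memory.
            elem.clear()
    return count

# Hypothetical sample document; the real input would be the file opened in binary mode.
SAMPLE = b"<root><item>a</item><item>b</item><other/></root>"
total = count_items(io.BytesIO(SAMPLE))
```

The point is that each element is handled and released while the file is still being read, instead of building the full tree first as etree.parse() does.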

So now I am trying to call etree.iterparse(xml_path) directly on the same file, but I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 5123: invalid start byte

I tried passing both encoding='utf-8' and encoding='ISO-8859-1' to iterparse(), but the error remains. The XML declaration in my file states that the encoding is 'ISO-8859-1'.
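To narrow down what byte 0x93 means, here is a small standalone check of how that byte decodes under different codecs. The sample bytes are made up and are not taken from my file. In Windows-1252, 0x93 and 0x94 are curly quotation marks, so one possibility (an assumption, not something I have confirmed) is that the file is really Windows-1252 even though it declares ISO-8859-1.

```python
# Hypothetical bytes for illustration only, not copied from the real file.
raw = b"\x93quoted\x94"

def try_decode(data, encoding):
    """Return the decoded text, or a short error string if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"

as_utf8 = try_decode(raw, "utf-8")         # fails: 0x93 cannot start a UTF-8 sequence
as_latin1 = try_decode(raw, "iso-8859-1")  # succeeds, but yields C1 control characters
as_cp1252 = try_decode(raw, "cp1252")      # succeeds, yields curly quotation marks
```

Since ISO-8859-1 maps every byte value to a character, decoding as ISO-8859-1 cannot raise this error, which suggests the failing UTF-8 decode is happening somewhere the encoding argument is not being applied.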

TL;DR: etree.parse() works but etree.iterparse() fails due to an encoding error. I went through all the SO answers on iterparse() encoding but no one seems to have had this problem yet.

Kevin Doshi
