Encoding error in LXML etree.iterparse() but not in etree.parse()

Asked Mar 19 '21 at 19:23

Active Mar 26 '21 at 06:09

Viewed 197 times

I am trying to parse a ~200 MB XML file using lxml. Naively, I was calling etree.parse(xml_path) without any encoding argument and then using iterwalk() to iterate over some child nodes, thinking that this would lower memory consumption. It worked, and I could parse the entire file, albeit very slowly. Then I realized that etree.parse(xml_path) loads the whole file into memory, so using iterparse() or iterwalk() after that saves nothing.
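A minimal sketch of the streaming pattern that avoids building the whole tree, under stated assumptions: the function name count_elements, the tag "item", and the in-memory sample document are all hypothetical, and the stdlib fallback import is only there so the snippet runs where lxml is not installed. Cleared elements stay attached to the root as empty shells, so this reduces memory use rather than eliminating it.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the stdlib module, which also provides iterparse()
    import xml.etree.ElementTree as etree


def count_elements(source, tag):
    """Stream `source` and count elements named `tag` without calling parse()."""
    count = 0
    # events=("end",) yields each element once it has been fully read
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text and children once it has been processed
            elem.clear()
    return count


# Hypothetical sample, declared as ISO-8859-1 like the file in the question
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<root><item>caf\xe9</item><item>b</item></root>"
)
n_items = count_elements(io.BytesIO(sample), "item")
```

With a real file, the first argument would be the path or a file object opened in binary mode instead of the io.BytesIO wrapper.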

So now I am trying to call etree.iterparse(xml_path) directly on the same file, but I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 5123: invalid start byte

I tried passing both encoding='utf-8' and encoding='ISO-8859-1' to iterparse(), but the error remains. My XML file declares its encoding as 'ISO-8859-1'.
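As a diagnostic aid, the byte named in the traceback can be decoded under each encoding involved. This is only a sketch using Python's built-in codecs; it does not explain why iterparse() ends up decoding as UTF-8, and the mention of Windows-1252 is an assumption about where such a byte typically comes from, not something the question confirms.

```python
raw = b"\x93"  # the byte reported in the UnicodeDecodeError

# ISO-8859-1 maps every byte to a code point, so this never fails:
# 0x93 becomes U+0093, an invisible C1 control character.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 (assumption: a common source of such bytes) maps 0x93
# to U+201C, the left curly double quotation mark.
as_cp1252 = raw.decode("cp1252")

# UTF-8 rejects 0x93 as the first byte of a sequence, which matches
# the "invalid start byte" wording in the traceback.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason
```

If the text around the failing position contains curly quotes, the file may actually be Windows-1252 despite its ISO-8859-1 declaration.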

TL;DR: etree.parse() works but etree.iterparse() fails due to an encoding error. I went through all the SO answers on iterparse() encoding but no one seems to have had this problem yet.

Kevin Doshi

0 Answers