I am trying to parse a ~200 MB XML file using lxml. At first I was naively calling etree.parse(xml_path), without specifying any encoding, and then using iterwalk() to iterate over some child nodes, thinking that this would lower memory consumption. It worked, and I could process my entire XML file, albeit very slowly. Then I realized that etree.parse(xml_path) loads the entire file into memory, so using iterparse() or iterwalk() after that doesn't make sense.
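To make the memory point concrete, here is a minimal sketch of the streaming pattern I am aiming for, where each element is released as soon as it has been handled. The function name count_elements and the sample data are made up for illustration, and the fallback to the standard library's ElementTree is only an assumption so that the sketch also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in this question
except ImportError:
    # Assumption: fallback only so the sketch runs without lxml installed.
    import xml.etree.ElementTree as etree


def count_elements(source, tag):
    """Count elements named `tag` while streaming, without building the full tree."""
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # clear() drops the element's children, text and attributes;
        # the emptied element itself stays attached to its parent.
        elem.clear()
    return count


# Hypothetical sample document; a real call would pass the file path instead.
sample = (
    b"<?xml version='1.0' encoding='ISO-8859-1'?>"
    b"<root><item>caf\xe9</item><item>b</item><other/></root>"
)
print(count_elements(io.BytesIO(sample), "item"))
```

With etree.parse() followed by iterwalk(), the whole tree already exists before the loop starts, so clearing elements inside the loop cannot bring down the peak memory use.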
So now I am trying to call etree.iterparse(xml_path) directly on the same file, but I am getting:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 5123: invalid start byte
I tried passing both encoding='utf-8' and encoding='ISO-8859-1' to iterparse(), but the error remains. The XML declaration in my file states that the encoding is 'ISO-8859-1'.
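For anyone trying to reproduce this, the following is a rough sketch of how the declared encoding and the offending byte could be inspected using only the standard library. The helper names declared_encoding and decode_candidates are hypothetical, and the sample bytes are invented; for the real file, the bytes would be read in "rb" mode. Note that the position reported in the error may refer to an internal buffer rather than to an offset in the file, so that part is an assumption to verify.

```python
import re


def declared_encoding(head: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = re.match(
        rb"""\s*<\?xml[^>]*?encoding\s*=\s*["']([^"']+)["']""", head
    )
    return match.group(1).decode("ascii") if match else None


def decode_candidates(raw: bytes, encodings=("utf-8", "iso-8859-1", "cp1252")):
    """Try to decode `raw` with each candidate encoding and report the outcome."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError as exc:
            results[enc] = f"<fails: {exc.reason}>"
    return results


# Invented sample: a declaration like mine, plus the byte from the error message.
sample_head = b"<?xml version='1.0' encoding='ISO-8859-1'?><a>\x93quoted\x94</a>"
print(declared_encoding(sample_head))
for enc, outcome in decode_candidates(b"\x93").items():
    print(enc, repr(outcome))
```

The byte 0x93 is a continuation byte in UTF-8 and can never start a character, which matches the "invalid start byte" message. In ISO-8859-1 it maps to a C1 control character, while in Windows-1252 (cp1252) it is a left curly double quote, so a file that contains it may actually be cp1252 despite what its declaration says.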
TL;DR: etree.parse() works but etree.iterparse() fails with an encoding error. I went through all the SO answers on iterparse() encoding, but no one seems to have had this problem yet.