I have a Python script that parses an XML file and returns the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 614617: character maps to <undefined>

I'm fairly sure the error occurs because the XML document I'm trying to parse contains some illegal characters. However, I don't have access to fix the XML file that I'm reading from.

Is there a way to keep these characters from tripping up my script, so that it can continue parsing without error?

This is the part of the script that reads the XML and decodes it:

def ReadXML(self, path):
    self.logger.info("Reading XML from %s" % path)
    codec = "Windows-1252"
    xmlReader = open(path, "r")
    return xmlReader.read().decode(codec)
bigmike7801

1 Answer

When you call `decode()`, you can pass the optional `errors` argument. It defaults to `'strict'`, which raises an error when it hits a byte it can't decode, but you can also set it to `'replace'` (which substitutes U+FFFD, `\ufffd`, for each undecodable byte) or `'ignore'` (which silently drops the undecodable byte).

So it would be:

return xmlReader.read().decode(codec, errors='ignore')

or whichever error handler you choose.

More info can be found in the Python Unicode HOWTO.
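
As a rough illustration of how the three handlers differ, here is a minimal sketch. The sample bytes, the `CODEC` constant, and the `read_xml` helper are made up for this example and are not from the original script; the helper assumes it is acceptable to decode while reading via `io.open` instead of calling `decode()` afterwards.

```python
import io

CODEC = "Windows-1252"  # same codec name as in the question


def read_xml(path, errors="replace"):
    """Hypothetical helper: read `path` as text with the given error handler."""
    # io.open accepts encoding/errors on both Python 2 and 3,
    # and the `with` block closes the file when done.
    with io.open(path, "r", encoding=CODEC, errors=errors) as xml_file:
        return xml_file.read()


# Made-up sample: byte 0x9d has no mapping in Python's Windows-1252 codec.
raw = b"ab\x9dcd"

# 'strict' (the default) raises UnicodeDecodeError on the bad byte.
try:
    raw.decode(CODEC)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' keeps a visible U+FFFD marker where the bad byte was.
replaced = raw.decode(CODEC, "replace")

# 'ignore' drops the bad byte without leaving a trace.
ignored = raw.decode(CODEC, "ignore")
```

Note that `'ignore'` silently discards data, so `'replace'` may be the safer choice if you want to be able to spot where the input was damaged.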

Niklas B.
cjm
  • I actually just tried: `return xmlReader.read().decode(codec, 'ignore')` and that seemed to work fine. Is that the same as what you mentioned? – bigmike7801 Mar 06 '12 at 19:23
  • @bigmike7801: If you look at [the docs](http://docs.python.org/library/stdtypes.html#str.decode), you see that the second positional parameter is `errors`, so yes, it's the same. Reading documentation is always encouraged. – Niklas B. Mar 06 '12 at 19:25