I am new to Python (using 3.x) and am running into an issue with an XML file that I'm testing and learning on. When I look at the raw file (which is ASCII-encoded, by the way), I'm fairly sure the issue is that it contains a U+00A0 character.

The XML is as follows:

<?xml version="1.0" encoding="utf-8"?>
<XMLSetData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.clientsite.com/subdir/r2.4/v1">
  <FileCreationDate>2018-05-05T11:35:44.1043858-05:00</FileCreationDate>
  <XMLSetDataList>
    <DataIDNumber>99345346</DataIDNumber>
    <DataName>RSRS TVL5697 ULL  Georgetown</DataName>
  </XMLSetDataList>
</XMLSetData>

Notepad++ shows that the text has "xA0 " instead of "  " (two spaces) between ULL and Georgetown. So when I run the code below:

import xml.etree.ElementTree as ET

# xml_file holds the path to the XML file shown above
events = ("end", "start-ns", "end-ns")

for event, elem in ET.iterparse(xml_file, events=events):
    if event == "end":
        eltag = elem.tag
        eltext = elem.text
        print(eltag, eltext)

It gives me an error stating:

  File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1297, in read_events
    raise event
  File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1269, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 6, column 30

How do I fix this / get around it? If I remove the xA0 part, it parses fine, but obviously something like this may come up again, and I'd like to programmatically handle it.
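
For reference, here is a minimal sketch that should reproduce the error without the original file. The sample document and the helper names `make_doc` and `try_parse` are made up for illustration. The only thing taken from the question is the idea of a lone 0xA0 byte inside a file that declares UTF-8. For comparison, the same document with a properly UTF-8-encoded non-breaking space (bytes 0xC2 0xA0) is parsed as well.

```python
import xml.etree.ElementTree as ET


def make_doc(nbsp: bytes) -> bytes:
    """Build a small sample document with the given bytes after 'ULL'."""
    return (
        b'<?xml version="1.0" encoding="utf-8"?>\n'
        b"<XMLSetDataList>\n"
        b"  <DataName>RSRS TVL5697 ULL" + nbsp + b" Georgetown</DataName>\n"
        b"</XMLSetDataList>\n"
    )


def try_parse(data: bytes):
    """Return the DataName text, or None if the parser rejects the bytes."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        # exc.position is a (line, column) tuple pointing at the bad input
        print("ParseError:", exc, "at", exc.position)
        return None
    return root.find("DataName").text


# A lone 0xA0 byte is not valid UTF-8, so the parser is expected to reject it.
bad = try_parse(make_doc(b"\xa0"))

# U+00A0 encoded as UTF-8 is the two-byte sequence 0xC2 0xA0.
good = try_parse(make_doc(b"\xc2\xa0"))

print(repr(bad), repr(good))
```

If the first call fails and the second succeeds on your machine too, that would suggest the problem is how the byte is encoded rather than the character itself.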

D W
  • How are you reading the file? If you used `open()`, then that's your mistake. – Tomalak Aug 17 '18 at 08:24
  • I'm not, I'm passing the filepath in xml_file variable, since iterparse handles that. Also, it will eventually be a very large xml file, so I was playing with it with that in mind. – D W Aug 17 '18 at 15:28
  • An ASCII-encoded XML file is rather rare. What does the XML declaration look like? `00A0` looks like UTF-16 to me (https://www.fileformat.info/info/unicode/char/00a0/index.htm) – Tomalak Aug 17 '18 at 19:49
  • The XML Declaration says ....but if that were true, would iterparse have an issue with it, or would that character be allowed? Plus, the documentation of the XML file states that the files are encoded in ASCII (apparently regardless of what the actual declaration says). I don't know of a way to verify that it's ASCII or UTF-8? – D W Aug 17 '18 at 20:07
  • Generally I recommend not giving a rat's ass at what the documentation says, any documentation is wishful thinking. Some are closer to reality than others, this one is off. Always look at what you *actually* have. If the file were produced by a tool that knows anything about XML, then the XML declaration would tell the truth, and loading via `.parse()`/`.iterparse()` would simply work. That's the whole point of having an XML declaration after all. It looks like the file *wasn't* produced by such a tool and all bets are off. `00A0` definitely is not a valid byte sequence in UTF-8. – Tomalak Aug 18 '18 at 00:07
  • Then again, I have no idea what that website you are using is doing. `A0` is the non-breaking space, it's really not an unusual character. And the parser error says "invalid token", which is not the same thing as a character encoding error. Look at the file with a capable text editor at line 8376. – Tomalak Aug 18 '18 at 00:28
  • Notepad++ says it's xA0 . Is there a way to check for / remove / convert that stuff before I try .iterparse() ? What can I do to move around it? Obviously I could open it and delete it and change it to a space myself, but if it comes up in the future, I'd like to programmatically handle it. – D W Aug 18 '18 at 01:54
  • The A0 is not your problem. That's a completely regular character, leave it be. The XML has a different issue. Reduce the XML to the smallest possible part that reproduces the exact same error message, i.e. throw out everything that works, and show the remaining XML in your question. – Tomalak Aug 18 '18 at 02:06
  • A bare xA0 character isn't valid UTF-8. Try changing the XML declaration to read and retry the parse. – cco Aug 18 '18 at 03:26
  • That works, except the resulting output shows it as "RSRS TVL5697 ULLÂ Georgetown". Which is fine to me actually. Why does that work, and how do I fix that programmatically before I try to parse it? Just write the line to the file head? – D W Aug 18 '18 at 04:23
  • How is the XML file being treated before you try to open it in Python? Is it processed in any way? Is it downloaded from somewhere? With what tool? I'm not a fan at all of manually forcing an encoding on an XML file because it breaks data - as you can see, the letter `Â` is not really supposed to be in there, and it's an indication that the file might be UTF-8 after all. Let's assume the file is produced properly and some sort of wrong handling breaks it, identifying and fixing the point where the file breaks is far preferable to forcing the first-best encoding that doesn't throw an error. – Tomalak Aug 18 '18 at 06:42
  • It's a file generated by a client's db system that I receive via FTP. It's stored in a gzip until it's unpacked and I can then process it. – D W Aug 18 '18 at 08:21
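
Two of the questions raised in the comments (how to verify whether the bytes are really UTF-8, and how to handle a mismatch before parsing) can be sketched roughly as follows. This is not from the thread: the sample bytes, the helper names `first_invalid_utf8` and `parse_bytes`, and the choice of ISO-8859-1 as the fallback are all assumptions. The thread does not establish what the file's real encoding is, and forcing one can silently produce wrong characters, as noted above. The sketch relies on the documented `encoding` argument of `xml.etree.ElementTree.XMLParser`, which overrides the encoding named in the XML declaration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: declares UTF-8 but contains a lone 0xA0 byte.
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<XMLSetDataList>\n"
    b"  <DataName>RSRS TVL5697 ULL\xa0 Georgetown</DataName>\n"
    b"</XMLSetDataList>\n"
)


def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def parse_bytes(data: bytes, fallback_encoding: str = "iso-8859-1"):
    """Parse as declared if the bytes decode as UTF-8; otherwise override
    the declared encoding with fallback_encoding (an assumption)."""
    if first_invalid_utf8(data) is None:
        return ET.fromstring(data)
    parser = ET.XMLParser(encoding=fallback_encoding)
    parser.feed(data)
    return parser.close()


offset = first_invalid_utf8(SAMPLE)
root = parse_bytes(SAMPLE)
name = root.find("DataName").text
print("first invalid byte at offset", offset)
print(repr(name))
```

This version reads everything into memory, so it is only meant to show the idea. `XMLParser.feed()` can be called repeatedly with chunks, which may suit a very large file better, and reporting the offending offset is probably more useful than quietly re-encoding if the file can be fixed at its source.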
