
My program saves a bit of XML data to a file in a prettified format from an XML string. This does the trick:

from xml.dom.minidom import parseString
dom = parseString(strXML)
with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml())

However, I noticed that my XML header is missing an encoding parameter.

<?xml version="1.0" ?>

Since my data is likely to contain many Unicode characters, I must make sure UTF-8 is also specified in the XML encoding field.

Now, looking at the minidom documentation, I read that "an additional keyword argument encoding can be used to specify the encoding field of the XML header". So I try this:

from xml.dom.minidom import parseString
dom = parseString(strXML)
with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml(encoding="UTF-8"))

But then I get:

TypeError: write() argument must be str, not bytes

Why doesn't the first piece of code yield that error? And what am I doing wrong?

Thanks!

R.

mrgou
  • change "w" to "wb": done – Jean-François Fabre May 06 '18 at 19:21
  • OK, but then will the output file really be encoded in UTF-8? And most importantly, why? Why does adding the encoding argument in the toprettyxml method require me to open the file as binary when I don't need to otherwise? – mrgou May 06 '18 at 19:29

3 Answers


From the documentation (emphasis mine):

With no argument, the XML header does not specify an encoding, and the result is Unicode string if the default encoding cannot represent all characters in the document. Encoding this string in an encoding other than UTF-8 is likely incorrect, since UTF-8 is the default encoding of XML.

With an explicit encoding argument, the result is a byte string in the specified encoding. It is recommended that this argument is always specified. To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as “utf-8”.

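So toprettyxml returns a different object type depending on whether encoding is set: a str without it, a bytes object with it (which is rather confusing if you ask me). The write method of a file opened in text mode only accepts str, hence the TypeError.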
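The type difference can be checked directly. This is a minimal sketch, assuming a small made-up document (the sample XML string is not from the question); it only inspects what toprettyxml returns in each case.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, used only for illustration.
dom = parseString("<root><item>caf\u00e9</item></root>")

as_text = dom.toprettyxml()                   # no encoding argument: returns str
as_bytes = dom.toprettyxml(encoding="UTF-8")  # encoding argument: returns bytes

print(type(as_text).__name__)    # str
print(type(as_bytes).__name__)   # bytes

# The first line of each result is the XML header; only the bytes
# version carries the encoding field.
print(as_text.splitlines()[0])
print(as_bytes.splitlines()[0])
```

A file opened with "w" expects the first kind of object, a file opened with "wb" expects the second.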

So you can fix it by removing the encoding argument:

with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml())
    

or by opening your file in binary mode, which accepts byte strings:

with open(file_name + ".xml", "wb") as outfile:
    outfile.write(dom.toprettyxml(encoding="utf8"))
Jean-François Fabre
  • Ah that clarifies it (although I agree it's quite confusing). I hadn't seen that bit of explanation in the document I consulted. Many thanks!!! :-) – mrgou May 06 '18 at 19:40
  • This does not take line-ending into consideration. On Windows, line-ending is usually CRLF rather than LF only. Writing in `Binary` mode will not automatically convert line-endings, but `Text` mode will do. – guan boshen Dec 15 '21 at 02:51
  • okay, it is well known but line termination doesn't matter with xml files. – Jean-François Fabre Dec 15 '21 at 09:50

You can solve the problem as follows:

with open(targetName, 'wb') as f:
    f.write(dom.toprettyxml(indent='\t', encoding='utf-8'))
Tiger Wang

I don't recommend using wb mode for output, because it does not perform line-ending conversion (text mode, for example, converts \n to \r\n on Windows). Instead, I use the following method:

from xml.dom import minidom

dom = minidom.parseString(utf_8_xml_text)

out_byte = dom.toprettyxml(encoding="utf-8")
out_text = out_byte.decode("utf-8")

with open(filename, "w", encoding="utf-8") as f:
    f.write(out_text)

For Python 3.9 and later, use the built-in indent function (xml.etree.ElementTree.indent) instead.
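As a rough sketch of that alternative, assuming the function meant is xml.etree.ElementTree.indent (added in Python 3.9): the sample XML string and the output path below are made up for illustration, and the XML header is written by hand so the file can stay in text mode.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample input and output path, for illustration only.
xml_text = "<root><item>caf\u00e9</item></root>"
filename = os.path.join(tempfile.mkdtemp(), "example.xml")

root = ET.fromstring(xml_text)
ET.indent(root, space="\t")  # pretty-prints the tree in place (Python 3.9+)

# encoding="unicode" makes tostring return a str without an XML header.
body = ET.tostring(root, encoding="unicode")

# Text mode keeps the platform's line-ending conversion; the header is
# written explicitly so that it names the encoding used for the file.
with open(filename, "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write(body)
    f.write("\n")
```

The header string and the encoding passed to open must agree, since nothing here checks that for you.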

guan boshen