Python how to strip white-spaces from xml text nodes

Question

I have a xml file as follows

<Person>
<name>

 My Name

</name>
<Address>My Address</Address>
</Person>

The tag has extra new lines, Is there any quick Pythonic way to trim this and generate a new xml.

I found this but it trims only which are between tags not the value https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/

Update 1 - Handle following xml which has tail spaces in <name> tag

<Person>
<name>

 My Name<shortname>My</short>

</name>
<Address>My Address</Address>
</Person>

Accepted answer handle above both kind of xml's

Update 2 - I have posted my version in answer below, I am using it to remove all kind of whitespaces and generate pretty xml in file with xml encodings

https://stackoverflow.com/a/19396130/973699

@Temere this is being used by other app and before it comes to my python program for validation — DevC, Oct 10 '13 at 10:15
Your additional example (on which the accepted answer does not work) is not well-formed. Be careful with the start- and end tags. And btw, you are "moving the goalposts". I think you should ask a new question. — mzjn, Oct 15 '13 at 17:39
@mzjn Yes, I could have asked a different questions, but I thought it is related, so it will great if best solution exist in this thread itself. I have posted my version in answer http://stackoverflow.com/a/19396130/973699 — DevC, Oct 16 '13 at 05:58
@mzjn I just read this, I will take care next time . http://meta.stackexchange.com/questions/153360/how-to-avoid-repositioning-of-goalposts-when-using-improve-details-bounty — DevC, Oct 16 '13 at 06:12

Birei · Accepted Answer · 2013-10-15T21:49:00.487

7

With lxml you can iterate over all elements and check if it has text to strip():

from lxml import etree

tree = etree.parse('xmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()

print(etree.tostring(root))

It yields:

<Person><name>My Name</name>
<Address>My Address</Address>
</Person>

UPDATE to strip tail text too:

from lxml import etree

tree = etree.parse('xmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

print(etree.tostring(root, encoding="utf-8", xml_declaration=True))

edited Oct 15 '13 at 21:49

answered Oct 11 '13 at 06:19

Birei

35,723
2
77
82

Perfect it worked, the only issue is that it doesn't preserve the xml version tag like – DevC Oct 15 '13 at 10:47
@DevC: Your data didn't include it. You can add it when printing: `print(etree.tostring(root, encoding="utf-8", xml_declaration=True))` – Birei Oct 15 '13 at 11:00
No, the actual xml file which I am trying have this but the final result doesn't show this. I don't want to hard code it as it may be changed. – DevC Oct 15 '13 at 11:06
I am keeping you answer accepted, but it doesn't work in above update condition – DevC Oct 15 '13 at 17:19
@DevC: This mixed-content of an element surrounded with text in both sides, can be extracted with the `tail` property that can be stripped too. I've updated the answer to adapt to your new case. – Birei Oct 15 '13 at 20:15

DevC · Answer 2 · 2013-10-16T06:00:03.917

Accepted answer given by Birei using lxml does the job perfectly, but I wanted to trim all kind of white/blank space, blank lines and regenerate pretty xml in a xml file.

Following code did what I wanted

from lxml import etree

#discard strings which are entirely white spaces
myparser = etree.XMLParser(remove_blank_text=True)

root = etree.parse('xmlfile',myparser)

#from Birei's answer 
for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

#write the xml file with pretty print and xml encoding
root.write('xmlfile', pretty_print=True, encoding="utf-8", xml_declaration=True)

score 2 · Answer 3 · answered Oct 10 '13 at 12:01

2

You have to do xml parsing for this one way or another, so maybe use xml.sax and copy to the output stream at each event (skipping ignorableWhitespace), and add tag markers as needed. Check the sample code here http://www.knowthytools.com/2010/03/sax-parsing-with-python.html.

answered Oct 10 '13 at 12:01

Basel Shishani

7,735
6
50
67

It would've really helped if you showed an example. Can't get to the link from work, and it's the only 2.3-compatible standard library answer here. – Jul 20 '18 at 14:03

score 1 · Answer 4 · answered Oct 10 '13 at 12:24

1

You can use beautifulsoup. Do traverse all elements and for each one that contains some text, replace it with its stripped version:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('xmlfile', 'r'), 'xml')

for elem in soup.find_all():
    if elem.string is not None:
        elem.string = elem.string.strip()

print(soup)

Assuming xmlfile with the content provided in the question, it yields:

<?xml version="1.0" encoding="utf-8"?>
<Person>
<name>My Name</name>
<Address>My Address</Address>
</Person>

answered Oct 10 '13 at 12:24

Birei

35,723
2
77
82

1

I'm assuming BeautifulSoup would abstract some details of xml, so he's better dealing with lxml or other parser directly to have a general solution, unless his xml is controlled. – Basel Shishani Oct 10 '13 at 12:55
1

@Birei yes as mentioned by Basel I am looking for doing it with xml parser like lxml/etree or minidom – DevC Oct 10 '13 at 23:45

score 1 · Answer 5 · answered Jul 20 '18 at 15:01

I'm working with an older version of Python (2.3), and I'm currently stuck with the standard library. To show an answer that's greatly backwards compatible, I've written this with xml.dom and xml.minidom functions.

import codecs
from xml.dom import minidom

# Read in the file to a DOM data structure.
original_document = minidom.parse("original_document.xml")

# Open a UTF-8 encoded file, because it's fairly standard for XML.
stripped_file = codecs.open("stripped_document.xml", "w", encoding="utf8")

# Tell minidom to format the child text nodes without any extra whitespace.
original_document.writexml(stripped_file, indent="", addindent="", newl="")

stripped_file.close()

While it's not BeautifulSoup, this solution is pretty elegant and uses the full force of the lower-level API. Note that the actual formatting is just one line :)

Documentation of API calls used here:

Python how to strip white-spaces from xml text nodes

5 Answers5

Linked