5

I have a xml file as follows

<Person>
<name>

 My Name

</name>
<Address>My Address</Address>
</Person>

The tag has extra new lines, Is there any quick Pythonic way to trim this and generate a new xml.

I found this but it trims only which are between tags not the value https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/

Update 1 - Handle following xml which has tail spaces in <name> tag

<Person>
<name>

 My Name<shortname>My</short>

</name>
<Address>My Address</Address>
</Person>

Accepted answer handle above both kind of xml's

Update 2 - I have posted my version in answer below, I am using it to remove all kind of whitespaces and generate pretty xml in file with xml encodings

https://stackoverflow.com/a/19396130/973699

Community
  • 1
  • 1
DevC
  • 7,055
  • 9
  • 39
  • 58
  • You may have more success with JSON – Temere Oct 10 '13 at 09:32
  • @Temere this is being used by other app and before it comes to my python program for validation – DevC Oct 10 '13 at 10:15
  • Your additional example (on which the accepted answer does not work) is not well-formed. Be careful with the start- and end tags. And btw, you are "moving the goalposts". I think you should ask a new question. – mzjn Oct 15 '13 at 17:39
  • 1
    @mzjn Yes, I could have asked a different questions, but I thought it is related, so it will great if best solution exist in this thread itself. I have posted my version in answer http://stackoverflow.com/a/19396130/973699 – DevC Oct 16 '13 at 05:58
  • @mzjn I just read this, I will take care next time . http://meta.stackexchange.com/questions/153360/how-to-avoid-repositioning-of-goalposts-when-using-improve-details-bounty – DevC Oct 16 '13 at 06:12

5 Answers5

7

With lxml you can iterate over all elements and check if it has text to strip():

from lxml import etree

tree = etree.parse('xmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()

print(etree.tostring(root))

It yields:

<Person><name>My Name</name>
<Address>My Address</Address>
</Person>

UPDATE to strip tail text too:

from lxml import etree

tree = etree.parse('xmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

print(etree.tostring(root, encoding="utf-8", xml_declaration=True))
Birei
  • 35,723
  • 2
  • 77
  • 82
  • Perfect it worked, the only issue is that it doesn't preserve the xml version tag like – DevC Oct 15 '13 at 10:47
  • @DevC: Your data didn't include it. You can add it when printing: `print(etree.tostring(root, encoding="utf-8", xml_declaration=True))` – Birei Oct 15 '13 at 11:00
  • No, the actual xml file which I am trying have this but the final result doesn't show this. I don't want to hard code it as it may be changed. – DevC Oct 15 '13 at 11:06
  • I am keeping you answer accepted, but it doesn't work in above update condition – DevC Oct 15 '13 at 17:19
  • @DevC: This mixed-content of an element surrounded with text in both sides, can be extracted with the `tail` property that can be stripped too. I've updated the answer to adapt to your new case. – Birei Oct 15 '13 at 20:15
3

Accepted answer given by Birei using lxml does the job perfectly, but I wanted to trim all kind of white/blank space, blank lines and regenerate pretty xml in a xml file.

Following code did what I wanted

from lxml import etree

#discard strings which are entirely white spaces
myparser = etree.XMLParser(remove_blank_text=True)

root = etree.parse('xmlfile',myparser)

#from Birei's answer 
for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

#write the xml file with pretty print and xml encoding
root.write('xmlfile', pretty_print=True, encoding="utf-8", xml_declaration=True)
DevC
  • 7,055
  • 9
  • 39
  • 58
2

You have to do xml parsing for this one way or another, so maybe use xml.sax and copy to the output stream at each event (skipping ignorableWhitespace), and add tag markers as needed. Check the sample code here http://www.knowthytools.com/2010/03/sax-parsing-with-python.html.

Basel Shishani
  • 7,735
  • 6
  • 50
  • 67
  • It would've really helped if you showed an example. Can't get to the link from work, and it's the only 2.3-compatible standard library answer here. –  Jul 20 '18 at 14:03
1

You can use . Do traverse all elements and for each one that contains some text, replace it with its stripped version:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('xmlfile', 'r'), 'xml')

for elem in soup.find_all():
    if elem.string is not None:
        elem.string = elem.string.strip()

print(soup)

Assuming xmlfile with the content provided in the question, it yields:

<?xml version="1.0" encoding="utf-8"?>
<Person>
<name>My Name</name>
<Address>My Address</Address>
</Person>
Birei
  • 35,723
  • 2
  • 77
  • 82
  • 1
    I'm assuming BeautifulSoup would abstract some details of xml, so he's better dealing with lxml or other parser directly to have a general solution, unless his xml is controlled. – Basel Shishani Oct 10 '13 at 12:55
  • 1
    @Birei yes as mentioned by Basel I am looking for doing it with xml parser like lxml/etree or minidom – DevC Oct 10 '13 at 23:45
1

I'm working with an older version of Python (2.3), and I'm currently stuck with the standard library. To show an answer that's greatly backwards compatible, I've written this with xml.dom and xml.minidom functions.

import codecs
from xml.dom import minidom

# Read in the file to a DOM data structure.
original_document = minidom.parse("original_document.xml")

# Open a UTF-8 encoded file, because it's fairly standard for XML.
stripped_file = codecs.open("stripped_document.xml", "w", encoding="utf8")

# Tell minidom to format the child text nodes without any extra whitespace.
original_document.writexml(stripped_file, indent="", addindent="", newl="")

stripped_file.close()

While it's not BeautifulSoup, this solution is pretty elegant and uses the full force of the lower-level API. Note that the actual formatting is just one line :)

Documentation of API calls used here: