I need to modify an XML file using Python: the newlines and whitespace around each element's text should be removed. For example, Input.xml below should become Output.xml.

I tried a regex like the one below, but it did not help. I am trying to open input.xml, apply a regex replacement, and save the result as output.xml.

Dim RegexObj As New Regex(">[\s]*<")
Newxml = RegexObj.Replace(OldText, "><")

Input.xml

<Instal xmlns="http://www.test.com/abc/dfg">
<Version>
    1.1
</Version>
<alpha>
    <ns3:myname xmlns:ns3="http://www.test.com/asd/asd/cvf">
        GH12345
    </ns3:myname>
    <ns4:beta xmlns:ns4="http://www.test.com/asd/asd/cvf">
        PLAN
    </ns4:beta>
    <ns5:OperatorName xmlns:ns5="http://www.test.com/asd/asd/cvf">
        Tanho
    </ns5:OperatorName>
</alpha>
<Laptop>
    A
</Laptop>
<ID>
    2883
</ID>
<PERSON>
    <ns6:FirstName xmlns:ns6="http://www.test.com/asd/asd/cvf">
        MAMA
    </ns6:FirstName>
    <ns7:LastName xmlns:ns7="http://www.test.com/asd/asd/cvf">
        REHA
    </ns7:LastName>
</PERSON>
</Instal xmlns="http://www.test.com/abc/dfg">

Output.xml

<Instal xmlns="http://www.test.com/abc/dfg">
<Version>1.1</Version>
<alpha>
    <ns3:myname xmlns:ns3="http://www.test.com/asd/asd/cvf">GH12345</ns3:myname>
    <ns4:beta xmlns:ns4="http://www.test.com/asd/asd/cvf">PLAN</ns4:beta>
    <ns5:OperatorName xmlns:ns5="http://www.test.com/asd/asd/cvf">Tanho</ns5:OperatorName>
</alpha>
<Laptop>A</Laptop>
<ID>2883</ID>
<PERSON>
    <ns6:FirstName xmlns:ns6="http://www.test.com/asd/asd/cvf">MAMA</ns6:FirstName>
    <ns7:LastName xmlns:ns7="http://www.test.com/asd/asd/cvf">REHA</ns7:LastName>
</PERSON>
</Instal xmlns="http://www.test.com/abc/dfg">
robin hood
  • What is your question? What is not working in your current program? You need to define your problem more clearly - there are many instances in your output where the newline was *not* removed between angle brackets - what makes those special? – Tim Pietzcker Oct 13 '14 at 07:13
  • In Output.xml, some elements have children (like beta, OperatorName), so those need to be kept in the correct format, similarly with – robin hood Oct 13 '14 at 07:27
  • The code you have posted is not in Python? – Burhan Khalid Oct 13 '14 at 16:23

2 Answers

You can do it with non-greedy expressions and re.DOTALL, selecting a pattern that contains only:

  • an opening tag with a name and optional attributes
  • optional newlines
  • text, but no child tags
  • optional newlines
  • a closing tag (same name as the opening tag)

The replacement string removes only those optional newlines.

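import re

# Read the whole file; txt is the file's content, not its name
with open('input.xml') as fd:
    txt = fd.read()
rx = re.compile(r"(<\s*(.*?)(\s*[^>]*?)>)\s*\n*\s*([^<]*?)\s*\n*\s*(</\s*\2\s*>)", re.DOTALL)
# Keep the opening tag, the text and the closing tag; drop the whitespace between them
filtered = rx.sub(r"\1\4\5", txt)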

If txt holds the content of your Input.xml, print(filtered) gives:

<Instal xmlns="http://www.test.com/abc/dfg">
<Version>1.1</Version>
<alpha>
    <ns3:myname xmlns:ns3="http://www.test.com/asd/asd/cvf">GH12345</ns3:myname>
    <ns4:beta xmlns:ns4="http://www.test.com/asd/asd/cvf">PLAN</ns4:beta>
    <ns5:OperatorName xmlns:ns5="http://www.test.com/asd/asd/cvf">Tanho</ns5:OperatorName>
</alpha>
<Laptop>A</Laptop>
<ID>2883</ID>
<PERSON>
    <ns6:FirstName xmlns:ns6="http://www.test.com/asd/asd/cvf">MAMA</ns6:FirstName>
    <ns7:LastName xmlns:ns7="http://www.test.com/asd/asd/cvf">REHA</ns7:LastName>
</PERSON>
</Instal xmlns="http://www.test.com/abc/dfg">

The current regex does not tolerate differences in letter case between the opening and closing tags. If you need that, add re.I to the flags.
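
To make the five parts listed above easier to follow, here is the same pattern spelled out with re.VERBOSE and run on a small in-memory string. This is only a sketch: the function name collapse_text_elements and the sample document are made up for illustration, and the behaviour is meant to match the one-line pattern, not to replace it.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so each part is labelled.
TAG_RX = re.compile(r"""
    (<\s*(.*?)(\s*[^>]*?)>)   # group 1: opening tag (group 2: name, group 3: attributes)
    \s*\n*\s*                 # optional newlines and indentation
    ([^<]*?)                  # group 4: text content, no child tags
    \s*\n*\s*                 # optional newlines and indentation
    (</\s*\2\s*>)             # group 5: closing tag with the same name
""", re.DOTALL | re.VERBOSE)


def collapse_text_elements(xml_text):
    """Hypothetical helper: keep tag, text and closing tag, drop the whitespace."""
    return TAG_RX.sub(r"\1\4\5", xml_text)


# Made-up sample in the same layout as Input.xml
sample = "<root>\n<Version>\n    1.1\n</Version>\n<ID>\n    2883\n</ID>\n</root>"
print(collapse_text_elements(sample))
```

Elements that contain child tags (here root) are left alone, because group 4 cannot cross a `<`.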

Serge Ballesta
  • @Serge: I don't understand. I tried the code below, which gives "input.xml" as output: rx = re.compile("(<\s*(.*?)(\s*[^>]*?)>)\s*\n*\s*([^<]*?)\s*\n*\s*(\s*\\2\s*>)", re.DOTALL) filtered = rx.sub("\\1\\4\\5", "Input.xml") print filtered – robin hood Oct 13 '14 at 12:38
  • @robinhood: txt is the **content** of the file. See my edit above. – Serge Ballesta Oct 13 '14 at 13:50

I just used a simple regex. My answer is written for Python 2.7, so it may not work for you, depending on the version of Python you are using.

import re

# Read the whole file into memory (the with block closes it again)
with open('input.xml', 'r') as input_file:
    xml_text = input_file.read()

# Drop the newlines and indentation around each text value
output = re.sub(r'\n\s*([^<> ]+)\s*\n\s*', r'\1', xml_text, flags=re.MULTILINE)

with open('output.xml', 'w') as output_file:
    output_file.write(output)

Here's a working repl: http://repl.it/1SG/3

EDIT

This will not work if your values contain greater than or less than signs. I'm not sure how XML works entirely but it may not even allow those characters as values anyway.
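
On the point raised in this EDIT: in well-formed XML a literal `<` cannot appear inside text content; it is written as the entity `&lt;` (and `>` is usually written as `&gt;`). The sketch below uses the standard library's xml.sax.saxutils.escape to produce such a value and then applies the same regex to it. The value and the one-element document are made up for illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical value containing a less-than sign
raw_value = "1<2"
escaped_value = escape(raw_value)  # '<' becomes '&lt;'

# Hypothetical one-element document in the same layout as Input.xml
document = "<ID>\n    " + escaped_value + "\n</ID>"

# The regex from this answer: an escaped value contains no literal '<' or '>'
flattened = re.sub(r'\n\s*([^<> ]+)\s*\n\s*', r'\1', document)
print(flattened)
```

Note that the character class `[^<> ]` also excludes spaces, so a value such as "1 &lt; 2" would still not be matched by this pattern.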

Bryce Siedschlaw