Ignoring an XML tag in the middle of the file in regex (with non-capturing group?)

Question

I have an XML document with an embedded tag, and I would like to capture everything but the FType tags, using a Python regex.

<xml>
<EType>
<E></E>
<F></F>
<FType><E1></E1><E2></E2></FType>
<FType><E1></E1><E2></E2></FType>
<FType><E1></E1><E2></E2></FType>
<G></G>
</EType>
</xml>

I tried:

(?P<xml>.*(?=<FType>.*<FType>).*)

But it gives me everything ;-(

I expect:

<xml>
<EType>
<E></E>
<F></F>
<G></G>
</EType>
</xml>

`in python regex` ... [whyyyyyyy](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1)? please use a parser. — mata, Oct 18 '13 at 07:58
You "want" everything but the `<FType>` tag, but seeing your XML structure, the `<FType>` is part of the `<EType>` tag and the `<xml>` tag. Logically it would be included in your results. Would you please show us the expected results? Also, to save yourself some trouble, you may consider a proper parser ... — HamZa, Oct 18 '13 at 08:03
I use Python regex, because I need this for splunk search string... — OpenStove, Oct 18 '13 at 08:42
Your updated expected result is apparently the empty string. For that, I would match the expression `r''`. — abarnert, Oct 18 '13 at 08:45
I don't know why Stack was not displaying it. I changed from code to quotes... now it's here :-) — OpenStove, Oct 18 '13 at 08:49
Meanwhile, splunk [search](http://docs.splunk.com/Documentation/Splunk/6.0/SearchReference/Search) does not take regexps. Splunk [regex](http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Regex) takes Perl regexps, not Python. Splunk also appears to have [XML](http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Xmlkv) functionality. So… why do you need Python regexps? — abarnert, Oct 18 '13 at 08:49

score 2 · Answer 1 · answered Oct 18 '13 at 08:50

No need for regular expressions:

In [1]: x = '''    
<xml>
<EType>
<E></E>
<F></F>
<FType><E1></E1><E2></E2></FType>
<FType><E1></E1><E2></E2></FType>
<FType><E1></E1><E2></E2></FType>
<G></G>
</EType>
</xml>'''

In [2]: y = '\n'.join([tag for tag in x.split() if not tag.startswith('<FType>')])

In [3]: print(y)
<xml>
<EType>
<E></E>
<F></F>
<G></G>
</EType>
</xml>

Chris Seymour

Thank you, but I need this for splunk search language, so I can't use python code around the regex... – OpenStove Oct 18 '13 at 09:04
I don't see that mentioned in the question anywhere? I don't know what splunk is but it sounds like the wrong tool for the job. – Chris Seymour Oct 18 '13 at 09:09

score 1 · Answer 2 · answered Oct 18 '13 at 08:54

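One way using BeautifulSoup:

from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed.
with open('xmlfile', 'r') as f:
    soup = BeautifulSoup(f, 'xml')

# Remove every <FType> element, together with its children, from the tree.
for elem in soup.find_all('FType'):
    elem.decompose()

print(soup.prettify())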

It yields:

<?xml version="1.0" encoding="utf-8"?>
<xml>
 <EType>
  <E/>
  <F/>
  <G/>
 </EType>
</xml>
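
If installing BeautifulSoup and lxml is not an option, roughly the same idea can be sketched with the standard library's xml.etree.ElementTree instead. This is a different parser from the one used above, not part of the original answer, and the names xml_text, root and result are only illustrative:

```python
import xml.etree.ElementTree as ET

# The XML from the question (illustrative variable name).
xml_text = """<xml>
<EType>
<E></E>
<F></F>
<FType><E1></E1><E2></E2></FType>
<FType><E1></E1><E2></E2></FType>
<FType><E1></E1><E2></E2></FType>
<G></G>
</EType>
</xml>"""

root = ET.fromstring(xml_text)

# Snapshot the elements first so the tree is not modified while it is being walked.
for parent in list(root.iter()):
    # Copy the child list too, because remove() changes it.
    for child in list(parent):
        if child.tag == 'FType':
            parent.remove(child)

result = ET.tostring(root, encoding='unicode')
print(result)
```

As with prettify() above, the serializer may write empty elements in the short form, so the output is equivalent XML rather than a character-for-character copy of the input.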

Birei

Thank you, but I need this for splunk search language, so I can't use python code around the regex... – OpenStove Oct 18 '13 at 09:05

score 1 · Accepted Answer · answered Oct 18 '13 at 08:54

There are at least four problems with your expression.

First, you're capturing everything from <xml> to </xml> in one big group. This means that if you manage to exclude the FType bits, you'll get nothing at all; if you don't, you'll get everything. If you create three separate groups, and make the middle one non-capturing, that will let you exclude the middle one.

Second, you're trying to exclude everything from <FType> to <FType>, which isn't going to work. The closing tag is </FType>.

Third, you're using greedy matches everywhere, so even if you get the first two right, you're going to match everything up to the last FType, including any earlier FTypes.

Putting it all together, with s holding the XML from the question:

>>> import re
>>> re.match(r'(?P<xml>.*?)(?:<FType>.*</FType>)(.*)', s, re.DOTALL).groups()
('<xml>\n<EType>\n<E></E>\n<F></F>\n', '\n<G></G>\n</EType>\n</xml>\n')

If you ''.join that together, or sub it to r'\1\2', etc., you'll get the desired output.
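
As a minimal sketch of those two options, assuming s is the question's XML with a trailing newline (the variable names joined and substituted are just for illustration):

```python
import re

# The XML from the question, with a trailing newline as in the output above.
s = (
    "<xml>\n<EType>\n<E></E>\n<F></F>\n"
    "<FType><E1></E1><E2></E2></FType>\n"
    "<FType><E1></E1><E2></E2></FType>\n"
    "<FType><E1></E1><E2></E2></FType>\n"
    "<G></G>\n</EType>\n</xml>\n"
)

pattern = r'(?P<xml>.*?)(?:<FType>.*</FType>)(.*)'

# Option 1: join the two captured groups from a single match.
joined = ''.join(re.match(pattern, s, re.DOTALL).groups())

# Option 2: replace the whole match with groups 1 and 2.
substituted = re.sub(pattern, r'\1\2', s, flags=re.DOTALL)

print(substituted)
```

Both variants give the same string. The newline that followed the last </FType> stays in the second group, so an empty line is left where the FType block used to be.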

Fourth, this is, of course, horribly brittle. But parsing a non-regular language like XML with regexps is guaranteed to be horribly brittle (or very complex and sometimes exponentially slow), which is why you shouldn't do it. But that's what you asked for.

And if you're trying to use this with a function that doesn't take regexp patterns, or one that takes a different regexp syntax than Python's, this probably isn't going to help you very much.

It really helps. If I understand correctly, it is not possible to concatenate group 1 and 3 directly in the regex? But like this I can handle it in Splunk, thank you. The fact is that the XML parser in Splunk can only handle 5000 characters, so I have to use regex ;-( — OpenStove, Oct 18 '13 at 09:23

score 1 · Answer 4 · answered Oct 18 '13 at 09:52

After reading your updated question and all the other answers, I wondered why you need to match at all.
You could just remove <FType>...</FType> by using a replace function.

import re

string = "<xml>\
<EType>\
<E></E>\
<F></F>\
<FType><E1></E1><E2></E2></FType>\
<FType><E1></E1><E2></E2></FType>\
<FType><E1></E1><E2></E2></FType>\
<G></G>\
</EType>\
</xml>"

result = re.sub(r'(?i)<ftype>.*?</ftype>[\r\n]*', r'', string)

print(result.replace("<", "&lt;").replace(">", "&gt;<br>"))  # the replace calls are just for the output

Explanation:

(?i) : enable the i modifier to match case-insensitively
<ftype> : match <ftype>
.*? : match as few characters as possible (non-greedy) until ...
</ftype> : match </ftype>
[\r\n]* : match \r or \n zero or more times
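
The string above contains no newlines, because the backslashes join the lines. As a rough sketch, the same substitution can be applied to the multi-line XML from the question (xml_text and cleaned are illustrative names). It relies on each FType element sitting on a single line: without re.DOTALL the dot does not match a newline, so an element spanning several lines would need that flag.

```python
import re

# The multi-line XML from the question, with real newlines.
xml_text = """<xml>
<EType>
<E></E>
<F></F>
<FType><E1></E1><E2></E2></FType>
<FType><E1></E1><E2></E2></FType>
<FType><E1></E1><E2></E2></FType>
<G></G>
</EType>
</xml>
"""

# Remove each <FType>...</FType> element plus the line break that follows it.
cleaned = re.sub(r'(?i)<ftype>.*?</ftype>[\r\n]*', '', xml_text)

print(cleaned)
```

Because the trailing line break is consumed along with each element, no empty lines are left behind.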
