Regex to remove all except XML

Question

I need help with a regex for Notepad++ that matches everything except the XML.

The regex I'm using is (!?\<.*\>), which works on the first three lines below. I want the opposite of what it matches.

The example code:

[20173003] This text is what I want to delete [<Person><Name>Foo</Name><Surname>Bar</Surname></Person>], and this text too.
[20173003] This is another text to delete [<Person><Name>Bar</Name><Surname>Foo</Surname></Person>]
[20173003] This text too... [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], delete me!
[20173003] But things like this make the regex to fail < [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], or this>

Expected result:

<Person><Name>Foo</Name><Surname>Bar</Surname></Person>
<Person><Name>Bar</Name><Surname>Foo</Surname></Person>
<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>
<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>

Thanks in advance!

Could you please provide the expected result (especially for the last line)? — Wiktor Stribiżew, Mar 30 '17 at 11:02
I'm not sure what you want to do with your input text, but the last input line fails because of the `.` in your pattern. Change it to `[^<>]` so that a match cannot run across other angle brackets, and add `?` after `*` to stop matching at the first `>`. — MohaMad, Mar 30 '17 at 11:05
I don't know what you're trying to do with `(!?\<.*\>)`. That will capture an optional exclamation mark `!` followed by an opening angle bracket, as many characters as possible, and a closing angle bracket. — Borodin, Mar 30 '17 at 12:27

Wiktor Stribiżew · Accepted Answer · 2017-03-30T11:52:49.093

This is not perfect, but it should work with your input, which looks quite simple and well-structured.

If you need to handle just a single unnested <Person> tag, you may use the simple regex (<Person>.*?</Person>)|. (it matches and captures any <Person> element into Group 1, and matches any other single char) and replace with the conditional replacement pattern (?{1}$1\n:) (it reinserts the Person element followed by a newline, or replaces the match with an empty string).
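For readers who want to try the same idea outside Notepad++, here is a minimal Python sketch of the simple pattern. Python's `re` module has no `(?{1}...:...)` conditional replacement syntax, so a replacement callback plays that role here. The function name `keep_person_tags` is made up for illustration, and `re.DOTALL` is an assumption so that the line breaks between records are also consumed by the `.` branch.

```python
import re

# Same alternation as the Notepad++ pattern: Group 1 captures a <Person> element,
# the second branch consumes any other single character.
SIMPLE = re.compile(r"(<Person>.*?</Person>)|.", re.DOTALL)


def keep_person_tags(text: str) -> str:
    """Keep each <Person> element on its own line and drop everything else."""
    # Emulates (?{1}$1\n:): if Group 1 took part in the match, reinsert it
    # followed by a newline; otherwise replace the match with an empty string.
    return SIMPLE.sub(lambda m: m.group(1) + "\n" if m.group(1) else "", text)


if __name__ == "__main__":
    line = ("[20173003] This text too... "
            "[<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], delete me!")
    print(keep_person_tags(line), end="")
```

A stray `<` or `>` outside the element, as in the last sample line of the question, is removed by the `.` branch like any other character.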

To make it a bit more generic, you may capture the opening and corresponding closing XML tags with a recursion-based Boost regex, and the appropriate conditional replacement pattern:

Find What: (<(\w+)[^>]*>(?:(?!</?\2\b).|(?1))*</\2>)|.
Replace With: (?{1}$1\n:)
. matches newline: ON

Regex Details:

(<(\w+)[^>]*>(?:(?!</?\2\b).|(?1))*</\2>) - Capturing group 1 (later recursed with the (?1) subroutine call), matching:
- <(\w+)[^>]*> - any opening tag with its name captured into Group 2
- (?:(?!</?\2\b).|(?1))* - zero or more occurrences of:
  - (?!</?\2\b). - any char (.) that does not start a sequence of <, an optional /, and the Group 2 tag name as a whole word
  - | - or
  - (?1) - the whole Group 1 subpattern is recursed (repeated)
- </\2> - the corresponding closing tag
| - or
. - any single char.
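Python's standard `re` module does not support the `(?1)` subroutine call, so the generic pattern cannot be ported as is. The sketch below swaps in a different technique, an explicit depth counter over same-named tags, to find balanced elements. It is an approximation of the idea and not a translation of the regex: the name `extract_elements` is hypothetical, and self-closing tags, comments, and CDATA sections are assumed not to occur.

```python
import re

# Any opening or closing tag; group 1 is "/" for a closing tag, group 2 is the name.
TAG = re.compile(r"<(/?)(\w+)[^>]*>")


def extract_elements(text: str) -> list[str]:
    """Return the top-level balanced elements found in text, in order."""
    found = []
    pos = 0
    while True:
        opening = TAG.search(text, pos)
        if opening is None:
            break
        if opening.group(1):          # stray closing tag: skip it
            pos = opening.end()
            continue
        name = opening.group(2)
        # Only tags with the same name change the nesting depth.
        same_name = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(name))
        depth, scan = 1, opening.end()
        while depth:
            tag = same_name.search(text, scan)
            if tag is None:           # no matching close: element is unbalanced
                break
            depth += -1 if tag.group(1) else 1
            scan = tag.end()
        if depth == 0:
            found.append(text[opening.start():scan])
            pos = scan
        else:
            pos = opening.start() + 1  # give up on this tag, keep scanning
    return found
```

Joining the returned list with newlines gives the same shape of output as the Find/Replace pair above for input like the question's sample lines.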

Replacement pattern:

(?{1} - if Group 1 matches:
- $1\n - replace with its contents + a newline
- : - else replace with an empty string
) - end of the replacement pattern.

I added a simplified version in case you just need to deal with one unnested `<Person>` tag. — Wiktor Stribiżew, Mar 30 '17 at 11:53
