Python regex to remove whitespace inside a pattern match

Question

I have some well-behaved XML files that I want to reformat (NOT PARSE!) using regex. The goal is to put every <trkpt>...</trkpt> pair on a single line.

The following code works, but I'd like to get the operations performed in a single regex substitution instead of the loop, so that I don't need to concatenate the strings back.

import re

xml = """
    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
    </trkseg>
"""

for trkpt in re.findall(r'<trkpt.*?</trkpt>', xml, re.DOTALL):
    # No flag here: the pattern has no "." and the 4th positional argument of re.sub is count
    print(re.sub(r'>\s*<', '><', trkpt))

An answer using sed would also be welcome.

Thanks for reading

Is it `trkseg` or `trkpt` that you want as one line? You state `trkseg` but your regex works on `trkpt`... — KRyan, Aug 30 '12 at 21:04
Also, I assume that whichever it is, it's impossible to have nested tags of that type? As soon as you have nesting, regex isn't going to be able to handle it. — KRyan, Aug 30 '12 at 21:05
If this is a "quick and dirty" script that you're doing, and you don't want to parse xml, I would say a for loop is simpler and much more readable than a crazy regex. — Alexander Kondratskiy, Aug 30 '12 at 21:05
@AlexanderKondratskiy Although it is a QnD job, I want to use the spare time to learn regex. It's part of the challenge :o) — heltonbiker, Aug 30 '12 at 21:09

Tim Pietzcker · Accepted Answer · 2012-08-31T06:32:47.827

How about this:

>>> regex = re.compile(
    r"""\n[ \t]*  # Match a newline plus following whitespace
    (?=           # only if... 
     (?:          # ...the following can be matched:
      (?!<trkpt)  #  (unless an opening <trkpt> tag occurs first)
      .           #  any character
     )*           # any number of times,
     </trkpt>     # followed by a closing </trkpt> tag
    )             # End of lookahead""", 
    re.DOTALL | re.VERBOSE)
>>> print(regex.sub("", xml))

    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
    </trkseg>

edited Aug 31 '12 at 06:32

answered Aug 30 '12 at 21:53

Tim Pietzcker

It didn't work for me the first time. I had to strip the comments from the multiline pattern and drop the `re.VERBOSE` flag, like this: `regex = re.compile('\n[ \t]*(?=(?:(?!<trkpt).)*</trkpt>)', re.DOTALL)`. Then it worked, but "ate" the indentation (not a big problem, I plan to prettyprint the result anyway). – heltonbiker Aug 30 '12 at 22:50
@heltonbiker: Sorry, I had forgotten the `r` prefix for the string when I changed my regex into a verbose one. Now it should work correctly. Sorry for not answering sooner but it was past midnight here when you wrote your comment and I only read it just now. – Tim Pietzcker Aug 31 '12 at 06:34
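
To illustrate why the missing `r` prefix mattered: in a non-raw string, Python turns `\n` into a real newline before the regex engine sees it, and under `re.VERBOSE` literal whitespace outside a character class is ignored. The following is a minimal sketch with a shortened pattern and a made-up sample string (both are illustrative assumptions, not the data from the question):

```python
import re

# Hypothetical sample, much shorter than the XML in the question.
sample = "<a>\n  <b>x</b>\n</a>"

# Raw string: the regex engine receives the two characters "\" and "n"
# and interprets them as "match a newline", even in verbose mode.
raw = re.compile(r"""\n[ \t]*   # newline plus indentation""", re.VERBOSE)

# Non-raw string: Python converts \n to an actual newline character first.
# Verbose mode then discards it as insignificant whitespace, so only
# [ \t]* is left (the tab survives because it sits inside a character class).
cooked = re.compile("""\n[ \t]*   # newline plus indentation""", re.VERBOSE)

raw_result = raw.sub("", sample)
cooked_result = cooked.sub("", sample)

print(repr(raw_result))     # newlines and indentation removed
print(repr(cooked_result))  # indentation removed, newlines kept
```

So the non-raw verbose pattern silently degrades to "strip indentation only", which is consistent with the pattern not joining the lines as intended.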

score 1 · Answer 2 · answered Aug 30 '12 at 21:11

This isn't really what you were asking for, but here's a one-liner for the sake of being a one-liner:

>>> print(re.sub(r'(<trkpt.*?</trkpt>)',
                 lambda m: re.sub(r'>\s*<', '><', m.group(1)),
                 xml, flags=re.DOTALL))

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

Also note that this approach will break if any string attributes contain the string "<trkpt", which probably won't happen, but that's the problem with not using a real parser.

It's a nice manoeuvre, but I'm afraid the extra trickiness makes the code too hard to read. Thanks anyway! — heltonbiker, Aug 30 '12 at 21:18
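
If the nested lambda is the readability concern, the same two-step idea can be spelled out with a named callback. This is only a sketch: the helper name `collapse` and the shortened sample document are made up for illustration.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's data.
xml = """
<trkseg>
  <trkpt lon="1" lat="2">
    <time>2012-08-25T10:20:44Z</time>
    <ele>0</ele>
  </trkpt>
</trkseg>
"""

def collapse(match):
    """Remove whitespace between tags inside one matched <trkpt> block."""
    return re.sub(r">\s*<", "><", match.group(0))

# The outer pattern needs DOTALL so that "." can cross line breaks.
# It is passed by keyword because the fourth positional slot is `count`.
result = re.sub(r"<trkpt.*?</trkpt>", collapse, xml, flags=re.DOTALL)
print(result)
```

It is still two substitutions under the hood, but each one has a single job, and whitespace outside the <trkpt> blocks is left alone.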

score 1 · Answer 3 · answered Aug 30 '12 at 21:19

Do you want to keep the <trkseg>? If so, this could work for you:

print(re.sub(r'([^gt])>\s*<', r'\g<1>><', xml))

This removes all whitespace between elements, provided that the preceding tag does not end with t or g.

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

answered Aug 30 '12 at 21:19

user711413

It works! Could you explain how/why this works? Which regex CONCEPTS are used here? I couldn't identify where in the command it is specific for ...? – heltonbiker Aug 30 '12 at 21:23
The [^tg] bit matches any character that is not t or g. It is between parentheses, so that whatever character it matches can be added by the \g<1> in the replace string. – user711413 Aug 30 '12 at 21:29
Yeah, but in other files there could still be tags ending in `g` or `t` that would spoil the trick... :o( – heltonbiker Aug 30 '12 at 21:30
Indeed. If there are many files with different tags, then you will need to use additional knowledge, like how many elements do you have in the … pair, whether you have more nesting or not… – user711413 Aug 30 '12 at 21:37
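
The limitation discussed in these comments can be seen in a small sketch. The two sample strings and the `<sat>` element are chosen purely for illustration and are not taken from the question's files:

```python
import re

# Same pattern as in the answer: capture the character before ">",
# but only if it is neither "g" nor "t".
pattern = re.compile(r"([^gt])>\s*<")

# Works: none of the inner tags ends in "g" or "t".
ok = '<trkpt lat="1">\n  <ele>0</ele>\n</trkpt>'
ok_result = pattern.sub(r"\g<1>><", ok)

# Breaks: the inner tag <sat> ends in "t", so the whitespace after
# </sat> is left in place.
bad = '<trkpt lat="1">\n  <sat>5</sat>\n</trkpt>'
bad_result = pattern.sub(r"\g<1>><", bad)

print(repr(ok_result))
print(repr(bad_result))
```

In other words, the character class encodes knowledge about which tag names happen to occur in this particular file, not about the <trkpt> structure itself.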

score 1 · Answer 4 · answered Aug 30 '12 at 21:25

Another one-liner is

print(re.sub("(<trkpt.+?>).*?(<time>.+?</time>).*?(<ele>.+?</ele>).*?(</trkpt>)",
             r'\1\2\3\4', xml, flags=re.DOTALL))

produces

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

This has the advantage of being easy to change for other tags.

Unfortunately, I wouldn't like to depend on the specific orders of the tags inside ... — heltonbiker, Aug 30 '12 at 21:28
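
One detail worth keeping in mind when adapting any of the snippets in this thread: the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag placed in the fourth positional slot is read as `count`. A minimal sketch of the difference, using a made-up two-line string:

```python
import re

text = "a\nb"

# A bare re.DOTALL in the fourth position would be taken as a replacement
# limit (int(re.DOTALL) == 16); the flag itself would never be applied.
# Written with keywords, that call is equivalent to:
as_count = re.sub("a.b", "X", text, count=int(re.DOTALL))

# What is usually intended: let "." match the newline as well.
as_flag = re.sub("a.b", "X", text, flags=re.DOTALL)

print(repr(as_count))  # unchanged, "." did not match the newline
print(repr(as_flag))   # the whole string was replaced
```

For patterns that contain no `.` at all, such as `>\s*<`, the flag makes no difference either way and can simply be dropped.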
