I have some well-behaved XML files that I want to reformat (NOT PARSE!) using regex. The goal is to put every <trkpt>...</trkpt> pair on a single line.

The following code works, but I'd like to do it in a single regex substitution instead of a loop, so that I don't need to concatenate the strings back together.

import re

xml = """
    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
    </trkseg>
"""

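for trkpt in re.findall(r'<trkpt.*?</trkpt>', xml, re.DOTALL):
    # No flag is needed here: the pattern contains no '.', and the fourth
    # positional argument of re.sub is count, not flags.
    print(re.sub(r'>\s*<', '><', trkpt))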

An answer using sed would also be welcome.

Thanks for reading

heltonbiker

4 Answers

How about this:

>>> regex = re.compile(
    r"""\n[ \t]*  # Match a newline plus following whitespace
    (?=           # only if... 
     (?:          # ...the following can be matched:
      (?!<trkpt)  #  (unless an opening <trkpt> tag occurs first)
      .           #  any character
     )*           # any number of times,
     </trkpt>     # followed by a closing </trkpt> tag
    )             # End of lookahead""", 
    re.DOTALL | re.VERBOSE)
>>> print(regex.sub("", xml))

    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
    </trkseg>
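
For readers who want to try the lookahead approach as a script rather than at the prompt, here is a minimal sketch. It assumes Python 3 and a shortened copy of the sample data; the name `JOIN_INSIDE_TRKPT` is made up for illustration, and the pattern is the verbose one above with the comments stripped.

```python
import re

xml = """
    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
    </trkseg>
"""

# A newline plus its indentation is removed only when a closing </trkpt>
# can be reached without first passing an opening <trkpt.
JOIN_INSIDE_TRKPT = re.compile(r"\n[ \t]*(?=(?:(?!<trkpt).)*</trkpt>)", re.DOTALL)

result = JOIN_INSIDE_TRKPT.sub("", xml)
print(result)
```

Because the newline in front of each opening `<trkpt` is never matched, the indentation of the joined lines should be left as it was.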
Tim Pietzcker
  • It didn't work for me the first time. I had to strip the comments from the multiline pattern and drop the `re.VERBOSE` flag, like this: `regex = re.compile('\n[ \t]*(?=(?:(?!<trkpt).)*</trkpt>)', re.DOTALL)`. Then it worked, but it "ate" the indentation (not a big problem, I plan to pretty-print the result anyway). – heltonbiker Aug 30 '12 at 22:50
  • @heltonbiker: Sorry, I had forgotten the `r` prefix for the string when I changed my regex into a verbose one. Now it should work correctly. Sorry for not answering sooner but it was past midnight here when you wrote your comment and I only read it just now. – Tim Pietzcker Aug 31 '12 at 06:34

This isn't really what you were asking for, but here's a one-liner for the sake of being a one-liner:

>>> print(re.sub(r'(<trkpt.*?</trkpt>)',
                 lambda m: re.sub(r'>\s*<', '><', m.group(1)),
                 xml, flags=re.DOTALL))

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

Also note that this approach will break if any string attributes contain the string "<trkpt", which probably won't happen, but that's the problem with not using a real parser.
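
The same nested substitution may be easier to read with precompiled patterns and a named callback. This is only a sketch, assuming Python 3; `GAP`, `TRKPT` and `collapse` are names invented here. Flags are given at compile time because the fourth positional parameter of `re.sub` is `count`, not `flags`.

```python
import re

xml = """
    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
    </trkseg>
"""

GAP = re.compile(r">\s*<")                            # whitespace between adjacent tags
TRKPT = re.compile(r"<trkpt.*?</trkpt>", re.DOTALL)   # one whole <trkpt> element


def collapse(match):
    """Remove inter-tag whitespace inside a single matched <trkpt> element."""
    return GAP.sub("><", match.group(0))


# The outer substitution finds each element; the callback rewrites it.
result = TRKPT.sub(collapse, xml)
print(result)
```

The caveat above still applies: an attribute value containing "<trkpt" or "</trkpt>" would confuse the outer pattern.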

Danica

Do you want to keep the <trkseg>? If so, this could work for you:

print(re.sub(r'([^gt])>\s*<', r'\g<1>><', xml))

This removes all whitespace between elements, provided that the character before the closing > of the preceding tag is not t or g (so the gaps after <trkseg> and </trkpt> are kept).

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
user711413
  • It works! Could you explain how/why this works? Which regex CONCEPTS are used here? I couldn't identify where in the command it is specific for ...? – heltonbiker Aug 30 '12 at 21:23
  • The `[^tg]` bit matches any character that is not t or g. It is between parentheses, so that whatever character it matches can be added back by the `\g<1>` in the replacement string. – user711413 Aug 30 '12 at 21:29
  • Yeah, but other files could still contain tags ending in `g` or `t` that would spoil the trick... :o( – heltonbiker Aug 30 '12 at 21:30
  • Indeed. If there are many files with different tags, then you will need to use additional knowledge, such as how many elements each pair contains and whether there is deeper nesting… – user711413 Aug 30 '12 at 21:37
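
The limitation raised in these comments can be reproduced with a child tag whose name ends in t. In the sketch below (Python 3 assumed), the `<sat>` element is an assumption added purely for illustration; it is not part of the original data.

```python
import re

trkpt = """<trkpt lon="-51.2220657617" lat="-30.1072524581">
  <time>2012-08-25T10:20:44Z</time>
  <sat>7</sat>
  <ele>0</ele>
</trkpt>"""

# Whitespace between tags is removed only when the character before '>'
# is neither 'g' nor 't', so the gap after </sat> is left alone.
result = re.sub(r"([^gt])>\s*<", r"\g<1>><", trkpt)
print(result)
```

The element stays split across two lines, because the line break after `</sat>` is never matched.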

Another one-liner is

print(re.sub(r"(<trkpt.+?>).*?(<time>.+?</time>).*?(<ele>.+?</ele>).*?(</trkpt>)",
             r'\1\2\3\4', xml, flags=re.DOTALL))

produces

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

This has the advantage of being easy to change for other tags.
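
One way to make the "easy to change" point concrete is to build the pattern from a list of tag names. The helpers `one_line_pattern` and `join_elements` below are hypothetical names, and the sketch (Python 3) is a stricter variation rather than the exact pattern above: it assumes the children appear in the given order, carry no attributes, contain only text, and are separated by nothing but whitespace.

```python
import re


def one_line_pattern(parent, children):
    """Capture the parent's opening tag, each child element, and the closing tag."""
    parts = [r"(<%s\b[^>]*>)" % re.escape(parent)]
    parts += [r"(<{0}>[^<]*</{0}>)".format(re.escape(c)) for c in children]
    parts.append(r"(</%s>)" % re.escape(parent))
    # Only whitespace is allowed between the captured pieces.
    return re.compile(r"\s*".join(parts))


def join_elements(xml, parent, children):
    """Rewrite every matching parent element as a single line."""
    pattern = one_line_pattern(parent, children)
    n_groups = len(children) + 2
    replacement = "".join(r"\g<%d>" % i for i in range(1, n_groups + 1))
    return pattern.sub(replacement, xml)


xml = """
    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
    </trkseg>
"""

result = join_elements(xml, "trkpt", ["time", "ele"])
print(result)
```

An element that does not fit these assumptions is simply left untouched, which seems a safer failure mode than silently dropping whatever lies between the captured groups.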

BrtH