
I created a regular expression to match URLs like /places/:state/:city/whatever:

p = re.compile('^/places/(?P<state>[^/]+)/(?P<city>[^/]+).*$')

This works just fine:

import re

p = re.compile('^/places/(?P<state>[^/]+)/(?P<city>[^/]+).*$')
path = '/places/NY/NY/other/stuff'
match = p.match(path)
print(match.groupdict())

Prints {'city': 'NY', 'state': 'NY'}.

How can I process a log file to replace /places/NY/NY/other/stuff with the string "/places/:state/:city/other/stuff"? I'd like to count how many URLs are of the "cities" type, without caring that the place is (NY, NY) specifically.

The simple approach can fail:

import re

p = re.compile('^/places/(?P<state>[^/]+)/(?P<city>[^/]+).*$')
path = '/places/NY/NY/other/stuff'
match = p.match(path)
if match:
    groupdict = match.groupdict()
    # Sorting the items orders them by group name, not by position in the pattern
    for k, v in sorted(groupdict.items()):
        path = path.replace(v, ':' + k, 1)
print(path)

This prints /places/:city/:state/other/stuff, which is backwards!

It feels like there should be some way to use re.sub, but I can't see it.

Moses Koledoye
Rob Crowell
  • You've sorted the dict, so `city` comes before `state` during the replacement – Moses Koledoye Aug 02 '16 at 01:24
  • @MosesKoledoye is the value returned by `groupdict()` guaranteed to be sorted in the same order as the matches (or any particular order at all)? It seems to be just a built-in `dict`. – Rob Crowell Aug 02 '16 at 01:42
  • Yes, it's more or less the builtin `dict`. The ordering of the items in the dict will not reflect the order of the matches. – Moses Koledoye Aug 02 '16 at 02:05
  • Using re.findall you can get the captures in the right order and in re.sub you can replace the text with back references to the captured parts. – Wiktor Stribiżew Aug 02 '16 at 05:51
  • @WiktorStribiżew while that is true, unfortunately it doesn't give me the group name along with the matches, so I'd have to store that separately from the regex itself. – Rob Crowell Aug 02 '16 at 17:45

1 Answer


I figured out a better way to do this. A compiled regular expression has a groupindex attribute, which maps each named group to its group number, and that number reflects the group's position in the pattern string:

>>> p = re.compile('^/places/(?P<state>[^/]+)/(?P<city>[^/]+).*$')
>>> p.groupindex
{'city': 2, 'state': 1}

Sorting the items by group number iterates them in pattern order:

>>> sorted(p.groupindex.items(), key=lambda x: x[1])
[('state', 1), ('city', 2)]
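
If only the names are needed, a slightly shorter variant is to sort the keys by their group number. This is a minimal sketch; the variable name `names` is just for illustration, and it relies only on groupindex behaving like a mapping with a `get` method:

```python
import re

p = re.compile(r'^/places/(?P<state>[^/]+)/(?P<city>[^/]+).*$')

# Iterating groupindex yields the group names; sorting them by their
# group number gives the order in which they appear in the pattern.
names = sorted(p.groupindex, key=p.groupindex.get)
print(names)
```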

Using this, I can apply the replacements in the groups' left-to-right order:

import re

p = re.compile(r'^/places/(?P<state>[^/]+)/(?P<city>[^/]+).*$')
path = '/places/NY/NY/other/stuff'
match = p.match(path)
if match:
    groupdict = match.groupdict()
    # Visit the group names in the order they appear in the pattern
    for k, _ in sorted(p.groupindex.items(), key=lambda x: x[1]):
        # Replace the first remaining occurrence of the matched value
        path = path.replace(groupdict[k], ':' + k, 1)
print(path)

This loops over the groups in pattern order, so the replacements are applied left to right, and for this path the result is the expected string:

/places/:state/:city/other/stuff
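
One caveat: `str.replace` substitutes the first occurrence of the value anywhere in the path, so a value that also appears earlier in the string (for example a state literally named "places") would be replaced in the wrong spot. A possible way around that is to splice on the match's own offsets instead. The following is only a sketch: the helper name `normalize` and the sample paths are hypothetical, and it assumes the named groups do not nest or overlap.

```python
import re
from collections import Counter

p = re.compile(r'^/places/(?P<state>[^/]+)/(?P<city>[^/]+).*$')

def normalize(path, pattern=p):
    """Replace each named group's matched text with ':<group name>'."""
    match = pattern.match(path)
    if not match:
        return path
    # Go right to left so that splicing one span does not shift the
    # offsets of the spans that are still to be replaced.
    names = sorted(pattern.groupindex, key=pattern.groupindex.get, reverse=True)
    for name in names:
        start, end = match.span(name)
        if start == -1:  # group did not take part in the match
            continue
        path = path[:start] + ':' + name + path[end:]
    return path

# Hypothetical sample paths standing in for lines from the log file
sample_paths = [
    '/places/NY/NY/other/stuff',
    '/places/places/NY/other/stuff',
    '/about',
]
counts = Counter(normalize(path) for path in sample_paths)
print(counts)
```

Both /places/... paths collapse to the same key here, so the counter reports how many URLs share the "cities" shape regardless of the actual state and city.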
Rob Crowell