Regex Python / group quantifiers

Question

I want to match a list of variables which look like directories, e.g.:

Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123
Same/Same2/Battery/Name=SomeString
Same/Same2/Home/Land/Some/More/Stuff=0.34

The length of the "subdirectories" is variable having an upper bound (above it's 9). I want to group every subdirectory except the 1st one which I named "Same" above.

The best I could come up with is:

^(?:([^/]+)/){4,8}([^/]+)=(.*)

It already looks for 4-8 subdirectories but only groups the last one. Why's that? Is there a better solution using group quantifiers?

Edit: Solved. Will use split() instead.

Any reason you can't use `split` and other regular Python string management functions instead of using regular expressions? — Helgi, Jul 24 '11 at 12:38
The list contains thousands of these directories which are to be parsed into variable names after a specific convention, e.g. Same/Same2/Battery/Name=SomeString becoming SAME2_BATTERY_NAME=SomeString. Is there a better way than regex? — runDOSrun, Jul 24 '11 at 12:39
Hm. Thanks, never thought of split. Makes it a lot easier and faster! :) — runDOSrun, Jul 24 '11 at 12:44
btw, your solution doesn't work because with repeating group patterns, only the last match is returned — , Jul 24 '11 at 12:52
so, just our of curiosity - there's no (whatsoever obvious) solution using python regex? — runDOSrun, Jul 24 '11 at 12:55
@malch Yes there is, see my solution. But I wonder what advantage we gain to use a regex instead of split() as Andrew White does. Maybe the fact that it can be more easily adapted to either Windows or Linux platform by using os.sep — eyquem, Jul 24 '11 at 12:59
@malch, I removed the "(SOLVED)" from the title of your question. Please accept the answer that resolved the issue for you by clicking the "V"-like shaped icon next to it. — Bart Kiers, Jul 24 '11 at 13:22

score 2 · Answer 1 · 2011-07-25T13:40:02.920

2

I probably misunderstood what exactly you want to do, but here is how you would do it without regex:

for entry in list_of_vars:
    key, value = entry.split('=')
    key_components = key.split('/')
    if 4 <= len(key_components) <= 8:
        # here the actual work is done
        print "%s=%s" % ('_'.join(key_components[1:]).upper(), value)

edited Jul 25 '11 at 13:40

answered Jul 24 '11 at 12:47

@hop I don't see the interest to split according to '=' to obtain the key value that is then splitted. Function split() has a default argument that can be set to 1 to split only on the first '/' – eyquem Jul 24 '11 at 15:45

eyquem · Accepted Answer · 2011-07-24T22:01:57.920

import re

regx = re.compile('(?:(?<=\A)|(?<=/)).+?(?=/|\Z)')


for ss in ('Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123',
           'Same/Same2/Battery/Name=SomeString',
           'Same/Same2/Home/Land/Some/More/Stuff=0.34'):

    print ss
    print regx.findall(ss)
    print

Edit 1

Now you have given more info on what you want to obtain ( _"Same/Same2/Battery/Name=SomeString becoming SAME2_BATTERY_NAME=SomeString"_ ) better solutions can be proposed: either with a regex or with split() , + replace()

import re
from os import sep

sep2 = r'\\' if sep=='\\' else '/'

pat = '^(?:.+?%s)(.+$)' % sep2
print 'pat==%s\n' % pat

ragx = re.compile(pat)

for ss in ('Same\Same2\Foot\Ankle\Joint\Actuator\Sensor\Temperature\Value=4.123',
           'Same\Same2\Battery\Name=SomeString',
           'Same\Same2\Home\Land\Some\More\Stuff=0.34'):

    print ss
    print ragx.match(ss).group(1).replace(sep,'_')
    print ss.split(sep,1)[1].replace(sep,'_')
    print

result

pat==^(?:.+?\\)(.+$)

Same\Same2\Foot\Ankle\Joint\Actuator\Sensor\Temperature\Value=4.123
Same2_Foot_Ankle_Joint_Actuator_Sensor_Temperature_Value=4.123
Same2_Foot_Ankle_Joint_Actuator_Sensor_Temperature_Value=4.123

Same\Same2\Battery\Name=SomeString
Same2_Battery_Name=SomeString
Same2_Battery_Name=SomeString

Same\Same2\Home\Land\Some\More\Stuff=0.34
Same2_Home_Land_Some_More_Stuff=0.34
Same2_Home_Land_Some_More_Stuff=0.34

Edit 2

Re-reading your comment, I realized that I didn't take in account that you want to upper the part of the strings that lies before the '=' sign but not after it.

Hence, this new code that exposes 3 methods that answer this requirement. You will choose which one you prefer:

import re

from os import sep
sep2 = r'\\' if sep=='\\' else '/'



pot = '^(?:.+?%s)(.+?)=([^=]*$)' % sep2
print 'pot==%s\n' % pot
rogx = re.compile(pot)

pet = '^(?:.+?%s)(.+?(?==[^=]*$))' % sep2
print 'pet==%s\n' % pet
regx = re.compile(pet)


for ss in ('Same\Same2\Foot\Ankle\Joint\Sensor\Value=4.123',
           'Same\Same2\Battery\Name=SomeString',
           'Same\Same2\Ocean\Atlantic\North=',
           'Same\Same2\Maths\Addition\\2+2=4\Simple=ohoh'):
    print ss + '\n' + len(ss)*'-'

    print 'rogx groups  '.rjust(32),rogx.match(ss).groups()

    a,b = ss.split(sep,1)[1].rsplit('=',1)
    print 'split split  '.rjust(32),(a,b)
    print 'split split join upper replace   %s=%s' % (a.replace(sep,'_').upper(),b)

    print 'regx split group  '.rjust(32),regx.match(ss.split(sep,1)[1]).group()
    print 'regx split sub  '.rjust(32),\
          regx.sub(lambda x: x.group(1).replace(sep,'_').upper(), ss)
    print

result, on a Windows platform

pot==^(?:.+?\\)(.+?)=([^=]*$)

pet==^(?:.+?\\)(.+?(?==[^=]*$))

Same\Same2\Foot\Ankle\Joint\Sensor\Value=4.123
----------------------------------------------
                   rogx groups   ('Same2\\Foot\\Ankle\\Joint\\Sensor\\Value', '4.123')
                   split split   ('Same2\\Foot\\Ankle\\Joint\\Sensor\\Value', '4.123')
split split join upper replace   SAME2_FOOT_ANKLE_JOINT_SENSOR_VALUE=4.123
              regx split group   Same2\Foot\Ankle\Joint\Sensor\Value
                regx split sub   SAME2_FOOT_ANKLE_JOINT_SENSOR_VALUE=4.123

Same\Same2\Battery\Name=SomeString
----------------------------------
                   rogx groups   ('Same2\\Battery\\Name', 'SomeString')
                   split split   ('Same2\\Battery\\Name', 'SomeString')
split split join upper replace   SAME2_BATTERY_NAME=SomeString
              regx split group   Same2\Battery\Name
                regx split sub   SAME2_BATTERY_NAME=SomeString

Same\Same2\Ocean\Atlantic\North=
--------------------------------
                   rogx groups   ('Same2\\Ocean\\Atlantic\\North', '')
                   split split   ('Same2\\Ocean\\Atlantic\\North', '')
split split join upper replace   SAME2_OCEAN_ATLANTIC_NORTH=
              regx split group   Same2\Ocean\Atlantic\North
                regx split sub   SAME2_OCEAN_ATLANTIC_NORTH=

Same\Same2\Maths\Addition\2+2=4\Simple=ohoh
-------------------------------------------
                   rogx groups   ('Same2\\Maths\\Addition\\2+2=4\\Simple', 'ohoh')
                   split split   ('Same2\\Maths\\Addition\\2+2=4\\Simple', 'ohoh')
split split join upper replace   SAME2_MATHS_ADDITION_2+2=4_SIMPLE=ohoh
              regx split group   Same2\Maths\Addition\2+2=4\Simple
                regx split sub   SAME2_MATHS_ADDITION_2+2=4_SIMPLE=ohoh

Awesome. Thx for offering a solution which shows both approaches so elegantly! — runDOSrun, Jul 24 '11 at 17:38

score 1 · Answer 3 · answered Jul 24 '11 at 12:45

1

Just use split?

>>> p='Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123'
>>> p.split('/')
['Same', 'Same2', 'Foot', 'Ankle', 'Joint', 'Actuator', 'Sensor', 'Temperature', 'Value=4.123']

Also, if you want that key/val pair you can do something like this...

>>> s = p.split('/')
>>> s[-1].split('=')
['Value', '4.123']

answered Jul 24 '11 at 12:45

Andrew White

52,720
19
113
137

I agree with this approach. The final list of variables should however be `p.split('/')[1:]`, as to leave out the initial `Same` value. – mac Jul 24 '11 at 12:49

PaulMcG · Answer 4 · 2011-07-25T00:47:58.140

A couple of variations on your theme. For one, I've always found regexen to be cryptic to the point of unmaintainable, so I wrote the pyparsing module. In my mind, I look at your code and think, "oh, it's a list of '/'-delimited strings, an '=' sign, and then some kind of rvalue." And that translates pretty directly into the pyparsing parser definition code. By adding a name here and there in the parser ("key" and "value", similar to named groups in regex), the output is pretty easily processed.

data="""\
Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123
Same/Same2/Battery/Name=SomeString
Same/Same2/Home/Land/Some/More/Stuff=0.34""".splitlines()

from pyparsing import Word, alphas, alphanums, Word, nums, QuotedString, delimitedList

wd = Word(alphas, alphanums)
number = Word(nums+'+-', nums+'.').setParseAction(lambda t:float(t[0]))
rvalue = wd | number | QuotedString('"')

defn = delimitedList(wd, '/')('key') + '=' + rvalue('value')

for d in data:
    result = defn.parseString(d)

Second, I question your approach at defining all of those variable names - creating variable names on the fly based on your data is a pretty well-recognized Code Smell (not necessarily bad, but you might really want to rethink this approach). I used a recursive defaultdict to create a navigable structure so that you can easily do operations like "find all the entries that are sub-elements of "Same2" (in this case, "Foot", "Battery", and "Home") - this kind of work is more difficult when trying to sift through some collection of variable names as found in locals(), it seems to me you will end up re-parsing these names to reconstruct the key hierarchy.

from collections import defaultdict

class recursivedefaultdict(defaultdict):
    def __init__(self, attrFactory=int):
        self.default_factory = lambda : type(self)(attrFactory)
        self._attrFactory = attrFactory
    def __getattr__(self, attr):
        newval = self._attrFactory()
        setattr(self, attr, newval)
        return newval

table = recursivedefaultdict()

# parse each entry, and accumulate into hierarchical dict
for d in data:
    # use pyparsing parser, gives us key (list of names) and value
    result = defn.parseString(d)
    t = table
    for k in result.key[:-1]:
        t = t[k]
    t[result.key[-1]] = result.value


# recursive method to iterate over hierarchical dict    
def showTable(t, indent=''):
    for k,v in t.items():
        print indent+k,
        if isinstance(v,dict):
            print
            showTable(v, indent+'  ')
        else:
            print v

showTable(table)

Prints:

Same
  Same2
    Foot
      Ankle
        Joint
          Actuator
            Sensor
              Temperature
                Value 4.123
    Battery
      Name SomeString
    Home
      Land
        Some
          More
            Stuff 0.34

If you are really set on defining those variable names, then adding some helpful parse actions to pyparsing will reformat the parsed data at parse time, so that it's directly processable afterwards:

wd = Word(alphas, alphanums)
number = Word(nums+'+-', nums+'.').setParseAction(lambda t:float(t[0]))
rvaluewd = wd.copy().setParseAction(lambda t: '"%s"' % t[0])
rvalue = rvaluewd | number | QuotedString('"')

defn = delimitedList(wd, '/')('key') + '=' + rvalue('value')

def joinNamesWithAllCaps(tokens):
    tokens["key"] = '_'.join(map(str.upper, tokens.key))
defn.setParseAction(joinNamesWithAllCaps)

for d in data:
    result = defn.parseString(d)
    print result.key,'=', result.value

Prints:

SAME_SAME2_FOOT_ANKLE_JOINT_ACTUATOR_SENSOR_TEMPERATURE_VALUE = 4.123
SAME_SAME2_BATTERY_NAME = "SomeString"
SAME_SAME2_HOME_LAND_SOME_MORE_STUFF = 0.34

(Note that this also encloses your SomeString value in quotes, so that the resulting assignment statement is valid Python.)

Regex Python / group quantifiers

4 Answers4

Edit 1

Edit 2