Regular expression that takes <...> as one item in "foo bar and so on" (Goal: Simple music/lilypond parsing)

Question

I am using the re module in Python 3 and want to use re.sub(regex, replace, string) to turn a string in the following format

"foo <bar e word> f ga <foo b>"

to

"#foo <bar e word> #f #ga <foo b>"

or even

"#foo #<bar e word> #f #ga #<foo b>"

But I can't find a way to match single words at word boundaries without also matching the words inside a <...> construct.

Help would be nice!

P.S 1

The whole story is a musical one: I have strings in the Lilypond format (or better, a subset of the very simple core format, just notes and durations) and want to convert them to Python pairs of int(duration), list(of pitch strings). Performance is not important, so I can convert them back and forth, iterate over Python lists, split strings and join them again, etc. But I did not find an answer to the problem above.

Source String

"c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"

should result in

[
(4, ["c'"]),
(8, ["d"]),
(16, ["e'", "g'"]),
(4, ["fis'"]),
(0, ["a,,"]),
(0, ["g", "b'"]),
(1, ["c''"]),
]

The basic format is string + number, like so: e4 bes16

the string consists of one or more [a-zA-Z] chars
the string is followed by zero or more digits: e bes g4 c16
the string is followed by zero or more ' or , (not combined): e' bes, f'''2 g,,4
the string can be substituted by a list of strings, list limiters are <>; the number comes behind the >, no space allowed

P.S. 2

The goal is NOT to create a Lilypond parser. It is really just for very short snippets with no additional functionality, no extensions to insert notes. If this does not work, I would go for another (simplified) format like ABC. So anything that has to do with Lilypond ("Run it through lilypond, let it output the music data in Scheme, parse that") or its toolchain is certainly NOT the answer to this question. The package is not even installed.

PaulMcG · Answer 1 · 2013-02-11T19:16:46.763

I know you are not looking for a general parser, but pyparsing makes this process very simple. Your format seemed very similar to the chemical formula parser that I wrote as one of the earliest pyparsing examples.

Here is your problem implemented using pyparsing:

from pyparsing import (Suppress,Word,alphas,nums,Combine,Optional,Regex,Group,
                       OneOrMore)

"""
List item
 -the string can consist of multiple, at least one, [a-zA-Z] chars
 -the string is followed by zero or more digits: e bes g4 c16
 -the string is followed by zero or more ' or , (not combined): 
  e' bes, f'''2 g,,4
 -the string can be substituted by a list of strings, list limiters are <>;
  the number comes behind the >, no space allowed
"""

LT,GT = map(Suppress,"<>")

integer = Word(nums).setParseAction(lambda t:int(t[0]))

note = Combine(Word(alphas) + Optional(Word(',') | Word("'")))
# or equivalent using Regex class
# note = Regex(r"[a-zA-Z]+('+|,+)?")

# define the list format of one or more notes within '<>'s
note_list = Group(LT + OneOrMore(note) + GT)

# each item is a note_list or a note, optionally followed by an integer; if
# no integer is given, default to 0
item = (note_list | Group(note)) + Optional(integer, default=0)

# reformat the parsed data as a (number, note_or_note_list) tuple
item.setParseAction(lambda t: (t[1],t[0].asList()) )

source = "c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
# print() call form, since the question targets Python 3
print(OneOrMore(item).parseString(source))

With this output:

[(4, ["c'"]), (8, ['d']), (16, ["e'", "g'"]), (4, ["fis'"]), (0, ['a,,']), 
 (0, ['g,', "b'"]), (1, ["c''"])]

this is very good and clean. Pyparsing would add an additional dependency, though. For a really small snippet helper function this is too much. — nilshi, Feb 11 '13 at 18:24

Justin O Barber · Accepted Answer · 2013-02-10T21:01:39.253

Your first question can be answered in this way:

>>> import re
>>> t = "foo <bar e word> f ga <foo b>"
>>> t2 = re.sub(r"(^|\s+)(?![^<>]*?>)", " #", t).lstrip()
>>> t2
'#foo #<bar e word> #f #ga #<foo b>'

I added lstrip() to remove the single space that occurs before the result of this pattern. If you want to go with your first option, you could simply replace #< with <.

Your second question can be solved in the following manner, although you might need to think about the , in a list like ['g,', "b'"]. Should the comma from your string be there or not? There may be a faster way; the following is merely a proof of concept. A list comprehension might replace the final loop, although it would be fairly complicated.

>>> s = "c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
>>> q2 = re.compile(r"(?:<)\s*[^>]*\s*(?:>)\d*|(?<!<)[^\d\s<>]+\d+|(?<!<)[^\d\s<>]+")
>>> s2 = q2.findall(s)
>>> s3 = [re.sub(r"\s*[><]\s*", '', x) for x in s2]
>>> s4 = [y.split() if ' ' in y else y for y in s3]
>>> s4
["c'4", 'd8', ["e'", "g'16"], "fis'4", 'a,,', ['g,', "b'"], "c''1"]
>>> q3 = re.compile(r"([^\d]+)(\d*)")
>>> s = []
>>> for item in s4:
    if isinstance(item, list):
        lis = []
        for elem in item:
            m = q3.search(elem)
            lis.append(m.group(1))
        # the duration, if present, is attached to the last element
        num = m.group(2)
        s.append((num if num != '' else 0, lis))
    else:
        m = q3.search(item)
        # fall back to 0 when the note has no duration
        s.append((m.group(2) if m.group(2) != '' else 0, [m.group(1)]))


>>> s
[('4', ["c'"]), ('8', ['d']), ('16', ["e'", "g'"]), ('4', ["fis'"]), (0, ['a,,']), (0, ['g,', "b'"]), ('1', ["c''"])]
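
The durations above come out as strings. If integers are wanted, as in the question, the same idea can be condensed into a single pattern. This is only a sketch under the format rules stated in the question: TOKEN and parse_notes are names invented for this example, chords are assumed not to nest, and the comma in 'g,' is kept because it is present in the source string.

```python
import re

# One alternative for a <...> chord and one for a single note; each may
# be followed directly by a duration.
TOKEN = re.compile(r"<([^<>]*)>(\d*)|([a-zA-Z]+(?:'+|,+)?)(\d*)")

def parse_notes(source):
    """Return a list of (duration, pitches) pairs; duration is 0 if absent."""
    result = []
    for m in TOKEN.finditer(source):
        if m.group(3) is None:
            # Chord alternative matched: split the contents on whitespace.
            pitches = m.group(1).split()
            digits = m.group(2)
        else:
            # Single-note alternative matched.
            pitches = [m.group(3)]
            digits = m.group(4)
        result.append((int(digits) if digits else 0, pitches))
    return result

parsed = parse_notes("c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1")
print(parsed)
```

Text that matches neither alternative is skipped silently, so this does not validate the input.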

I updated this answer with a quick and dirty solution to your first postscript. — Justin O Barber, Feb 10 '13 at 20:54
