11

I would like to split a Python multiline string at its commas, except when the commas are inside a bracketed expression. E.g., the string

{J. Doe, R. Starr}, {Lorem
{i}psum dolor }, Dol. sit., am. et.

should be split into

['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']

This involves bracket matching, so regexes are probably not much help here. PyParsing has commaSeparatedList, which almost does what I need, except that it protects quoted (") environments instead of {}-delimited ones.

Any hints?

Nico Schlömer
  • AFAIK Python doesn't support recursion in regexes. Just for reference, this would [do the job](http://regex101.com/r/qD4zV8/1) with PCRE: `(?'braces'\{(?:[^{}]++|\g<braces>)*\})(*SKIP)(*FAIL)|,` – Lucas Trzesniewski Nov 07 '14 at 19:50
  • This is no trivial thing you ask ... regexes are not helpful because you require a state machine with memory in order to match enclosing items (brackets, quotes, etc.) – Joran Beasley Nov 07 '14 at 19:50
  • It can't be done without recursive regex (ones that do recursion). I thought Python has a newer version that does this now. Funny how Perl comes from Python, Perl leaves it in the dust. –  Nov 07 '14 at 20:23
  • @sln: What does "Perl comes from Python" mean? Perl was already around when Guido started thinking about Python, was in widespread use long before most people even heard of Python, and was an influence in Python's development through the 1.x/early-2.x days. Python's `re` engine, in particular, is directly based on Perl's. And I'm not sure that being able to spend exponential time on a regexp without warning counts as "leaves it in the dust"… – abarnert Nov 07 '14 at 20:28
  • @abarnert - I think Perl may have been around but Perl adopted much functionality, which was first I don't know. I briefly thought I read some details on Python re beta site, seems quite a few new things are coming about using the available syntax constructs. –  Nov 07 '14 at 23:35
  • @sln: Perl was first. Guido says that it had an influence on Python, although mostly a negative one (in other words, he did a few things the opposite of perl, on purpose), but the regex syntax in particular is directly based on Perl's. Larry Wall, meanwhile, says he never seriously looked at Python until after Perl 5. So, the idea that Perl "adopted much functionality" from Python is just as wrong as the idea that "Perl comes from Python". – abarnert Nov 07 '14 at 23:54
  • @sln: Also, I have no idea what you mean by "beta site", but the Python 3.5 [pre-alpha docs](https://docs.python.org/3.5/) and [release schedule PEP](http://legacy.python.org/dev/peps/pep-0478/) have nothing about regex, and there's nothing on -ideas or -dev. There is a [`regex`](https://pypi.python.org/pypi/regex/) library that's been underway for a few years and may at some point replace `re`, but as you can see, it doesn't have any major extensions to the syntax; in fact, one of the goals is to make it easier to detect exponential backtracking, not to make it easier to do accidentally. – abarnert Nov 07 '14 at 23:56
  • @abarnert - I'm going to have to check that (when I get a chance). I parse all regex constructs for most engines. I think I was recently looking to add Python, went searching for it, thought I found that re beta page that showed some quite exotic constructs. I was impressed with the complexity, but deferred since it wasn't released. –  Nov 08 '14 at 00:09

3 Answers

17

Write your own custom split-function:

 input_string = """{J. Doe, R. Starr}, {Lorem
 {i}psum dolor }, Dol. sit., am. et."""


 expected = ['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']

 def split(s):
     parts = []
     bracket_level = 0
     current = []
     # Appending a trailing comma flushes the last field without a special case.
     for c in (s + ","):
         if c == "," and bracket_level == 0:
             # Top-level comma: finish the field, dropping surrounding whitespace.
             parts.append("".join(current).strip())
             current = []
         else:
             if c == "{":
                 bracket_level += 1
             elif c == "}":
                 bracket_level -= 1
             current.append(c)
     return parts

 assert split(input_string) == expected
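
A possible variant of the same counting idea, sketched under stated assumptions: the helper name `split_top_level` and its `sep`, `opener`, and `closer` parameters are made up for illustration, and raising `ValueError` on unbalanced braces is only one way to handle input such as a stray `}`. It slices the original string instead of collecting characters and strips whitespace around each field.

```python
def split_top_level(text, sep=",", opener="{", closer="}"):
    """Split text at sep, ignoring separators nested inside opener/closer pairs.

    Hypothetical helper for illustration only; raises ValueError when the
    braces are unbalanced (an assumed policy, other choices are possible).
    """
    parts = []
    depth = 0   # current nesting level of opener/closer pairs
    start = 0   # index where the current field begins
    for i, ch in enumerate(text):
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched {closer!r} at index {i}")
        elif ch == sep and depth == 0:
            # Top-level separator: emit the field that just ended.
            parts.append(text[start:i].strip())
            start = i + 1
    if depth != 0:
        raise ValueError(f"{depth} unclosed {opener!r}")
    parts.append(text[start:].strip())
    return parts


sample = "{J. Doe, R. Starr}, {Lorem\n{i}psum dolor }, Dol. sit., am. et."
print(split_top_level(sample))
# ['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']
```

With this policy, an input containing a lone `}` (for example a `:-}` smiley) is rejected rather than silently mis-split; whether that is the right behavior depends on the data.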
deets
  • nice work ... :) this is the only current correct answer afaik – Joran Beasley Nov 07 '14 at 19:56
  • Good, but the assumption with this implementation is there are no "{" or "}" characters in the string which may not be part of a grouping. i.e. ":-}" If that possibility ever exists, there would need to be some consideration made for how to deal with it. – Marcel Wilson May 04 '15 at 20:17
8

You can use re.split in this case:

>>> from re import split
>>> data = '''\
... {J. Doe, R. Starr}, {Lorem
... {i}psum dolor }, Dol. sit., am. et.'''
>>> split(r',\s*(?![^{}]*\})', data)
['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']
>>>

Below is an explanation of what the Regex pattern matches:

,       # Matches ,
\s*     # Matches zero or more whitespace characters
(?!     # Starts a negative look-ahead assertion
[^{}]*  # Matches zero or more characters that are not { or }
\}      # Matches }
)       # Closes the look-ahead assertion
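
As a hedged illustration of the limits of this pattern: the look-ahead only asks whether a `}` comes before the next `{` or `}`, so it cannot track nesting depth. The sketch below uses a made-up input (`nested`, not taken from the question) in which a comma sits between an outer `{` and an inner `{`; the pattern treats that comma as a separator.

```python
import re

# Same pattern as above, written as a raw string.
pattern = re.compile(r',\s*(?![^{}]*\})')

# Hypothetical input: a top-level group containing a nested group and commas.
nested = "{a, {b}, c}, d"

result = pattern.split(nested)
print(result)  # ['{a', '{b}, c}', 'd']
```

The desired result would be `['{a, {b}, c}', 'd']`, so for input like this a depth-counting approach is the safer choice.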
  • Won't this fail for slightly more complicated examples of nested brackets? E.g. `"{J. Doe, R. Starr {x,{y}}}, {Lorem {i}psum dolor }, Dol. sit., am. et."`? – Alex Riley Nov 07 '14 at 20:13
  • @ajcr - Yes, it will fail. But that's why I said "in this case". The pattern I gave isn't bulletproof and can only handle simple strings. Specifically, it is meant for strings where there are no nested curly braces with commas, as in the OP's example. However, if the OP is working with strings as complex as that, it would be better to ditch Regex and build a parser instead. –  Nov 07 '14 at 20:16
  • I don't think you can generalize this to be a solution on any level. –  Nov 07 '14 at 20:21
  • If you're willing to accept a quick hack that doesn't work on more complex cases, why not go for the simplest? You don't really need to handle matched pairs of open and closed braces; just treat both open braces and closed braces as equivalent alternative "quote" characters, and skip any commas inside "quotes" the same way that PyParsing or `csv` or whatever does? – abarnert Nov 07 '14 at 20:37
3

Lucas Trzesniewski's comment can actually be used in Python with the PyPI regex module (I just replaced the named group with a numbered one to make it shorter):

>>> import regex
>>> r = regex.compile(r'({(?:[^{}]++|\g<1>)*})(*SKIP)(*FAIL)|\s*,\s*')
>>> s = """{J. Doe, R. Starr}, {Lorem
{i}psum dolor }, Dol. sit., am. et."""
>>> print(r.split(s))
['{J. Doe, R. Starr}', None, '{Lorem\n{i}psum dolor }', None, 'Dol. sit.', None, 'am. et.']

The first alternative, ({(?:[^{}]++|\g<1>)*})(*SKIP)(*FAIL), matches balanced, possibly nested {...{...{}...}...} structures: { matches a literal opening brace, (?:[^{}]++|\g<1>)* matches zero or more occurrences of either (1) one or more characters other than { and } (the [^{}]++ part) or (2) a recursive match of the whole group 1 subpattern, ({(?:[^{}]++|\g<1>)*}), and the final } matches the closing brace. The (*SKIP)(*FAIL) verbs then make the engine discard that match and resume scanning at the position where it ended, so nothing inside a balanced brace group can be matched as a separator (we "skip" what we matched).

The second alternative, \s*,\s*, matches a comma together with any surrounding whitespace.

The None values appear because split also returns the contents of capture groups, and the group in the first branch does not participate when the second branch matches. The capture group in the first alternative is needed for the recursion. To remove the None entries, use a list comprehension:

>>> print([x for x in r.split(s) if x])
['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']
Wiktor Stribiżew