
I have a string that can look something like this:

1. "foo bar"
2. "foo bar foo:bar"
3. "foo bar "
4. "foo bar      "
5. "foo bar foo:bar:baz"

I want to split this string so that it would end up with the following results:

1. ['foo', 'bar']
2. ['foo', 'bar', 'foo', ':', 'bar']
3. / 4. ['foo', 'bar', '']
5. ['foo', 'bar', 'foo', ':', 'bar', ':', 'baz']

In other words, following these rules:

  1. Split the string on every occurrence of a space.

    a. If there are one or more spaces at the end of the string, add exactly one empty string to the end of the split list.

    b. Any spaces before the last non-space character in the string should be consumed, and not added to the split list.

  2. Split the string on every occurrence of a colon, and do not consume the colon.

The XY problem is this, in case it's relevant:

I want to mimic Bash tab-completion behaviour. When you type a command into a Bash interpreter, it will split the command into an array COMP_WORDS, and it will follow the above rules - splitting the words based on spaces and colons, with colons placed into their own array element, and spaces ignored unless they're at the end of a string. I want to recreate this behaviour in Python, given a string that looks like a command that a user would type.

I've seen this question about splitting a string and keeping the separators using re.split. And this question about splitting using multiple delimiters. But my use case is more complicated, and neither question seems to cover it. I tried the following to at least split on spaces and colons:

print(re.split('(:)|(?: )', splitstr))

But even that doesn't work. When splitstr is "foo bar foo:bar", it returns this:

['foo', None, 'bar', None, 'foo', ':', 'bar']

Any idea how this could be done in Python?

EDIT: My requirements weren't clear - I would want "foo bar " (with any number of spaces at the end) to return the list ["foo", "bar", ""] (with just one empty string at the end of the list.)

Lou
  • What if the string is `"foo bar      "`, i.e. many spaces at the end? – Nick Jan 13 '21 at 11:45
  • maybe look at https://stackoverflow.com/questions/8387924/python-argparse-and-bash-completion – buran Jan 13 '21 at 11:48
  • Maybe use `re.findall(r'[^:\s]+|:|(?<!\S)(?!\S)', text)`? See https://ideone.com/qttWe8 – Wiktor Stribiżew Jan 13 '21 at 11:48
  • @Nick - then I still want to just get one space, as in `['foo', 'bar', '']` – Lou Jan 13 '21 at 11:49
  • @Lou that's not a space, it's an empty string at the end of that list – Nick Jan 13 '21 at 11:50
  • Could you just not filter out the `None` in a list comprehension? `[x for x in result_from_re_split if x is not None]` – GuillemB Jan 13 '21 at 11:50
  • @buran - Thanks for the link. Argcomplete is okay, but not what I'm looking for in this situation (I've tried it, and it's just overkill). – Lou Jan 13 '21 at 11:51
  • @Nick - Of course you're right. My mistake. So what I want is an empty string at the end of the list if there are one or more spaces at the end of the string. – Lou Jan 13 '21 at 11:53
  • @GuillemB - That works too, good suggestion! – Lou Jan 13 '21 at 11:57
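For reference, the `None` entries in the `re.split` attempt come from the capturing group: `re.split` includes every capturing group in its result, and a group that did not take part in a match is reported as `None`. Below is a minimal sketch of the filtering idea from the comments. The name `split_keep_colons` is hypothetical, and the sketch deliberately ignores the trailing-space rule, so it only covers inputs without spaces at the end.

```python
import re

def split_keep_colons(text):
    # Hypothetical helper: split on spaces and colons, keeping only the colons.
    # The group (:) is None whenever the separator that matched was a space.
    pieces = re.split(r'(:)| ', text)
    # Drop the None placeholders and the empty strings that appear
    # between adjacent separators (e.g. two spaces in a row).
    return [p for p in pieces if p]

print(split_keep_colons("foo bar foo:bar"))
```

Because the filter removes every empty string, a trailing space does not produce the final `''` that the question asks for; that case still needs separate handling.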

3 Answers


There is no need to use regular expressions for this task. String methods work just as well, and might be more readable.

def split_comp(s: str) -> 'list[str]':
    trailing = s.endswith(' ')
    s = s.replace(':', ' : ')  # insert split marks before/after every colon
    parts = s.split()
    return parts if not trailing else [*parts, '']  # one empty string for trailing spaces

This technique works for any set of delimiters: pick one delimiter to split on, then replace every delimiter you want to discard with it, and pad every delimiter you want to keep with it on both sides.
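As a rough illustration of that generalisation, here is a sketch that takes the delimiters as parameters. The function name `split_keep` and its `keep`/`drop` parameters are made up for this example, and it assumes every delimiter is a single character other than a space.

```python
def split_keep(text, keep=(':',), drop=()):
    """Split on whitespace; keep `keep` delimiters as tokens, discard `drop` delimiters."""
    # Remember trailing spaces before the string is modified.
    trailing = text.endswith(' ')
    for d in drop:
        # Delimiters to discard simply become the split delimiter.
        text = text.replace(d, ' ')
    for d in keep:
        # Delimiters to keep are padded so they end up as their own token.
        text = text.replace(d, f' {d} ')
    parts = text.split()
    return [*parts, ''] if trailing else parts

print(split_keep("foo bar foo:bar:baz"))
print(split_keep("a=b,c ", keep=('=',), drop=(',',)))
```

With the default arguments this behaves like a colon-only splitter; the second call keeps `=` as a token, discards `,`, and appends one empty string for the trailing space.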

MisterMiyagi
  • Could you explain what the syntax in the function definition is doing? I've never seen this before `(s: str) -> 'list[str]'` – Lou Jan 13 '21 at 12:22
  • @Lou Those are type annotations to show the expected input/output types. They serve as documentation and allow type checkers to verify code. See [What are type hints in Python 3.5?](https://stackoverflow.com/questions/32557920/what-are-type-hints-in-python-3-5) and [What are variable annotations?](https://stackoverflow.com/questions/39971929/what-are-variable-annotations). – MisterMiyagi Jan 13 '21 at 12:23
  • Thanks, this solution works really neatly! The only other thing I don't get is why you have an asterisk before `parts`. It looks like an args and kwargs type thing, but I've only seen args used in function definitions like `def foo(*a, **b)` etc. – Lou Jan 13 '21 at 16:31
  • Ah never mind, I think I get it. The asterisk in `[*parts, ' ']` loads the elements of the list into a list, rather than the list itself - without the asterisk, you'd create a nested list. This is a really neat solution. I've accepted it because it's readable and concise. – Lou Jan 13 '21 at 16:37
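To make the unpacking point from these comments concrete, here is a small sketch comparing the starred form with the unstarred one. The variable names are only illustrative.

```python
parts = ['foo', 'bar']

unpacked = [*parts, '']      # the elements of parts are copied into the new list
nested = [parts, '']         # without the star, the list itself becomes one element
concatenated = parts + ['']  # list concatenation gives the same result as unpacking

print(unpacked)      # ['foo', 'bar', '']
print(nested)        # [['foo', 'bar'], '']
print(concatenated)  # ['foo', 'bar', '']
```

Unpacking inside a list display also accepts any iterable, not only lists, which is one reason to prefer it over `+` in some cases.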

You can use a re.findall approach here with:

[^:\s]+|:|(?<=\S)(?=\s+$)

See the regex demo. Details:

  • [^:\s]+ - one or more chars other than whitespace and :
  • | - or
  • : - a colon
  • | - or
  • (?<=\S)(?=\s+$) - any empty string that is located between a non-whitespace and one or more whitespaces at the end of string.

See the Python demo.

import re
l = ['foo bar', 'foo bar foo:bar', 'foo bar ', 'foo     bar     ']
rx = re.compile(r'[^:\s]+|:|(?<=\S)(?=\s+$)')
for s in l:
    if s.rstrip() != s:
        s = s.rstrip() + " "
    print(f"'{s}'", '=>', rx.findall(s))

Output:

'foo bar' => ['foo', 'bar']
'foo bar foo:bar' => ['foo', 'bar', 'foo', ':', 'bar']
'foo bar ' => ['foo', 'bar', '']
'foo     bar ' => ['foo', 'bar', '']
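As a sketch of how this could be packaged, the pattern can be wrapped in a function and run against all five inputs from the question. The names `TOKEN_RX` and `comp_words` are hypothetical. The wrapper skips the `rstrip` normalisation from the demo loop, on the reasoning that the lookbehind `(?<=\S)` can only succeed directly after the last non-whitespace character, so at most one empty string is produced however many spaces follow.

```python
import re

# Same pattern as in the answer: a word, a colon, or the empty position
# between the last non-whitespace character and trailing whitespace.
TOKEN_RX = re.compile(r'[^:\s]+|:|(?<=\S)(?=\s+$)')

def comp_words(line):
    # Hypothetical wrapper: return the tokens found in the line.
    return TOKEN_RX.findall(line)

tests = ["foo bar", "foo bar foo:bar", "foo bar ", "foo bar      ", "foo bar foo:bar:baz"]
for line in tests:
    print(repr(line), '=>', comp_words(line))
```

One consequence of the lookbehind is that a string made up only of spaces yields an empty list rather than `['']`; whether that matches Bash is not covered by the question.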
Wiktor Stribiżew

Maybe there are shorter ways, but here is my suggestion:

def func(s):
    if s.endswith(' '):  # unlike s[-1], this does not raise on an empty string
        l=s.split()+['']
    else:
        l=s.split()
    def f(l):
        m=l.copy()
        res=[]
        for i in m:
            if i!=':' and ':' in i:
                temp=[i[:i.find(':')]]+[':']+[i[i.find(':')+1:]]
                res.extend(temp)
            else:
                res.append(i)
        return res
    while any(i!=':' and ':' in i for i in l):
        l=f(l)
    return l

Examples:

>>> func("foo bar")
['foo', 'bar']

>>> func("foo bar foo:bar")
['foo', 'bar', 'foo', ':', 'bar']

>>> func("foo bar ")
['foo', 'bar', '']

>>> func("foo bar      ")
['foo', 'bar', '']
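A more compact variant of the same idea (split on whitespace first, then break up each word at its colons) can be sketched with `str.partition`. This is not the code from the answer; `split_words` is a hypothetical name, and skipping the empty pieces around a leading, trailing, or doubled colon is an assumption rather than verified Bash behaviour.

```python
def split_words(s):
    # Hypothetical variant: whitespace split, then peel colons off each word.
    result = []
    for word in s.split():
        # Take one "head:" prefix at a time until no colon is left.
        while ':' in word:
            head, colon, word = word.partition(':')
            if head:
                result.append(head)
            result.append(colon)
        # Whatever remains after the last colon (may be empty).
        if word:
            result.append(word)
    # One empty string if the input ends with a space.
    if s.endswith(' '):
        result.append('')
    return result

print(split_words("foo bar foo:bar:baz"))
```

Since each word is handled in a single pass, there is no need to re-scan the whole list until no colon-containing word remains.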
IoaTzimas
  • Hi, thanks for the answer. I've tested it and it does work for the four cases you've shown, although unfortunately it doesn't work for contiguous strings with more than one colon in them, e.g.: `func("foo foo:bar:baz")` returns `['foo', 'foo', '', ':', '', 'bar', ':', 'baz']` – Lou Jan 13 '21 at 15:01
  • I made a change; it should work now. The problem came from a single ':' left over after the first split, which produced extra '' entries. Those are ignored now. – IoaTzimas Jan 13 '21 at 15:12